Getting Started

Eä is a compute kernel compiler. You write small, focused compute functions, compile them to native code (.so/.dll), and call them from Python, Rust, C++, or PyTorch via C ABI.

No runtime, no GC, no standard library. Just kernels.

Quick start

pip install ea-compiler

Write a kernel (scale.ea):

export func scale(src: *f32, dst: *mut f32, factor: f32, n: i32) {
    let mut i: i32 = 0
    while i < n {
        dst[i] = src[i] * factor
        i = i + 1
    }
}

Call it from Python:

import ea
import numpy as np

kernel = ea.load("scale.ea")
src = np.random.randn(1_000_000).astype(np.float32)
dst = np.empty_like(src)
kernel.scale(src, dst, factor=2.0)

ea.load() compiles your kernel to a native shared library, caches it, and gives you a callable Python function. No Rust, no LLVM, no build step.

Installation

pip install ea-compiler

This gives you the ea compiler and the Python ea.load() API. No other dependencies needed (besides NumPy).

Works on:

  • Linux x86_64
  • Linux aarch64 (ARM)
  • Windows x86_64

Verify installation

import ea
print(ea.__version__)           # e.g., "1.7.0"
print(ea.compiler_version())    # same, from the bundled binary

Building from source

For development or unsupported platforms, see the eacompute README for instructions on building the compiler from source. This requires Rust and LLVM 18.

Your First Kernel

Let's write a kernel that scales an array by a constant factor, then upgrade it to use SIMD.

Step 1: Scalar version

Create scale.ea:

export func scale(src: *f32, dst: *mut f32, factor: f32, n: i32) {
    let mut i: i32 = 0
    while i < n {
        dst[i] = src[i] * factor
        i = i + 1
    }
}

Key things to notice:

  • export makes the function callable from Python (C ABI)
  • *f32 is an immutable pointer to float32 — your input array
  • *mut f32 is a mutable pointer — your output array
  • n: i32 is the array length — the caller provides this
  • All types are explicit. No inference, no ambiguity.

Step 2: Call from Python

import ea
import numpy as np

kernel = ea.load("scale.ea")

src = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
dst = np.empty_like(src)
kernel.scale(src, dst, factor=2.0)
print(dst)  # [2. 4. 6. 8.]

ea.load() compiles scale.ea the first time, then caches the result in __eacache__/. Subsequent calls load from cache instantly.

Step 3: SIMD version

Now let's process 8 floats at a time. Add this second function to scale.ea (Step 4 loads both functions from that file):

export func scale_simd(src: *f32, dst: *mut f32, factor: f32, n: i32) {
    let s: f32x8 = splat(factor)
    let mut i: i32 = 0
    while i < n {
        let v: f32x8 = load(src, i)
        store(dst, i, v .* s)
        i = i + 8
    }
}

What changed:

  • f32x8 — a vector of 8 floats (256-bit, uses AVX2 on x86)
  • splat(factor) — broadcasts the scalar to all 8 lanes
  • load(src, i) — loads 8 consecutive floats starting at index i
  • v .* s — element-wise multiply (the dot prefix . means "vector operation")
  • store(dst, i, ...) — writes 8 floats back
  • We increment by 8, not 1

Important: This only works when n is a multiple of 8. For arbitrary lengths, see Kernels for tail-handling strategies.
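One caller-side way to satisfy the multiple-of-8 requirement is to round the buffer length up and zero-pad. The sketch below is plain Python and only illustrates the arithmetic; padded_length and pad_to_step are hypothetical helper names, not part of the ea package.

```python
def padded_length(n: int, step: int = 8) -> int:
    """Round n up to the next multiple of step (hypothetical helper)."""
    return (n + step - 1) // step * step

def pad_to_step(values: list[float], step: int = 8) -> list[float]:
    """Return a copy of values, zero-padded so its length is a multiple of step."""
    return values + [0.0] * (padded_length(len(values), step) - len(values))

padded = pad_to_step([1.0, 2.0, 3.0, 4.0, 5.0])
print(len(padded))  # 8
```

With this approach only the first n outputs are meaningful; the padded lanes are scratch space.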

Step 4: Compare performance

import ea
import numpy as np
import time

kernel = ea.load("scale.ea")
src = np.random.randn(10_000_000).astype(np.float32)
dst = np.empty_like(src)

# Eä scalar
start = time.perf_counter()
for _ in range(100):
    kernel.scale(src, dst, factor=2.0)
ea_scalar = (time.perf_counter() - start) / 100

# Eä SIMD
start = time.perf_counter()
for _ in range(100):
    kernel.scale_simd(src, dst, factor=2.0)
ea_simd = (time.perf_counter() - start) / 100

# NumPy
start = time.perf_counter()
for _ in range(100):
    np.multiply(src, 2.0, out=dst)
numpy_time = (time.perf_counter() - start) / 100

print(f"Eä scalar: {ea_scalar*1000:.2f} ms")
print(f"Eä SIMD:   {ea_simd*1000:.2f} ms")
print(f"NumPy:     {numpy_time*1000:.2f} ms")

For this simple operation, all three will be similar — it's bandwidth-bound (one operation per element loaded). Eä shines on compute-bound workloads where there are multiple operations per element. See the Cookbook for real-world comparisons.
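A rough way to see why the three timings converge is to estimate the time needed just to move the data. The sketch below is a back-of-the-envelope calculation; min_time_ms is a hypothetical helper, and the 20 GB/s bandwidth is an assumed, illustrative figure rather than a measurement.

```python
def min_time_ms(n: int, bytes_per_element: int, bandwidth_gb_s: float) -> float:
    """Lower bound on runtime when memory traffic is the only cost."""
    total_bytes = n * bytes_per_element
    return total_bytes / (bandwidth_gb_s * 1e9) * 1e3

# scale reads one f32 (4 bytes) and writes one f32 (4 bytes) per element.
# 20 GB/s is an assumed bandwidth; substitute a figure for your machine.
estimate = min_time_ms(10_000_000, bytes_per_element=8, bandwidth_gb_s=20.0)
print(f"{estimate:.1f} ms")  # 4.0 ms
```

No amount of SIMD lowers this bound, which is why a kernel with one multiply per element cannot pull ahead of NumPy.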

Why Eä

What Eä is

Eä is a compute kernel compiler. You write small, focused numerical routines in Eä's explicit syntax, compile them to native shared libraries (.so on Linux, .dll on Windows), and call them from Python, Rust, C++, or PyTorch via C ABI. Eä is not a general-purpose programming language. It has no standard library, no garbage collector, no runtime. It compiles kernels.

The problem

Python is slow for tight numerical loops. When you need to run a custom algorithm over millions of elements -- a stencil, a reduction with custom logic, a chain of fused multiply-adds -- you hit a wall. The usual options are:

  • NumPy: fast for operations it supports, but custom multi-step algorithms require chaining calls that allocate intermediates and re-scan memory.
  • Numba: JIT-compiles Python, but debugging is difficult, compilation is unpredictable, and SIMD vectorization is implicit (you hope the compiler figures it out).
  • Cython: requires learning a hybrid language, managing build systems, and still gives you limited control over vectorization.
  • Writing C extensions: full control, but high friction. Manual Python/C bridging, header files, build scripts.

Eä targets the gap: you want native-speed kernels with explicit SIMD, but you do not want to maintain a C build system.

The philosophy

Eä is built on one principle: explicit over implicit.

  • If you write f32x8, you get 8-wide SIMD. The compiler will not silently fall back to scalar code.
  • If hardware does not support an operation (e.g., scatter without AVX-512), the compiler errors. It does not emit slow scalar code behind your back.
  • All types are explicit. No type inference. You see exactly what is happening.
  • All memory is caller-provided. Eä kernels never allocate. Pointers come from the host language.

There are no hidden performance cliffs. The code you write is the code that runs.

When to use Eä

Eä excels at compute-bound workloads where you do significant work per element loaded from memory:

  • Stencil operations: convolutions, blurs, edge detection -- each output element reads from multiple inputs.
  • Fused multiply-add chains: polynomial evaluation, IIR filters, dot products.
  • Custom reductions: computing statistics, finding patterns, accumulating results with non-trivial logic.
  • Particle simulations: N-body interactions, force calculations.
  • Image processing pipelines: per-pixel math with multiple operations fused into one pass.

The common thread: you load data, do many arithmetic operations on it, then store results. The CPU spends most of its time computing, not waiting for memory.

When NOT to use Eä

Eä is the wrong tool for:

  • Bandwidth-bound workloads: if you are just adding two arrays element-wise, NumPy already saturates memory bandwidth. Eä cannot make memory faster.
  • General programming: no strings, no file I/O, no networking, no data structures beyond structs.
  • Prototyping: write your algorithm in Python first, profile it, then port the hot loop to Eä.
  • GPU workloads: Eä targets CPUs (x86-64 with AVX2/AVX-512, AArch64 with NEON).

A good rule of thumb: if your inner loop does fewer than 4 arithmetic operations per element loaded, NumPy is probably fast enough.

The compilation model

kernel.ea  -->  ea --lib  -->  kernel.so + kernel.ea.json
                                   |
                          ea bind --python
                                   |
                              kernel.py  (generated wrapper)

One .ea file is one compilation unit. There are no imports, no modules. If you need to compose kernels, you do it at the C level -- each kernel is an independent shared library with a C ABI entry point.

The generated .ea.json metadata file describes the function signatures. The ea bind command reads it to generate idiomatic wrappers for your target language.

Quick taste

Here is a complete Eä kernel that scales an array of floats using 8-wide SIMD:

export kernel vscale(data: *f32, out: *mut f32, factor: f32)
    over i in n step 8
    tail scalar {
        out[i] = data[i] * factor
    }
{
    let vf: f32x8 = splat(factor)
    store(out, i, load(data, i) .* vf)
}

Compile and use from Python:

ea kernel.ea --lib
ea bind kernel.ea --python

import numpy as np
from kernel import vscale

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0], dtype=np.float32)
result = vscale(data, factor=3.0)
# result: [3.0, 6.0, 9.0, 12.0, 15.0]

The main body processes 8 elements at a time with SIMD. The tail scalar block handles any remainder elements one at a time. The n parameter (the array length) is automatically injected into the function signature from the over i in n clause.

Language Basics

This page covers Eä's scalar language features. For SIMD vector types and operations, see SIMD.

Scalar types

Eä has fixed-size numeric types. No type inference -- you always write the type explicitly.

Type                 Description
i8, i16, i32, i64    Signed integers
u8, u16, u32, u64    Unsigned integers
f32, f64             Floating point
bool                 Boolean (true / false)

Variables

All variables must have an explicit type annotation. Variables are immutable by default.

let x: i32 = 5
let y: f32 = 3.14
let flag: bool = true

To make a variable mutable, use mut:

let mut counter: i32 = 0
counter = counter + 1

Constants

Compile-time constants use const:

const PI: f32 = 3.14159
const BATCH_SIZE: i32 = 256
const EPSILON: f64 = 1e-9

Constants can be used in static_assert for compile-time checks:

const STEP: i32 = 8
static_assert(STEP > 0, "step must be positive")

Arithmetic and comparison

Standard arithmetic operators work on all numeric types:

let a: i32 = 10 + 3    // 13
let b: i32 = 10 - 3    // 7
let c: i32 = 10 * 3    // 30
let d: i32 = 10 / 3    // 3 (integer division)
let e: i32 = 10 % 3    // 1 (remainder)

Comparison operators return bool:

let lt: bool = a < b
let gt: bool = a > b
let le: bool = a <= b
let ge: bool = a >= b
let eq: bool = a == b
let ne: bool = a != b

Logical operators

Eä uses words, not symbols, for logical operations:

let both: bool = a > 0 and b > 0
let either: bool = a > 0 or b > 0
let neither: bool = not (a > 0 or b > 0)

Control flow

if / else if / else

if x > 0 {
    println(1)
} else if x == 0 {
    println(0)
} else {
    println(-1)
}

while loops

let mut i: i32 = 0
while i < n {
    out[i] = data[i] * 2
    i = i + 1
}

for loops

Counted loops with an explicit step:

for i in 0..n step 1 {
    out[i] = data[i] * 2
}

The step is required. The range 0..n is half-open: it includes 0 but excludes n.
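For readers coming from Python, the half-open range can be compared with the built-in range. This is only an analogy: ea_for_indices is a hypothetical name, and for steps larger than 1 it assumes the loop keeps running while i < n.

```python
def ea_for_indices(n: int, step: int) -> list[int]:
    """Indices visited by `for i in 0..n step S`, assuming a half-open range."""
    return list(range(0, n, step))

print(ea_for_indices(5, 1))   # [0, 1, 2, 3, 4]
print(ea_for_indices(10, 4))  # [0, 4, 8]
```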

foreach loops

A simpler counted loop when the step is always 1:

foreach (i in 0..n) {
    out[i] = data[i] * 2
}

Loop unrolling

Wrap a loop in unroll(N) to unroll it at compile time:

unroll(4) {
    foreach (j in 0..4) {
        out[base + j] = data[base + j] * factor
    }
}

Functions

Functions are declared with func. All parameter and return types are explicit:

func square(x: f32) -> f32 {
    return x * x
}

func add(a: i32, b: i32) -> i32 {
    return a + b
}

Functions without a return type return nothing (void):

func fill(out: *mut i32, n: i32, val: i32) {
    foreach (i in 0..n) {
        out[i] = val
    }
}

Exported functions

To make a function callable from C/Python/Rust, prefix it with export:

export func dot_product(a: *f32, b: *f32, n: i32) -> f32 {
    let mut sum: f32 = 0.0
    foreach (i in 0..n) {
        sum = sum + a[i] * b[i]
    }
    return sum
}

Only export func (and export kernel) produce symbols visible from outside. Non-exported functions are internal helpers.

Pointers

Pointers are how kernels receive data from the host language. There are four pointer variants:

Syntax            Meaning
*T                Read-only pointer
*mut T            Mutable pointer (can write through it)
*restrict T       Read-only, no aliasing (enables optimizations)
*restrict mut T   Mutable, no aliasing

Pointer indexing

Read from a pointer with bracket indexing:

let val: f32 = data[i]    // data is *f32

Write through a mutable pointer:

out[i] = val              // out is *mut f32

Type casts

Explicit casts convert between numeric types:

let x: i32 = 42
let f: f32 = to_f32(x)       // 42.0
let d: f64 = to_f64(x)       // 42.0
let back: i32 = to_i32(f)    // 42
let wide: i64 = to_i64(x)    // 42

There are no implicit conversions. Mixing types without a cast is a compile error.

println

println is the only output primitive. It exists for debugging:

println(42)
println(3.14)
println(true)
println("hello")

It accepts integers, floats, bools, and string literals. It does not support format strings.

What does not exist

Eä is deliberately minimal. The following features do not exist and are not planned:

  • No generics or templates
  • No traits or interfaces
  • No modules or imports
  • No heap allocation
  • No strings (except literal arguments to println)
  • No semicolons (statements are newline-separated)
  • No closures or lambdas
  • No enums or pattern matching
  • No exceptions or error handling

One file, one compilation unit. Compose at the C level.

SIMD

SIMD (Single Instruction, Multiple Data) lets you process multiple values in a single CPU instruction. Eä gives you direct control over SIMD vectors -- what you write is what the CPU executes.

Vector types

A vector type holds a fixed number of elements of the same scalar type. The name encodes both the element type and the lane count:

Type     Elements   Bits   x86 requirement   ARM requirement
f32x4    4 x f32    128    SSE               NEON
i32x4    4 x i32    128    SSE               NEON
u8x16    16 x u8    128    SSE               NEON
i8x16    16 x i8    128    SSE               NEON
i16x8    8 x i16    128    SSE               NEON
f64x2    2 x f64    128    SSE2              NEON
f32x8    8 x f32    256    AVX2              not available
i32x8    8 x i32    256    AVX2              not available
u8x32    32 x u8    256    AVX2              not available
f64x4    4 x f64    256    AVX2              not available
f32x16   16 x f32   512    AVX-512           not available

If you use f32x8 on a machine without AVX2, or on ARM, the compiler will error. No silent scalar fallback.
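Because the type name encodes the element type and lane count, the total width (and therefore the width class in the table) can be derived mechanically. The following is an illustrative Python sketch; vector_width_bits is a hypothetical helper and not something the ea package provides.

```python
import re

# Element sizes in bits for the scalar types named in this guide.
_ELEMENT_BITS = {"i8": 8, "u8": 8, "i16": 16, "u16": 16,
                 "i32": 32, "u32": 32, "i64": 64, "u64": 64,
                 "f32": 32, "f64": 64}

def vector_width_bits(type_name: str) -> int:
    """Total width in bits of a vector type name such as 'f32x8'."""
    match = re.fullmatch(r"([iuf]\d+)x(\d+)", type_name)
    if match is None or match.group(1) not in _ELEMENT_BITS:
        raise ValueError(f"not a vector type name: {type_name!r}")
    return _ELEMENT_BITS[match.group(1)] * int(match.group(2))

print(vector_width_bits("f32x8"))   # 256
print(vector_width_bits("u8x16"))   # 128
print(vector_width_bits("f32x16"))  # 512
```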

To enable 512-bit vectors, compile with --avx512:

ea kernel.ea --lib --avx512

Dot operators

Element-wise vector operations use dot-prefixed operators. This distinguishes them from scalar operations and makes SIMD explicit in the source:

Operator   Meaning
.+         Element-wise add
.-         Element-wise subtract
.*         Element-wise multiply
./         Element-wise divide
.>         Element-wise greater than (returns bool vector)
.<         Element-wise less than
.>=        Element-wise greater or equal
.<=        Element-wise less or equal
.==        Element-wise equal
.!=        Element-wise not equal
.&         Element-wise bitwise AND
.|         Element-wise bitwise OR
.^         Element-wise bitwise XOR
.<<        Element-wise shift left
.>>        Element-wise shift right (logical for unsigned, arithmetic for signed)

Example:

let a: f32x4 = load(data, i)
let b: f32x4 = load(data, i + 4)
let sum: f32x4 = a .+ b
let product: f32x4 = a .* b
let mask: f32x4 = a .> b
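The operator table distinguishes logical from arithmetic right shift. The difference only shows when the top bit of a lane is set, which the following plain-Python model of a single 8-bit lane makes visible. shift_right_lane is a hypothetical name used for illustration, and the model assumes the shift amount is smaller than the lane width.

```python
def shift_right_lane(value: int, amount: int, bits: int, signed: bool) -> int:
    """Model one lane of .>> for a bits-wide integer element."""
    mask = (1 << bits) - 1
    raw = value & mask                      # two's-complement bit pattern
    if signed and raw >> (bits - 1):        # sign bit set: arithmetic shift
        raw = (raw >> amount) | (mask & ~(mask >> amount))
        return raw - (1 << bits)            # back to a negative Python int
    return raw >> amount                    # logical shift

print(shift_right_lane(-8, 1, bits=8, signed=True))    # -4
print(shift_right_lane(248, 1, bits=8, signed=False))  # 124
```

Both inputs share the bit pattern 11111000; the signed lane refills the top bit with the sign, while the unsigned lane refills it with zero.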

Creating vectors

splat

Broadcast a scalar value to all lanes:

let factor: f32 = 2.5
let vf: f32x8 = splat(factor)
// vf = [2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5]

Loading from memory

Load a vector from a pointer at an element offset:

let v: f32x8 = load(data, i)     // loads data[i..i+8]

The offset i is in elements, not bytes. load(data, 8) loads elements 8 through 15.
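The element-versus-byte distinction can be checked with a small Python model. The load function here is a stand-in written for this example (it is not the Eä intrinsic), and it uses the standard array module as a float32 buffer.

```python
from array import array

def load(data: array, i: int, lanes: int) -> list:
    """Model of load(ptr, i): lanes consecutive elements starting at element i."""
    return list(data[i:i + lanes])

data = array("f", [float(k) for k in range(16)])  # 16 float32 values
v = load(data, 8, lanes=8)
byte_offset = 8 * data.itemsize                   # element offset 8 -> byte 32

print(v)            # [8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
print(byte_offset)  # 32
```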

There are also typed load intrinsics when you want to be explicit about the element type:

let v: f32x8 = load_f32(data, i)
let v: f32x4 = load_f32x4(data, i)

Storing to memory

Write a vector to a mutable pointer:

store(out, i, result)     // writes result to out[i..i+8]

Element access

Read individual lanes from a vector by index:

let v: f32x4 = load(data, 0)
let first: f32 = v[0]
let second: f32 = v[1]

Conditional selection

select picks lanes from two vectors based on a mask:

let mask: f32x4 = a .> b
let result: f32x4 = select(mask, a, b)
// where a > b, take a; otherwise take b

This compiles to a single blend instruction. There is no branching.
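The per-lane behaviour of select can be modelled in a few lines of Python. This is a reference model for reasoning about results, not the intrinsic itself; lists stand in for vectors and a list of bools stands in for the mask.

```python
def select(mask: list[bool], a: list[float], b: list[float]) -> list[float]:
    """Per-lane choice: take a[k] where mask[k] is true, otherwise b[k]."""
    return [x if m else y for m, x, y in zip(mask, a, b)]

a = [1.0, 5.0, 3.0, 0.0]
b = [2.0, 4.0, 3.0, -1.0]
mask = [x > y for x, y in zip(a, b)]   # models a .> b
print(select(mask, a, b))              # [2.0, 5.0, 3.0, 0.0]
```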

Scalar vs SIMD comparison

Here is the same operation -- scaling an array -- written both ways.

Scalar (processes one element per iteration):

export func scale(data: *f32, out: *mut f32, factor: f32, n: i32) {
    foreach (i in 0..n) {
        out[i] = data[i] * factor
    }
}

SIMD (processes 8 elements per iteration):

export kernel vscale(data: *f32, out: *mut f32, factor: f32)
    over i in n step 8
    tail scalar {
        out[i] = data[i] * factor
    }
{
    let vf: f32x8 = splat(factor)
    store(out, i, load(data, i) .* vf)
}

The SIMD version does 8 multiplications in a single instruction. On a workload that is compute-bound (multiple operations per element), this translates to a proportional speedup.

Hardware targeting

By default, Eä targets AVX2 on x86-64 and NEON on AArch64. You can change this:

# Default (AVX2 on x86)
ea kernel.ea --lib

# Enable AVX-512
ea kernel.ea --lib --avx512

# Target a specific CPU
ea kernel.ea --lib --target=skylake

# Cross-compile for ARM
ea kernel.ea --lib --target-triple=aarch64-unknown-linux-gnu

The compiler rejects code that requires features the target does not have. If you use f32x8 and target ARM, you get a compile error, not slow code.

Kernels

The kernel construct is Eä's main abstraction for writing vectorized loops with automatic tail handling. It is syntactic sugar -- the compiler transforms it into a plain function with a while-loop before any further compilation.

Basic syntax

export kernel name(params)
    over i in n step S
    tail strategy { tail_body }
{
    main_body
}

  • name: the kernel's name, becomes the C ABI symbol
  • params: function parameters (pointers, scalars)
  • over i in n: the loop variable i iterates from 0 up to, but not including, n
  • step S: how many elements the main body processes per iteration
  • tail strategy: how to handle remainder elements when n is not a multiple of S
  • main_body: the code that runs for each full chunk of S elements

The range variable (here n) is automatically injected as an i32 parameter into the function signature. You do not declare it in the parameter list.

How it desugars

A kernel like this:

export kernel scale(data: *f32, out: *mut f32, factor: f32)
    over i in n step 4
    tail scalar { out[i] = data[i] * factor }
{
    out[i] = data[i] * factor
    out[i + 1] = data[i + 1] * factor
    out[i + 2] = data[i + 2] * factor
    out[i + 3] = data[i + 3] * factor
}

becomes equivalent to:

export func scale(data: *f32, out: *mut f32, factor: f32, n: i32) {
    let mut i: i32 = 0
    while i + 4 <= n {
        out[i] = data[i] * factor
        out[i + 1] = data[i + 1] * factor
        out[i + 2] = data[i + 2] * factor
        out[i + 3] = data[i + 3] * factor
        i = i + 4
    }
    // tail: process remainder one at a time
    while i < n {
        out[i] = data[i] * factor
        i = i + 1
    }
}

The generated C signature is void scale(float*, float*, float, int) -- the n parameter appears last.
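The two desugared loops can be mirrored in Python to check which indices reach the main body and which reach the tail. run_kernel is a hypothetical helper for this illustration; it records the value of i at each call rather than doing any arithmetic.

```python
def run_kernel(n, step, main_body, tail_body):
    """Mirror the desugared loops: full chunks first, then the scalar tail."""
    i = 0
    while i + step <= n:
        main_body(i)
        i += step
    while i < n:
        tail_body(i)
        i += 1

main_calls, tail_calls = [], []
run_kernel(10, 4, main_calls.append, tail_calls.append)
print(main_calls)  # [0, 4]
print(tail_calls)  # [8, 9]
```

For n = 10 and step 4, the main body covers elements 0 through 7 in two chunks and the tail covers elements 8 and 9.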

Tail strategies

The tail handles remainder elements when n is not evenly divisible by the step.

tail scalar

Process remainder elements one at a time. The tail body runs in a loop with step 1:

export kernel vscale(data: *f32, out: *mut f32, factor: f32)
    over i in n step 8
    tail scalar {
        out[i] = data[i] * factor
    }
{
    let vf: f32x8 = splat(factor)
    store(out, i, load(data, i) .* vf)
}

The main body uses 8-wide SIMD. The tail body is scalar code that handles 0 to 7 leftover elements. This is the most common tail strategy.

tail mask

Use masked load/store for the remainder. The tail body runs once (not in a loop) and must handle all remaining elements using masked operations:

export kernel vscale(data: *f32, out: *mut f32, factor: f32)
    over i in len step 4
    tail mask {
        let rem: i32 = len - i
        let vf: f32x4 = splat(factor)
        let v: f32x4 = load_masked(data, i, rem)
        store_masked(out, i, v .* vf, rem)
    }
{
    let vf: f32x4 = splat(factor)
    store(out, i, load(data, i) .* vf)
}

The rem variable tells the masked operations how many elements are valid. This avoids the scalar loop entirely but requires masked intrinsics.

tail pad

The caller guarantees the input length is a multiple of the step. No tail body is generated:

export kernel fill(out: *mut i32, val: i32)
    over i in n step 4
    tail pad
{
    out[i] = val
    out[i + 1] = val
    out[i + 2] = val
    out[i + 3] = val
}

This produces the most efficient code but shifts the responsibility to the caller. If n is not a multiple of the step, the kernel will skip the remaining elements.

No tail clause

If you omit the tail clause entirely, the kernel only runs the main body. Remaining elements are not processed:

export kernel double_it(data: *i32, out: *mut i32)
    over i in n step 1
{
    out[i] = data[i] * 2
}

With step 1, there is no remainder, so omitting the tail is safe. With larger steps, you must ensure n is always a multiple of the step, or accept that trailing elements are skipped.

Complete example

A SIMD dot product kernel that handles any input length:

export kernel dot(a: *f32, b: *f32, out: *mut f32)
    over i in n step 8
    tail scalar {
        out[0] = out[0] + a[i] * b[i]
    }
{
    let va: f32x8 = load(a, i)
    let vb: f32x8 = load(b, i)
    let products: f32x8 = va .* vb
    let sum: f32 = reduce_add(products)
    out[0] = out[0] + sum
}

The main body loads 8-wide vectors, multiplies them, reduces to a scalar sum, and accumulates into out[0]. The scalar tail handles any remaining 0-7 elements. Because the kernel only ever adds to out[0], the caller must set out[0] to 0 before the call.
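When testing a kernel like this, a plain-Python reference with the same chunk-then-tail structure is handy for comparing results. The sketch below is such a model under stated assumptions: dot_reference is a hypothetical name, lists stand in for pointers, and Python floats are f64, so results can differ from an f32 kernel in the last bits for general inputs.

```python
def dot_reference(a, b, out, step=8):
    """Plain-Python model of the dot kernel; accumulates into out[0]."""
    n = len(a)
    i = 0
    while i + step <= n:                       # main body: one chunk of 8
        products = [a[i + k] * b[i + k] for k in range(step)]
        out[0] += sum(products)                # models reduce_add
        i += step
    while i < n:                               # tail scalar
        out[0] += a[i] * b[i]
        i += 1

a = [float(k) for k in range(1, 11)]  # 1.0 .. 10.0
b = [2.0] * 10
out = [0.0]                           # the accumulator starts at zero
dot_reference(a, b, out)
print(out[0])  # 110.0
```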

Structs

Structs in Eä are plain data containers with C-compatible memory layout. They have no methods, no constructors, no impl blocks. They exist so you can pass structured data between Eä kernels and host languages.

Defining a struct

struct Particle {
    x: f32,
    y: f32,
    mass: f32,
}

Fields can be any scalar type. The memory layout matches C struct layout, so a Particle in Eä is identical to:

typedef struct { float x; float y; float mass; } Particle;

Creating and accessing

Inside Eä functions, you create a struct with literal syntax and access fields with dot notation:

func main() {
    let p: Particle = Particle { x: 1.0, y: 2.0, mass: 10.0 }
    println(p.x)      // 1
    println(p.mass)    // 10
}

Mutable structs support field assignment:

func main() {
    let mut p: Particle = Particle { x: 0.0, y: 0.0, mass: 1.0 }
    p.x = 3.5
    p.y = 7.0
    println(p.x)    // 3.5
}

Struct pointers

In exported functions, structs are typically passed via pointer from the host language:

struct Point {
    x: f32,
    y: f32,
}

export func get_x(p: *Point) -> f32 {
    return p.x
}

export func set_point(p: *mut Point, nx: f32, ny: f32) {
    p.x = nx
    p.y = ny
}

Read-only pointers (*Point) allow field reads. Mutable pointers (*mut Point) allow field writes.

Arrays of structs

Pointer indexing works with struct arrays. Each element is a full struct:

struct Vec2 {
    x: f32,
    y: f32,
}

export func sum_x(vecs: *Vec2, n: i32) -> f32 {
    let mut total: f32 = 0.0
    let mut i: i32 = 0
    while i < n {
        total = total + vecs[i].x
        i = i + 1
    }
    return total
}

From C, this is called with a pointer to a contiguous array of structs:

typedef struct { float x; float y; } Vec2;
extern float sum_x(const Vec2*, int);

Vec2 vecs[] = { {1.0f, 10.0f}, {2.0f, 20.0f}, {3.0f, 30.0f} };
float result = sum_x(vecs, 3);  // 6.0

Passing from Python

Since Eä structs match C layout, you can use NumPy structured arrays or ctypes:

import numpy as np

particle_dtype = np.dtype([
    ('x', np.float32),
    ('y', np.float32),
    ('mass', np.float32),
])

particles = np.zeros(1000, dtype=particle_dtype)

The generated Python bindings handle the pointer passing automatically.
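For the ctypes route, the layout claim can be checked directly from Python. The sketch below mirrors the Particle struct with ctypes and inspects its size and field offsets. It is a layout check only: whether a generated binding accepts a ctypes array as an argument is not stated in this guide and is not assumed here.

```python
import ctypes

class Particle(ctypes.Structure):
    """ctypes mirror of the Eä Particle struct (three consecutive f32 fields)."""
    _fields_ = [("x", ctypes.c_float),
                ("y", ctypes.c_float),
                ("mass", ctypes.c_float)]

particles = (Particle * 1000)()   # contiguous, zero-initialized array
particles[0].mass = 10.0

print(ctypes.sizeof(Particle))    # 12
print(Particle.y.offset)          # 4
print(Particle.mass.offset)       # 8
```

The 12-byte size with no padding matches the C typedef shown earlier and the itemsize of the NumPy structured dtype above.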

Limitations

  • No methods or impl blocks. Structs are data only.
  • No nested structs.
  • No generics. Write separate struct definitions for each concrete type.
  • Struct fields must be scalar types.

Common Intrinsics

This page covers the most frequently used intrinsics. For the complete list, see the Intrinsics Reference.

splat

Broadcast a scalar to all lanes of a vector:

let factor: f32 = 2.5
let vf: f32x8 = splat(factor)
// vf = [2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5]

Works with all vector types. The return type is inferred from the variable's type annotation.

load / store

Load a vector from a pointer at an element offset. Store writes a vector back:

let v: f32x8 = load(data, i)       // read 8 floats starting at data[i]
store(out, i, v)                    // write 8 floats starting at out[i]

Offsets are in elements, not bytes.

Typed loads

When you want to be explicit about the element type:

let v: f32x8 = load_f32(data, i)
let v: f32x4 = load_f32x4(data, i)

These are equivalent to plain load but make the element type visible in the source.

stream_store

Non-temporal store that bypasses the cache, intended for write-only output. The value may be a vector or a scalar (v1.15.0 added scalar i16/u16/i32/u32/i64/u64). See the Intrinsics Reference for the full alignment contract, ordering contract, and anti-patterns.

stream_store(out, i, result)         // vector
stream_store(out, i, scalar_value)   // scalar (v1.15.0+)

fma

Fused multiply-add: computes a * b + c in a single instruction with a single rounding (more accurate than separate multiply and add):

// Scalar
let result: f32 = fma(a, b, c)

// Vector
let va: f32x8 = load(a, i)
let vb: f32x8 = load(b, i)
let vc: f32x8 = load(c, i)
let result: f32x8 = fma(va, vb, vc)

Works on f32, f64, and all float vector types. Maps to the hardware FMA instruction.
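The effect of a single rounding can be demonstrated without FMA hardware by computing the exact result with rationals and rounding once. The Python sketch below does this in f64; the input values are chosen for the illustration so that the separate multiply loses the low-order term.

```python
from fractions import Fraction

a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# Two roundings: a * b rounds to 1.0 before c is added.
unfused = a * b + c

# One rounding: evaluate a * b + c exactly, then round once to f64.
fused = float(Fraction(a) * Fraction(b) + Fraction(c))

print(unfused)              # 0.0
print(fused == -2.0**-60)   # True
```

The exact product is 1 - 2^-60, which rounds to 1.0 when stored as an f64, so the unfused form loses the entire answer while the single-rounding form keeps it.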

reduce_add

Sum all lanes of a vector down to a scalar:

let v: f32x8 = load(data, i)
let sum: f32 = reduce_add(v)

Works on all integer and float vector types. Useful for dot products, reductions, and histogram accumulation.

reduce_max / reduce_min

Find the maximum or minimum value across all lanes:

let v: f32x4 = load(data, i)
let biggest: f32 = reduce_max(v)
let smallest: f32 = reduce_min(v)

Works on integer and float vector types.

select

Per-lane conditional: where the mask is true, take from a; where false, take from b:

let mask: f32x8 = a .> b
let result: f32x8 = select(mask, a, b)
// element-wise max(a, b)

Compiles to a blend instruction. No branching. This is how you write branchless SIMD code.

sqrt / rsqrt

Square root and reciprocal square root (1/sqrt):

let x: f32 = 16.0
let root: f32 = sqrt(x)       // 4.0

let v: f32x8 = load(data, i)
let roots: f32x8 = sqrt(v)
let inv_roots: f32x8 = rsqrt(v)

rsqrt uses the fast hardware approximation. Works on f32, f64, and float vector types.

min / max

Element-wise minimum and maximum. Works on both scalars and vectors:

// Scalar
let smaller: f32 = min(a, b)
let larger: f32 = max(a, b)

// Vector
let va: f32x8 = load(data, i)
let vb: f32x8 = load(data, j)
let mins: f32x8 = min(va, vb)
let maxs: f32x8 = max(va, vb)

movemask

Extract comparison results to an integer bitmask. x86 only:

let mask: f32x8 = a .> b
let bits: i32 = movemask(mask)
// bit k is 1 if lane k of a > lane k of b

Useful for branching on SIMD comparison results or counting matching elements. Each bit in the result corresponds to one vector lane.
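The bit layout described above can be modelled in Python as follows. This movemask is a reference model written for this example, with a list of bools standing in for the comparison result.

```python
def movemask(mask: list[bool]) -> int:
    """Pack one bit per lane: bit k is set when lane k of the mask is true."""
    bits = 0
    for k, lane in enumerate(mask):
        if lane:
            bits |= 1 << k
    return bits

a = [1.0, 5.0, 3.0, 9.0, 0.0, 7.0, 2.0, 8.0]
b = [2.0, 4.0, 3.0, 1.0, 1.0, 6.0, 2.0, 9.0]
bits = movemask([x > y for x, y in zip(a, b)])
print(bin(bits))             # 0b101010
print(bin(bits).count("1"))  # 3
```

Counting the set bits gives the number of matching lanes, and testing bits against 0 answers whether any lane matched.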

sat_add / sat_sub

Saturating addition and subtraction. Values clamp to the type's min/max instead of wrapping on overflow. Cross-platform (ARM NEON + x86 SSE2):

let bright: u8x16 = sat_add(pixels, boost)    // clamps at 255
let dark: u8x16 = sat_sub(pixels, reduce)     // clamps at 0

Works with i8x16, u8x16, i16x8, u16x8. Signed vs unsigned saturation is determined by the element type. Both arguments must have the same type.
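The difference between saturating and wrapping arithmetic is easiest to see on a few u8 values. The functions below are plain-Python models of the unsigned 8-bit case only, named sat_add_u8 and sat_sub_u8 for this example.

```python
def sat_add_u8(a: list[int], b: list[int]) -> list[int]:
    """Lane-wise unsigned 8-bit add that clamps at 255 instead of wrapping."""
    return [min(x + y, 255) for x, y in zip(a, b)]

def sat_sub_u8(a: list[int], b: list[int]) -> list[int]:
    """Lane-wise unsigned 8-bit subtract that clamps at 0 instead of wrapping."""
    return [max(x - y, 0) for x, y in zip(a, b)]

pixels = [10, 200, 250, 255]
print(sat_add_u8(pixels, [20] * 4))      # [30, 220, 255, 255]
print(sat_sub_u8(pixels, [20] * 4))      # [0, 180, 230, 235]
print([(x + 20) % 256 for x in pixels])  # wrapping add: [30, 220, 14, 19]
```

With wrapping, brightening a nearly white pixel turns it nearly black; saturation keeps it at 255.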

Masked memory operations

For tail handling, masked loads and stores read/write only the valid lanes:

let rem: i32 = n - i
let v: f32x4 = load_masked(data, i, rem)    // load only 'rem' elements
store_masked(out, i, result, rem)            // store only 'rem' elements

The rem parameter specifies how many elements (starting from lane 0) are valid. Lanes beyond rem are zero-filled on load and not written on store.
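The zero-fill and skip-write behaviour can be modelled in Python to reason about a tail by hand. These two functions are reference models written for this example with 4 lanes assumed; they are not the intrinsics themselves.

```python
def load_masked(data, i, rem, lanes=4):
    """Load up to lanes elements from data[i:]; lanes at or beyond rem read as 0."""
    return [data[i + k] if k < rem else 0.0 for k in range(lanes)]

def store_masked(out, i, vec, rem):
    """Write only the first rem lanes of vec to out[i:]."""
    for k in range(min(rem, len(vec))):
        out[i + k] = vec[k]

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
out = [-1.0] * 6
i, rem = 4, 2                       # two elements left after one full chunk
v = load_masked(data, i, rem)
store_masked(out, i, [x * 2.0 for x in v], rem)
print(v)    # [5.0, 6.0, 0.0, 0.0]
print(out)  # [-1.0, -1.0, -1.0, -1.0, 10.0, 12.0]
```

Neither function touches memory past the last valid element, which is the point of using masked operations in a tail.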

Summary table

Intrinsic                 Input                               Output      Description
splat(s)                  scalar                              vector      Broadcast to all lanes
load(ptr, i)              pointer, offset                     vector      Load vector from memory
store(ptr, i, v)          pointer, offset, vector             void        Write vector to memory
stream_store(ptr, i, v)   pointer, offset, vector or scalar   void        Non-temporal write
fence_nt()                none                                void        Store-store barrier for stream_store ordering
fma(a, b, c)              3 values                            same type   a * b + c fused
reduce_add(v)             vector                              scalar      Sum all lanes
reduce_max(v)             vector                              scalar      Max across lanes
reduce_min(v)             vector                              scalar      Min across lanes
select(m, a, b)           mask, 2 vectors                     vector      Per-lane conditional
sqrt(x)                   scalar/vector                       same type   Square root
rsqrt(x)                  scalar/vector                       same type   Reciprocal square root
min(a, b)                 2 values                            same type   Element-wise minimum
max(a, b)                 2 values                            same type   Element-wise maximum
sat_add(a, b)             2 int vectors                       same type   Saturating addition
sat_sub(a, b)             2 int vectors                       same type   Saturating subtraction
movemask(m)               bool vector                         i32         Extract lane mask to bits

Specification

This page is the Eä language and library specification for eacompute. It is the normative reference for what the compiler accepts and the semantics of every built-in operation.

The specification is structured as the following parts:

  • Type System — scalar, vector, pointer, and struct types; type rules including integer/float literal defaults and forbidden implicit conversions.
  • All Intrinsics — every built-in function: memory (load/store/gather), math, reduction, vector, conversion, debug.
  • CLI Reference — the ea driver: compile, bind, inspect, print-target; target-feature flags (--avx512, --fp16, --i8mm, --dotprod); cross-compilation triple syntax.
  • ARM / NEON — AArch64-specific intrinsic surface, vector-width constraints, cross-compilation recipe, and the portable-kernels pattern.
  • Binding Annotations -- out qualifier, length-collapse via [cap: n], and the .ea.json metadata schema consumed by language bindings.
  • Python API — the ea bind --python-generated module surface.
  • ea bench — manifest schema, harness JSONL contract, baseline diff semantics.

Normative vs informative

The seven documents above are normative: the compiler's behavior is what they describe, and divergence is a bug in either the compiler or the spec.

The Guide and the Cookbook are informative — teaching material and worked examples. Where they conflict with the spec, the spec wins.

Stability and deprecation

The intrinsic and library API surface follows a deprecation cycle:

  1. Minor release. The old name continues to work and emits a deprecation warning at every call site. The replacement is registered in src/typeck/deprecations.rs, and a migration entry lands in the upcoming-major-release file under docs/migrations/.
  2. Major release. The old name is removed. Callers who ignored the warning now get an unknown intrinsic error; the migration file is the canonical recipe.

The cargo public-api CI gate prevents accidental Rust-side public-API drift (snapshot at docs/public-api.txt).

Version

This specification reflects eacompute on main. For the spec at a specific tagged version, browse the same files from that tag.

Type System

Eä is statically typed with no implicit conversions. Every variable, parameter, and expression has a concrete type known at compile time.

Scalar Types

Type   Size      Description
i8     1 byte    Signed 8-bit integer
u8     1 byte    Unsigned 8-bit integer
i16    2 bytes   Signed 16-bit integer
u16    2 bytes   Unsigned 16-bit integer
i32    4 bytes   Signed 32-bit integer
u32    4 bytes   Unsigned 32-bit integer
i64    8 bytes   Signed 64-bit integer
u64    8 bytes   Unsigned 64-bit integer
f32    4 bytes   32-bit float (IEEE 754)
f64    8 bytes   64-bit float (IEEE 754)
bool   1 byte    Boolean (true or false)

Integer literals default to i32. Float literals default to f32. Use explicit casts (to_f64(x), to_i64(x)) to convert between types.

Vector Types

Vector types hold multiple lanes of the same scalar type. Element-wise operations use dot-operators (.+, .-, .*, ./, etc.).

128-bit Vectors -- SSE (x86) / NEON (ARM)

Type    Lanes   Element   Size
f32x4   4       f32       16 bytes
f64x2   2       f64       16 bytes
i32x4   4       i32       16 bytes
u32x4   4       u32       16 bytes
i16x8   8       i16       16 bytes
u16x8   8       u16       16 bytes
i8x16   16      i8        16 bytes
u8x16   16      u8        16 bytes
u64x2   2       u64       16 bytes

256-bit Vectors -- AVX2 (x86 only)

Type | Lanes | Element | Size
f32x8 | 8 | f32 | 32 bytes
f64x4 | 4 | f64 | 32 bytes
i32x8 | 8 | i32 | 32 bytes
i16x16 | 16 | i16 | 32 bytes
u16x16 | 16 | u16 | 32 bytes
i8x32 | 32 | i8 | 32 bytes
u8x32 | 32 | u8 | 32 bytes
u64x4 | 4 | u64 | 32 bytes

These types produce a compile error on ARM targets.

512-bit Vectors -- AVX-512 (x86, --avx512 flag required)

f32x16, f64x8, i32x16, and u64x8 lower to AVX-512F instructions. i8x64, u8x64, i16x32, and u16x32 additionally require AVX-512BW (byte/word SIMD), which is present on Skylake-SP, Ice Lake, Zen 4, and later. Eä emits both feature flags when --avx512 is set.

Type | Lanes | Element | Size | Feature
f32x16 | 16 | f32 | 64 bytes | AVX-512F
f64x8 | 8 | f64 | 64 bytes | AVX-512F
i32x16 | 16 | i32 | 64 bytes | AVX-512F
u64x8 | 8 | u64 | 64 bytes | AVX-512F
i16x32 | 32 | i16 | 64 bytes | AVX-512BW
u16x32 | 32 | u16 | 64 bytes | AVX-512BW
i8x64 | 64 | i8 | 64 bytes | AVX-512BW
u8x64 | 64 | u8 | 64 bytes | AVX-512BW

Using these types without --avx512 produces a compile error.

Half-precision and sub-128-bit narrow widths

The lexer also accepts f16x4 / f16x8 (half-precision float vectors) and the sub-128-bit narrow widths used by ARM NEON widening intrinsics: i8x4, i8x8, u8x8, i16x4, u16x4, i32x2. These types have target-specific constraints: f16 requires --fp16 on aarch64 and is unavailable on plain x86, and the narrow widths exist primarily as inputs to widening multiplies such as wmul_i32(i16x4, i16x4) -> i32x4. The constraints are documented alongside the intrinsics that consume these types rather than as standalone table rows. See ARM / NEON reference for f16 and All Intrinsics for the narrow-width consumers.

Pointer Types

Pointers represent caller-provided memory. Eä never allocates -- all memory comes from the host language.

Syntax | Description
*T | Immutable pointer to T
*mut T | Mutable pointer to T
*restrict T | Immutable pointer, no-alias guarantee
*restrict mut T | Mutable pointer, no-alias guarantee

The restrict qualifier tells the compiler that the pointer does not alias other pointers, enabling stronger optimizations. Use it when you can guarantee non-overlapping memory.
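
A host-side check can back up the non-overlap guarantee before a kernel that takes restrict pointers is called. The sketch below is illustrative only: ranges_overlap is a hypothetical helper, not part of the ea package, and it compares raw byte ranges obtained through ctypes.

```python
import ctypes


def ranges_overlap(addr_a: int, nbytes_a: int, addr_b: int, nbytes_b: int) -> bool:
    """Return True if the two half-open byte ranges share at least one byte."""
    return addr_a < addr_b + nbytes_b and addr_b < addr_a + nbytes_a


# One 16-float allocation, viewed as different sub-ranges.
buf = (ctypes.c_float * 16)()
base = ctypes.addressof(buf)
elem = ctypes.sizeof(ctypes.c_float)  # 4 bytes per f32

# Elements 0..7 vs. elements 8..15: disjoint, so restrict would be a valid promise.
halves_overlap = ranges_overlap(base, 8 * elem, base + 8 * elem, 8 * elem)

# Elements 0..7 vs. elements 4..11: they share elements 4..7, so restrict is not safe.
shifted_overlap = ranges_overlap(base, 8 * elem, base + 4 * elem, 8 * elem)
```

If the check reports an overlap, the non-restrict pointer forms are the safe choice.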

Struct Types

User-defined value types declared with struct:

struct Pixel {
    r: f32,
    g: f32,
    b: f32,
    a: f32,
}

Structs are passed by value. Access fields with dot syntax: pixel.r. Structs can contain scalar types, vector types, and other structs.

Type Rules

  • No implicit conversions between types. Use to_f32(), to_i32(), etc.
  • No generics or polymorphism. Write separate functions for each type.
  • Vector dot-operators require both operands to have the same vector type.
  • Comparisons on vectors (.==, .<, .>) produce boolean vectors, not scalar bools.

All Intrinsics

Complete reference for every built-in function in Eä.

Memory

load

Load a vector from a pointer at a byte offset. The return type is inferred from context.

let v: f32x8 = load(ptr, i);

Typed Scalar Loads

Load a single scalar value from a pointer at a byte offset.

Intrinsic | Return Type
load_f32(ptr, i) | f32
load_f64(ptr, i) | f64
load_i32(ptr, i) | i32
load_i16(ptr, i) | i16
load_i8(ptr, i) | i8
load_u8(ptr, i) | u8
load_u16(ptr, i) | u16
load_u32(ptr, i) | u32
load_u64(ptr, i) | u64

let x: f32 = load_f32(ptr, i);

Typed Vector Loads

Load a full vector from a pointer at a byte offset.

Intrinsic | Return Type
load_f32x4(ptr, i) | f32x4
load_f32x8(ptr, i) | f32x8
load_f32x16(ptr, i) | f32x16
load_i32x4(ptr, i) | i32x4
load_i32x8(ptr, i) | i32x8
load_i16x8(ptr, i) | i16x8
load_i8x16(ptr, i) | i8x16
load_u8x16(ptr, i) | u8x16
load_u8x32(ptr, i) | u8x32

let v: f32x8 = load_f32x8(data, i * 32);

store

Write a vector to a pointer at a byte offset.

store(out, i, result);

stream_store

Non-temporal store that bypasses the CPU cache. Use for write-only output the kernel will not read back soon. Pairs with prefetch_nta (the read-side non-temporal hint) and fence_nt (the ordering primitive).

stream_store(out, i, v)              // vector form — v is f32xN, i32xN, etc.
stream_store(out, i, scalar_value)   // scalar form (v1.15.0) — i16/u16/i32/u32/i64/u64

Target lowering (verified on LLVM 18.1.8, x86_64 Zen 4 and aarch64 Cortex-A76):

Width / form | x86 | aarch64
Vector 128-bit (f32x4, i32x4, ...) | movntps / movntdq | stnp d, d, [x] (LLVM splits the 128-bit q-register into a d-pair to use the only available aarch64 NT store)
Vector 256-bit (AVX2) | vmovntps / vmovntdq | n/a (256-bit not on NEON)
Vector 512-bit (AVX-512) | vmovntps / vmovntdq zmm | n/a
Scalar i64 / u64 | movnti (SSE2, 64-bit mode) | stnp w, w, [x] (LLVM splits the i64 into a w-pair to use stnp; emits an lsr for the high half)
Scalar i32 / u32 | movnti (SSE2) | plain str (NT hint silently dropped)
Scalar i16 / u16 | regular mov (NT hint silently dropped) | plain strh (NT hint silently dropped)

aarch64 has no scalar non-temporal store instruction. The only NT store on aarch64 is stnp (Store Non-temporal Pair), which requires two operands. LLVM 18 honors !nontemporal only when it can synthesize an stnp:

  • 64-bit scalars (i64/u64) self-pair to a w register pair — NT hint preserved.
  • 128-bit vectors self-pair to a d register pair — NT hint preserved.
  • 32-bit and 16-bit scalars have no pair-friendly form — LLVM emits plain str / strh and the NT hint is dropped silently.

For aarch64 NT semantics, prefer 64-bit or wider element widths. The i32/u32 and i16/u16 scalar overloads still type-check and run, but provide no cache-bypass benefit on aarch64; the same is true of i16/u16 on x86 (no movnti16 exists). They ship for cross-platform shape symmetry: a single Eä kernel using stream_store compiles and runs on both targets without per-width branching.

Note also that LLVM 18 does not fuse two adjacent stream_store(*mut i32, ...) calls into a single stnp pair on aarch64 — they lower to a regular stp with the NT hint dropped. If you need NT-paired stores on aarch64, write the data as i64 (two 32-bit values packed) or as a vector type.

Alignment contract:

Vector stream_store requires the destination pointer plus byte offset to be aligned to the vector's natural size (16 bytes for 128-bit, 32 bytes for 256-bit, 64 bytes for 512-bit). Misaligned NT vector stores raise a general protection fault on x86. Scalar stream_store requires natural alignment to the scalar size on x86 (4-byte for i32/u32, 8-byte for i64/u64). Callers must provide aligned buffers; Eä does not insert runtime alignment checks.
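
Because Eä inserts no runtime alignment checks, the host has to establish alignment itself. A minimal sketch of how that could look in Python with ctypes follows; is_aligned and aligned_offset are hypothetical helper names, and the over-allocate-then-skip approach is one option among several.

```python
import ctypes


def is_aligned(address: int, byte_offset: int, width_bytes: int) -> bool:
    """True if address + byte_offset is a multiple of width_bytes."""
    return (address + byte_offset) % width_bytes == 0


def aligned_offset(address: int, width_bytes: int) -> int:
    """Smallest non-negative byte offset that brings address to a width_bytes boundary."""
    return (-address) % width_bytes


# Over-allocate by one vector width, then skip to the first 64-byte boundary
# (the requirement for 512-bit vectors; use 32 or 16 for narrower vectors).
payload_bytes = 1024
raw = ctypes.create_string_buffer(payload_bytes + 64)
base = ctypes.addressof(raw)
start = aligned_offset(base, 64)
```

The region beginning at base + start is then suitable as a destination for vector stream_store at byte offsets that are themselves multiples of the vector size.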

Ordering contract:

NT stores are weakly ordered on x86 (write-combining memory order). Other cores or subsequent reads in the same thread may observe them out of program order. For cross-thread visibility, the typical pathway is through a host-side synchronization primitive after the kernel returns (pthread_join, rayon::scope, WaitGroup.Wait) — these provide release semantics that flush WC buffers. For intra-kernel ordering (writing then reading the same memory in the same kernel call), use fence_nt() explicitly. Eä does not insert an implicit fence at kernel return.

When NOT to use:

Do not use stream_store for working buffers the same kernel reads back. The non-temporal hint asks the cache to not keep the line; if the kernel reads the data soon afterward, the read goes to DRAM and is slower than a regular store followed by a cache hit. Working-buffer examples that should use plain store:

  • Softmax accumulators (e.g. scores_buf in attention kernels)
  • FWHT scratch arrays (e.g. scratch in JL-projection kernels)
  • Per-iteration partial sums or running statistics

stream_store is appropriate when the destination is a final output passed to the next kernel call, a memory region the current kernel never re-reads, or a buffer that will not be touched again until a downstream consumer pulls it from DRAM later.

fence_nt

Store-store memory barrier providing intra-kernel ordering of preceding stream_store operations. Zero arguments, returns void.

fence_nt()

Target lowering:

Target | Instruction
x86 | sfence (via @llvm.x86.sse.sfence)
aarch64 | dmb ishst (via @llvm.aarch64.dmb with operand 10)

These are the narrowest available barriers for store-only ordering — explicit target intrinsics rather than the IR-level fence release, which would lower to mfence on x86 and dmb ish on aarch64 (both heavier than needed for NT-store ordering).

Semantics:

fence_nt() orders stream_store writes relative to each other and relative to subsequent regular stores. It does not order stores relative to subsequent loads — for a write-then-read-back pattern in the same kernel, a full barrier (mfence on x86, dmb sy on aarch64) is needed instead. Eä does not currently expose a full-barrier intrinsic.

When to use:

Use fence_nt() when the same kernel writes via stream_store to multiple non-overlapping regions in a defined order and a later kernel (or downstream reader) relies on observing those writes in the same order. This is uncommon — most callers don't need it, because cross-thread visibility comes from the host's sync primitive (pthread_join, rayon::scope, WaitGroup.Wait) which already provides release semantics that flush write-combining buffers between threads.

When NOT to use:

  • Between successive stream_store calls to different addresses when no store-ordering requirement exists. (NT stores to the same address complete in program order regardless of fences, so they never need one either.)
  • At the end of a kernel as "insurance". The caller's sync primitive handles cross-thread fencing more efficiently after the kernel returns.
  • For write-then-read-back patterns. fence_nt() does not provide store-to-load ordering. Use a regular store for the working data, or do not read NT-written data back in the same kernel.

load_masked

Masked vector load. Lanes where the mask is false are not loaded.

let v: f32x8 = load_masked(ptr, i, mask);

store_masked

Masked vector store. Only lanes where the mask is true are written.

store_masked(out, i, value, mask);

gather

Load elements from scattered memory addresses using an index vector. x86 only -- not available on ARM.

let v: f32x8 = gather(ptr, indices);

scatter

Store elements to scattered memory addresses using an index vector. AVX-512 only (--avx512 flag required).

scatter(ptr, indices, values);

prefetch

Issue a read-intent prefetch hint to bring data into all cache levels (T0).

prefetch(ptr, i)

Lowers to prefetcht0 on x86 / prfm pldl1keep on aarch64.

prefetch_write

Issue a write-intent prefetch hint. Signals the cache coherence protocol to acquire the target line in modified state ahead of the store, avoiding a read-for-ownership stall when the store retires. Use on the upcoming write target of memory-bound store-heavy kernels (e.g. chacha20 ciphertext output, dequantize destinations).

prefetch_write(ptr, i)

Lowers to prefetchw on x86 (requires PRFCHW CPUID; falls back to prefetcht0 on older CPUs) / prfm pstl1keep on aarch64.

prefetch_nta

Issue a non-temporal prefetch hint — bring the line into L1 only and mark it for early eviction. Use for streaming reads the kernel touches exactly once and shouldn't pollute L1/L2 with (e.g. Q4 dequantize input, large one-pass scans).

prefetch_nta(ptr, i)

Lowers to prefetchnta on x86 / prfm pldl1strm on aarch64.

Prefetch hint summary

Intrinsic | (rw, locality) | x86 | aarch64
prefetch | (0, 3) | prefetcht0 | prfm pldl1keep
prefetch_write | (1, 3) | prefetchw | prfm pstl1keep
prefetch_nta | (0, 0) | prefetchnta | prfm pldl1strm

All three accept (ptr, integer-offset), return void, and are valid in any expression-statement position inside a function body.

Math

sqrt

Square root. Works on scalar f32/f64 and all float vector types.

let y: f32 = sqrt(x);
let v: f32x8 = sqrt(vec);

rsqrt

Reciprocal square root (approximate). Scalar f32 and f32 vector types.

let y: f32 = rsqrt(x);
let v: f32x8 = rsqrt(vec);

exp

Exponential function. Float types.

let y: f32 = exp(x);

fma

Fused multiply-add: computes a * b + c in a single operation with one rounding step. Works on scalar f32 and all float vector types.

let y: f32 = fma(a, b, c);
let v: f32x8 = fma(va, vb, vc);

min

Element-wise minimum. Works on scalar (i32, f32, f64) and vector types.

let m: i32 = min(a, b);
let v: f32x8 = min(va, vb);

max

Element-wise maximum. Works on scalar (i32, f32, f64) and vector types.

let m: i32 = max(a, b);
let v: f32x8 = max(va, vb);

Reduction

Reduce a vector to a single scalar value.

reduce_add

Sum all lanes.

let sum: f32 = reduce_add(v);   // f32x8 -> f32
let sum: i32 = reduce_add(iv);  // i32x8 -> i32

reduce_max

Maximum across all lanes.

let m: f32 = reduce_max(v);

reduce_min

Minimum across all lanes.

let m: f32 = reduce_min(v);

reduce_add_fast

Unordered float reduction. Faster than reduce_add but does not guarantee summation order, so results may differ slightly due to floating-point rounding. Float vectors only.

let sum: f32 = reduce_add_fast(v);
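
Why summation order matters can be seen in a small Python model. It is an illustration only: Python floats are 64-bit rather than f32, the left-to-right order stands in for an ordered reduction, and the pairwise tree is just one schedule an unordered reduction might pick. Both function names are hypothetical.

```python
def reduce_add_ordered(lanes):
    """Sum the lanes strictly left to right."""
    total = 0.0
    for x in lanes:
        total += x
    return total


def reduce_add_tree(lanes):
    """Sum the lanes pairwise, halving the vector each step (lane count must be a power of two)."""
    work = list(lanes)
    while len(work) > 1:
        work = [work[i] + work[i + 1] for i in range(0, len(work), 2)]
    return work[0]


lanes = [1e16, 1.0, -1e16, 1.0]
ordered = reduce_add_ordered(lanes)  # ((1e16 + 1) - 1e16) + 1
tree = reduce_add_tree(lanes)        # (1e16 + 1) + (-1e16 + 1)
```

Here ordered evaluates to 1.0 and tree to 0.0: each 1.0 added directly to a value of magnitude 1e16 is lost to rounding, and the two schedules lose different terms.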

Vector

splat

Broadcast a scalar value to all lanes of a vector. The vector type is inferred from context.

let v: f32x8 = splat(1.0);

shuffle

Compile-time index-driven vector shuffle. Two forms:

Single-source — permute lanes within one vector.

let reversed: f32x4 = shuffle(v, [3, 2, 1, 0])

Each index is in [0, width).

Two-source — pick lanes from two vectors of the same type.

let zipped: f32x8 = shuffle(a, b, [0, 8, 1, 9, 2, 10, 3, 11])

Indices in [0, width) select from a; indices in [width, 2 * width) select lane i - width from b. Common patterns: interleave (lower-half zip), blend (pick lanes by index), and concatenate-permute (lower-half from one source, upper from the other with permutation).
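
The index rule for both forms can be written as a short reference model. This is a Python sketch for reasoning about index lists, not the compiler's implementation; shuffle2 is a hypothetical name.

```python
def shuffle2(a, b, indices):
    """Two-source shuffle: index i < width picks a[i], otherwise b[i - width]."""
    width = len(a)
    if len(b) != width:
        raise ValueError("both sources must have the same lane count")
    out = []
    for i in indices:
        if not 0 <= i < 2 * width:
            raise ValueError(f"index {i} outside [0, {2 * width})")
        out.append(a[i] if i < width else b[i - width])
    return out


a = [0, 1, 2, 3, 4, 5, 6, 7]
b = [10, 11, 12, 13, 14, 15, 16, 17]

# Lower-half zip, matching the index list in the example above.
zipped = shuffle2(a, b, [0, 8, 1, 9, 2, 10, 3, 11])

# The single-source form is the special case where every index is below width.
reversed4 = shuffle2([1, 2, 3, 4], [0, 0, 0, 0], [3, 2, 1, 0])
```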

permute_runtime

Runtime-indexed permute of an 8-lane vector. result[k] = table[indices[k] & 0x7]. Index lanes are 3-bit-masked by the hardware — the upper 29 bits of each i32 index are ignored. Out-of-range indices wrap; they do not trap.

  • x86: single instruction (vpermps for f32, vpermd for i32). Requires AVX2.
  • ARM (NEON): not supported. Compile-time error pointing at the NEON runtime-permute idiom.

let table: f32x8 = load(matrix_row, 0)   // 6 active + 2 don't-care
let indices: i32x8 = load(types_v, 0)    // values in [0..5]
let strengths: f32x8 = permute_runtime(table, indices)

Signature: (f32x8, i32x8) -> f32x8, (i32x8, i32x8) -> i32x8

See also: shuffle (compile-time indices), gather (pointer-indexed), shuffle_bytes (byte-level, 16-byte table, cross-platform).
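
The index masking of permute_runtime can be checked against a scalar model. The Python function below only mirrors the formula result[k] = table[indices[k] & 0x7]; it shares the intrinsic's name for readability and is not part of any Eä package.

```python
def permute_runtime(table, indices):
    """Reference model: each result lane reads table at the low 3 bits of its index."""
    if len(table) != 8 or len(indices) != 8:
        raise ValueError("both operands must have 8 lanes")
    return [table[i & 0x7] for i in indices]


table = [10, 20, 30, 40, 50, 60, 70, 80]

# 8 and 13 are out of range and wrap to 0 and 5; -1 has low bits 0b111 and reads lane 7.
indices = [0, 5, 8, 13, -1, 7, 2, 3]
result = permute_runtime(table, indices)
```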

select

Per-lane conditional select. Where the mask is true, take from a; where false, take from b.

let result: f32x8 = select(mask, a, b);

movemask

Extract a comparison result bitmask from a boolean vector to a scalar i32. Each bit corresponds to the sign bit of one lane. x86 only -- not available on ARM.

let bits: i32 = movemask(cmp_result);

Conversion

Scalar Casts

Intrinsic | Description
to_f32(x) | Convert to f32
to_f64(x) | Convert to f64
to_i32(x) | Convert to i32
to_i64(x) | Convert to i64

let f: f32 = to_f32(i);
let n: i32 = to_i32(x);

Widening Conversions

Widen narrow integer lanes to wider float or integer lanes. Only the first N lanes of the input are consumed.

Intrinsic | Input | Output
widen_i8_f32x4(v) | i8x16 | f32x4
widen_u8_f32x4(v) | u8x16 | f32x4
widen_i8_f32x8(v) | i8x16 | f32x8
widen_u8_f32x8(v) | u8x16 | f32x8
widen_i8_f32x16(v) | i8x16 | f32x16
widen_u8_f32x16(v) | u8x16 | f32x16
widen_u8_i32x4(v) | u8x16 | i32x4
widen_u8_i32x8(v) | u8x16 | i32x8
widen_u8_i32x16(v) | u8x16 | i32x16

let pixels: f32x8 = widen_u8_f32x8(raw_bytes);

Lane-offset variants

The _4, _8, _12 suffixes select which 4 bytes of the input to widen, eliminating the need for a shuffle before widening:

Intrinsic | Input | Output | Bytes used
widen_u8_f32x4_4(v) | u8x16 | f32x4 | 4-7
widen_u8_f32x4_8(v) | u8x16 | f32x4 | 8-11
widen_u8_f32x4_12(v) | u8x16 | f32x4 | 12-15
widen_i8_f32x4_4(v) | i8x16 | f32x4 | 4-7
widen_i8_f32x4_8(v) | i8x16 | f32x4 | 8-11
widen_i8_f32x4_12(v) | i8x16 | f32x4 | 12-15
widen_u8_i32x4_4(v) | u8x16 | i32x4 | 4-7
widen_u8_i32x4_8(v) | u8x16 | i32x4 | 8-11
widen_u8_i32x4_12(v) | u8x16 | i32x4 | 12-15

Process all 16 bytes of a u8x16 as 4 groups of f32x4 without any shuffles:

let f0: f32x4 = widen_u8_f32x4(v)      // bytes 0-3
let f1: f32x4 = widen_u8_f32x4_4(v)    // bytes 4-7
let f2: f32x4 = widen_u8_f32x4_8(v)    // bytes 8-11
let f3: f32x4 = widen_u8_f32x4_12(v)   // bytes 12-15

Narrowing Conversions

Convert wider lanes to narrower lanes, with clamping and rounding.

Intrinsic | Input | Output
narrow_f32x4_i8(v) | f32x4 | i8 (4 bytes)

let packed = narrow_f32x4_i8(float_pixels);

Multiply-Add Byte Pairs

Multiply unsigned bytes by signed bytes and add adjacent pairs. x86 only.

Intrinsic | Signature | Description
maddubs_i16(a, b) | (u8x16, i8x16) -> i16x8 | Multiply and add adjacent pairs to 16-bit
maddubs_i32(a, b) | (u8x16, i8x16) -> i32x4 | Multiply and add adjacent quads to 32-bit

let products: i16x8 = maddubs_i16(unsigned_bytes, signed_weights);

vdot_i32

Signed integer dot product: multiplies groups of 4 i8 pairs and sums each group into one i32 lane. ARM only -- requires --dotprod flag (ARMv8.2-A dot product extension). Maps to NEON sdot.

let dot: i32x4 = vdot_i32(activations, weights);
acc = acc .+ vdot_i32(a, b);  // accumulate explicitly

Signature: (i8x16, i8x16) -> i32x4
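
A scalar reference model makes the grouping concrete. In this Python sketch, lane l of the result sums the four products of input bytes 4*l through 4*l+3; the function reuses the intrinsic's name for readability only and is not part of any Eä package.

```python
def vdot_i32(a, b):
    """Reference model: four i32 lanes, each the dot product of one group of 4 i8 pairs."""
    if len(a) != 16 or len(b) != 16:
        raise ValueError("both inputs must have 16 lanes")
    if any(not -128 <= x <= 127 for x in list(a) + list(b)):
        raise ValueError("lanes must fit in i8")
    return [sum(a[4 * lane + k] * b[4 * lane + k] for k in range(4)) for lane in range(4)]


activations = list(range(16))   # 0, 1, ..., 15
weights = [1] * 16
dot = vdot_i32(activations, weights)

# Explicit accumulation, mirroring `acc = acc .+ vdot_i32(a, b)`.
acc = [100, 100, 100, 100]
acc = [x + y for x, y in zip(acc, dot)]
```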

I8MM Matrix Multiply

Matrix multiply-accumulate on int8 data. ARM only -- requires --i8mm flag (ARMv8.6-A I8MM extension). Available on Cortex-A78+, Apple M1+.

Intrinsic | Signature | Description
smmla_i32(acc, a, b) | (i32x4, i8x16, i8x16) -> i32x4 | Signed x signed
ummla_i32(acc, a, b) | (i32x4, u8x16, u8x16) -> i32x4 | Unsigned x unsigned
usmmla_i32(acc, a, b) | (i32x4, u8x16, i8x16) -> i32x4 | Unsigned x signed

The accumulator is the first argument. Each instruction performs a 2x8 x 8x2 matrix multiply and adds the result to the accumulator. Use splat(0) as accumulator for the first iteration.

let zero: i32x4 = splat(0);
let result: i32x4 = smmla_i32(zero, activations, weights);
// accumulate over multiple chunks:
acc = smmla_i32(acc, next_a, next_b);
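
The 2x8 x 8x2 product can be spelled out with a scalar model. The lane layout used here is an assumption taken from Arm's definition of the SMMLA instruction rather than from this reference: a holds two rows of 8, b holds the two columns of the 8x2 matrix as two contiguous groups of 8, and the 2x2 result is row-major. The Python function shares the intrinsic's name only for readability.

```python
def smmla_i32(acc, a, b):
    """Reference model: acc[2*i + j] += dot(row i of a, column j of b)."""
    if len(acc) != 4 or len(a) != 16 or len(b) != 16:
        raise ValueError("expected i32x4 accumulator and two 16-lane inputs")
    out = list(acc)
    for i in range(2):          # row of the 2x8 matrix
        for j in range(2):      # column of the 8x2 matrix
            out[2 * i + j] += sum(a[8 * i + k] * b[8 * j + k] for k in range(8))
    return out


a = [1] * 8 + [2] * 8   # rows: all ones, all twos
b = [1] * 8 + [3] * 8   # columns: all ones, all threes

first = smmla_i32([0, 0, 0, 0], a, b)   # zero accumulator for the first chunk
second = smmla_i32(first, a, b)         # accumulate a second chunk
```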

Widening Multiply

Multiply narrow integer lanes and produce wider output. ARM only (base NEON). Input types are 64-bit NEON vectors.

Intrinsic | Signature | Description
wmul_i16(a, b) | (i8x8, i8x8) -> i16x8 | Signed 8-bit to 16-bit
wmul_u16(a, b) | (u8x8, u8x8) -> u16x8 | Unsigned 8-bit to 16-bit
wmul_i32(a, b) | (i16x4, i16x4) -> i32x4 | Signed 16-bit to 32-bit
wmul_u32(a, b) | (u16x4, u16x4) -> u32x4 | Unsigned 16-bit to 32-bit

let wide: i16x8 = wmul_i16(bytes_a, bytes_b);

Absolute Difference

Element-wise |a - b|. ARM only (base NEON). Maps to a single instruction (sabd/uabd).

Intrinsic | Supported Types
abs_diff(a, b) | i8x16, u8x16, i16x8, u16x8, i32x4, u32x4

let diff: u8x16 = abs_diff(frame_a, frame_b);

Saturating Arithmetic

Addition and subtraction that clamp to the type's min/max instead of wrapping on overflow. Cross-platform (ARM NEON + x86 SSE2).

Intrinsic | Supported Types
sat_add(a, b) | i8x16, u8x16, i16x8, u16x8
sat_sub(a, b) | i8x16, u8x16, i16x8, u16x8

Signed vs unsigned saturation is determined by the element type. Both arguments must have the same type.

let bright: u8x16 = sat_add(pixels, boost);    // clamps at 255, never wraps
let dark: u8x16 = sat_sub(pixels, reduce);     // clamps at 0, never wraps
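
Clamping versus wrapping is easy to compare in a scalar model. The Python sketch below takes the element range as explicit bounds; the U8 and I8 constants and the function signatures are illustrative, not an Eä API.

```python
U8 = (0, 255)
I8 = (-128, 127)


def sat_add(a, b, lo, hi):
    """Lane-wise a + b, clamped to [lo, hi] instead of wrapping."""
    return [max(lo, min(hi, x + y)) for x, y in zip(a, b)]


def sat_sub(a, b, lo, hi):
    """Lane-wise a - b, clamped to [lo, hi] instead of wrapping."""
    return [max(lo, min(hi, x - y)) for x, y in zip(a, b)]


bright = sat_add([250, 10], [10, 5], *U8)    # 260 clamps to 255
dark = sat_sub([5, 200], [10, 100], *U8)     # -5 clamps to 0
wrapped = [(x + y) % 256 for x, y in zip([250, 10], [10, 5])]  # what wrapping u8 addition gives
```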

shuffle_bytes

Byte-level table lookup: each byte in indices selects a byte from table. Cross-platform. x86: SSSE3 pshufb. ARM: NEON tbl. Out-of-range indices (>15) zero the lane on both platforms.

let result: u8x16 = shuffle_bytes(table, indices);

Signature: (u8x16, u8x16) -> u8x16
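
The lookup rule, including the zeroing of out-of-range indices described above, fits in a few lines of Python. This is a reference model for reasoning about index vectors; it reuses the intrinsic's name for readability and is not the compiler's lowering.

```python
def shuffle_bytes(table, indices):
    """Reference model: lane k is table[indices[k]], or 0 when the index exceeds 15."""
    if len(table) != 16 or len(indices) != 16:
        raise ValueError("both operands must have 16 lanes")
    return [table[i] if i <= 15 else 0 for i in indices]


table = list(range(100, 116))            # byte values 100..115
indices = [15, 0, 16, 255] + [1] * 12    # 16 and 255 are out of range
result = shuffle_bytes(table, indices)
```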

Rounding & Packing

Intrinsic | Signature | Description | Platform
round_f32x4_i32x4 | (f32x4) -> i32x4 | Round-to-nearest-even. x86: cvtps2dq. ARM: fcvtns. | cross-platform
pack_sat_i32x4 | (i32x4, i32x4) -> i16x8 | Saturating narrow. x86: packssdw. ARM: sqxtn. | cross-platform
pack_sat_i16x8 | (i16x8, i16x8) -> i8x16 | Saturating narrow. x86: packsswb. ARM: sqxtn. | cross-platform
round_f32x8_i32x8 | (f32x8) -> i32x8 | Round-to-nearest-even float to integer. x86: vcvtps2dq (AVX2). | x86-only
pack_sat_i32x8 | (i32x8, i32x8) -> i16x16 | Saturating narrow two i32x8 into i16x16. x86: vpackssdw (AVX2). | x86-only
pack_sat_i16x16 | (i16x16, i16x16) -> i8x32 | Saturating narrow two i16x16 into i8x32. x86: vpacksswb (AVX2). | x86-only

pack_sat_i32x8 and pack_sat_i16x16 emit a vpermq fixup after the AVX2 pack to produce sequential output [a0..a7, b0..b7], matching what the 128-bit variants produce.
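
The need for the fixup shows up in a model of the data movement. The Python sketch below is not the compiler's code generation: the function names mirror the instructions for readability, and the 0xD8 immediate (64-bit quadrant order 0, 2, 1, 3) is an assumption about how the vpermq is parameterized.

```python
def sat16(x):
    """Clamp to the i16 range."""
    return max(-32768, min(32767, x))


def vpackssdw_256(a, b):
    """AVX2 pack works per 128-bit lane: [a0..a3, b0..b3 | a4..a7, b4..b7]."""
    low = [sat16(x) for x in a[0:4]] + [sat16(x) for x in b[0:4]]
    high = [sat16(x) for x in a[4:8]] + [sat16(x) for x in b[4:8]]
    return low + high


def vpermq_0xd8(v):
    """Reorder the four 64-bit quadrants (4 x i16 each) as 0, 2, 1, 3."""
    q = [v[i:i + 4] for i in range(0, 16, 4)]
    return q[0] + q[2] + q[1] + q[3]


def pack_sat_i32x8(a, b):
    """Pack followed by the quadrant fixup, giving sequential [a0..a7, b0..b7]."""
    return vpermq_0xd8(vpackssdw_256(a, b))


a = [0, 1, 2, 3, 4, 5, 6, 7]
b = [10, 11, 12, 13, 14, 15, 16, 17]
raw = vpackssdw_256(a, b)     # interleaved by 128-bit lane
fixed = pack_sat_i32x8(a, b)  # sequential
```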

Debug

println

Print a value to stdout. Accepts scalars (i32, i64, u8, u16, u32, u64, f32, f64, bool), string literals, and vector types. Lowers to C printf. No format strings.

println(42);
println(3.14);
println("hello");
println(my_vector);

CLI Reference

The ea binary provides three commands: compile (default), bind, and inspect.

Compile

ea <file.ea> [flags]

Compile an Eä source file to a native object file (.o) by default.

Flags

Flag | Effect
-o <name> | Link the object file into an executable via cc
--lib | Produce a shared library (.so/.dll) and metadata (.ea.json)
--opt-level=N | Optimization level 0--3 (default: 3)
--avx512 | Enable AVX-512 vector types and intrinsics. Errors on ARM targets
--dotprod | Enable ARMv8.2-A dot product extension (vdot_i32). ARM targets only
--i8mm | Enable ARMv8.6-A I8MM extension (smmla_i32, ummla_i32, usmmla_i32). ARM targets only
--target=CPU | LLVM CPU name, e.g. skylake, znver3, native (default: native)
--target-triple=T | Cross-compile to a different architecture, e.g. aarch64-unknown-linux-gnu
--emit-llvm | Write LLVM IR to a .ll file and print it to stdout
--emit-asm | Write assembly to a .s file
--header | Generate a C header (.h) for the exported functions
--emit-ast | Print the parsed AST. Does not require LLVM
--emit-tokens | Print the token stream. Does not require LLVM
--help / -h | Print usage information
--version / -V | Print compiler version

Examples

# Compile to object file
ea kernel.ea

# Compile and link to executable
ea kernel.ea -o kernel

# Build shared library for Python/Rust/C++ consumption
ea kernel.ea --lib

# Cross-compile for ARM
ea kernel.ea --lib --target-triple=aarch64-unknown-linux-gnu

# Emit LLVM IR for debugging
ea kernel.ea --emit-llvm

# Compile with AVX-512 support
ea kernel.ea --lib --avx512

# Cross-compile for ARM with dot product extension
ea kernel.ea --lib --target-triple=aarch64-unknown-linux-gnu --dotprod

# Cross-compile for ARM with I8MM matrix multiply
ea kernel.ea --lib --target-triple=aarch64-unknown-linux-gnu --i8mm

# Generate C header
ea kernel.ea --header

Compiler status output goes to stderr, so --emit-llvm stdout is clean for piping.

Bind

ea bind <file.ea> --python [--rust] [--cpp] [--pytorch] [--cmake]

Generate language bindings from a compiled kernel. At least one language flag is required.

Requires the .ea.json metadata file produced by ea <file.ea> --lib. Run the --lib compile first.

Language Flags

Flag | Output | Description
--python | <name>.py | Python wrapper using ctypes
--rust | <name>.rs | Rust FFI bindings
--cpp | <name>.hpp | C++ header with inline wrappers
--pytorch | <name>_torch.py | PyTorch custom op wrapper
--cmake | CMakeLists.txt | CMake build file for C++ integration

Example

# Full workflow: compile, then generate Python bindings
ea kernel.ea --lib
ea bind kernel.ea --python

Inspect

ea inspect <file.ea> [target flags]

Post-optimization analysis of the compiled kernel. Shows instruction mix, loop structure, vector width usage, and register pressure. Accepts the same target flags as compile (--target, --target-triple, --avx512, --dotprod, --i8mm).

Example

ea inspect kernel.ea
ea inspect kernel.ea --avx512
ea inspect kernel.ea --target-triple=aarch64-unknown-linux-gnu

ea --print-target

Print the resolved native CPU name for the current machine. Useful for understanding which CPU features the compiler will target by default.

Python API

The ea Python package (pip install ea-compiler) provides a high-level interface for compiling and loading Eä kernels.

Functions

ea.load

ea.load(path, *, target="native", opt_level=3, avx512=False) -> KernelModule

Compile an .ea file to a shared library and load it. Returns a KernelModule object with each exported function available as a method.

Parameter | Type | Default | Description
path | str | required | Path to the .ea source file
target | str | "native" | LLVM CPU name (e.g. "skylake", "native")
opt_level | int | 3 | Optimization level 0--3
avx512 | bool | False | Enable AVX-512 types and intrinsics

import ea
import numpy as np

k = ea.load("kernel.ea")
a = np.array([1.0, 2.0, 3.0], dtype=np.float32)
b = np.zeros(3, dtype=np.float32)
k.my_func(a, b, len(a))

ea.compile

ea.compile(path, *, emit_asm=False, emit_llvm=False, target="native", opt_level=3, avx512=False, lib=True) -> Path

Compile an .ea file without loading it. Returns the path to the output file. Useful when you need the .so or .ea.json for another tool.

Parameter | Type | Default | Description
path | str | required | Path to the .ea source file
emit_asm | bool | False | Also write a .s assembly file
emit_llvm | bool | False | Also write a .ll LLVM IR file
target | str | "native" | LLVM CPU name
opt_level | int | 3 | Optimization level 0--3
avx512 | bool | False | Enable AVX-512
lib | bool | True | Produce .so + .ea.json

ea.clear_cache

ea.clear_cache(path=None)

Clear the compilation cache. If path is given, clear only the cache for that .ea file. If None, clear the entire cache directory.

ea.compiler_version

ea.compiler_version() -> str

Return the version string of the ea compiler binary.

ea.__version__

ea.__version__ -> str

The version of the ea Python package.

Exceptions

ea.CompileError

Raised when compilation fails. Inherits from RuntimeError.

Attribute | Type | Description
stderr | str | Full compiler error output
exit_code | int | Compiler process exit code

try:
    k = ea.load("broken.ea")
except ea.CompileError as e:
    print(e.stderr)
    print(e.exit_code)

Caching

Compiled shared libraries are cached in a __eacache__/ directory next to the .ea source file. The cache key includes the CPU name and compiler version:

__eacache__/{cpu}-{version}/kernel.so

The cache is invalidated by file modification time (mtime). If the source file is newer than the cached library, ea.load() recompiles automatically.
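
The layout and the mtime rule can be mimicked in a few lines, which may help when debugging a stale cache. This Python sketch is a model of the behavior described above, not the ea package's code: cache_path and needs_recompile are hypothetical names, the .so suffix follows the pattern shown, and the CPU name and version are sample values.

```python
import os
import tempfile
from pathlib import Path


def cache_path(source: Path, cpu: str, version: str) -> Path:
    """Location following the __eacache__/{cpu}-{version}/<stem>.so pattern."""
    return source.parent / "__eacache__" / f"{cpu}-{version}" / (source.stem + ".so")


def needs_recompile(source: Path, cached: Path) -> bool:
    """True when no cached library exists or the source is newer than it."""
    if not cached.exists():
        return True
    return source.stat().st_mtime > cached.stat().st_mtime


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "kernel.ea"
    src.write_text("// kernel source\n")
    lib = cache_path(src, "znver4", "1.7.0")

    missing = needs_recompile(src, lib)   # nothing cached yet

    lib.parent.mkdir(parents=True)
    lib.write_bytes(b"")
    os.utime(src, (1_000, 1_000))         # source timestamp
    os.utime(lib, (2_000, 2_000))         # library built later
    fresh = needs_recompile(src, lib)

    os.utime(src, (3_000, 3_000))         # source edited after the build
    stale = needs_recompile(src, lib)
```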

Length Collapsing

When a function parameter named n, len, length, count, size, or num appears immediately after a pointer parameter and has an integer type, the Python binding automatically fills it from the array's length. You do not need to pass it explicitly.

// Eä source
export func scale(data: *mut f32, n: i32, factor: f32) { ... }

# Python: n is auto-filled from len(data)
k.scale(data, factor=2.0)

Output Allocation

Parameters annotated with out and a [cap: ...] clause are automatically allocated by the Python binding and returned as the function's result.

// Eä source
export func transform(input: *f32, n: i32, out result: *mut f32 [cap: n]) { ... }

# Python: result is allocated and returned
output = k.transform(input_array)

If [count: path] is also specified, the returned array is trimmed to the actual output length.

Thread Safety

Loaded kernel modules and their functions are safe for concurrent use from multiple threads. The compiled code is stateless -- all memory is caller-provided via arguments.

Binding Annotations

Eä generates language bindings from compiled kernel metadata. This page documents the annotation syntax, the .ea.json metadata format, and the available binding generators.

Output Annotations

Mark a parameter as caller-allocated output with the out keyword and a capacity clause:

export func filter(
    input: *f32,
    n: i32,
    out result: *mut f32 [cap: n, count: out_count],
    out out_count: *mut i32 [cap: 1],
) { ... }

Syntax

out name: *mut T [cap: <expr>]
out name: *mut T [cap: <expr>, count: <path>]

Clause | Required | Description
cap | Yes | Number of elements to allocate. Can reference other parameters (e.g. n, width * height)
count | No | Parameter or expression giving the actual output length. The binding trims the returned array to this length

In generated bindings, out parameters are not part of the function signature. They are allocated internally and returned as the function's result.

Length Collapsing

When a pointer parameter is followed immediately by an integer parameter whose name matches one of the recognized patterns, the binding generators automatically fill the integer from the array's length.

Recognized Names

n, len, length, count, size, num

Rules

  1. The integer parameter must appear immediately after a pointer parameter
  2. The integer parameter must have an integer type (i32, i64, u32, etc.)
  3. The pointer and integer are collapsed into a single array argument in the binding

// These two parameters collapse into one array argument:
export func sum(data: *f32, n: i32) -> f32 { ... }

# Python: just pass the array, n is filled automatically
result = k.sum(my_array)
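
The rules above can be expressed as a small predicate over a parameter list. This Python sketch is a model for checking which parameters would collapse, not the binding generators' code; collapsed_params is a hypothetical name, and the exact set of integer types is an assumption based on the "i32, i64, u32, etc." wording.

```python
RECOGNIZED_NAMES = {"n", "len", "length", "count", "size", "num"}
INTEGER_TYPES = {"i8", "u8", "i16", "u16", "i32", "u32", "i64", "u64"}  # assumed set


def collapsed_params(params):
    """params is a list of (name, type) pairs; return the names that get auto-filled."""
    collapsed = []
    for (_, prev_type), (name, ty) in zip(params, params[1:]):
        follows_pointer = prev_type.startswith("*")
        if follows_pointer and name in RECOGNIZED_NAMES and ty in INTEGER_TYPES:
            collapsed.append(name)
    return collapsed


# sum(data: *f32, n: i32): n collapses into the array argument.
sum_params = collapsed_params([("data", "*f32"), ("n", "i32")])

# n does not follow a pointer directly, so nothing collapses.
not_adjacent = collapsed_params([("data", "*mut f32"), ("factor", "f32"), ("n", "i32")])

# rows is not a recognized name, so it stays an explicit argument.
unrecognized = collapsed_params([("data", "*f32"), ("rows", "i32")])
```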

Metadata Format (.ea.json)

Compiling with --lib produces a .ea.json file alongside the shared library. This file describes the exported API and is consumed by ea bind.

{
  "library": "libkernel.so",
  "exports": [
    {
      "name": "scale",
      "args": [
        {"name": "data", "type": "*mut f32"},
        {"name": "n", "type": "i32"},
        {"name": "factor", "type": "f32"}
      ],
      "return_type": null
    }
  ],
  "structs": [
    {
      "name": "Point",
      "fields": [
        {"name": "x", "type": "f32"},
        {"name": "y", "type": "f32"}
      ]
    }
  ]
}

Output-annotated parameters include additional fields:

{
  "name": "result",
  "type": "*mut f32",
  "output": true,
  "cap": "n",
  "count": "out_count"
}
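
Since the metadata is plain JSON, other tools can read it with a standard parser. The sketch below uses Python's json module on an abbreviated copy of the sample above (structs left empty); describe is a hypothetical helper, and only the fields shown in this section are assumed to exist.

```python
import json

METADATA = """
{
  "library": "libkernel.so",
  "exports": [
    {
      "name": "scale",
      "args": [
        {"name": "data", "type": "*mut f32"},
        {"name": "n", "type": "i32"},
        {"name": "factor", "type": "f32"}
      ],
      "return_type": null
    }
  ],
  "structs": []
}
"""


def describe(export: dict) -> str:
    """Render one export as a readable signature string."""
    args = ", ".join(f"{a['name']}: {a['type']}" for a in export["args"])
    ret = export["return_type"]
    suffix = f" -> {ret}" if ret is not None else ""
    return f"{export['name']}({args}){suffix}"


meta = json.loads(METADATA)
signatures = [describe(e) for e in meta["exports"]]

# Output-annotated parameters carry "output": true; this sample has none.
outputs = [a["name"] for e in meta["exports"] for a in e["args"] if a.get("output")]
```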

Binding Generators

Generate bindings with ea bind <file.ea> and one or more language flags. The .ea.json file must exist (run ea <file.ea> --lib first).

Flag | Output File | Description
--python | <name>.py | Python wrapper using ctypes. Arrays as NumPy ndarrays.
--rust | <name>.rs | Rust extern "C" declarations with safe wrapper functions
--cpp | <name>.hpp | C++ header-only bindings with std::span parameters
--pytorch | <name>_torch.py | PyTorch custom op with torch.Tensor parameters
--cmake | CMakeLists.txt | CMake project for linking the shared library in C++

All generated bindings use C ABI function pointers loaded from the shared library at runtime.

ARM / NEON

Eä supports AArch64 with NEON vector instructions. This page documents the differences from x86 and how to write portable kernels.

Vector Width

ARM NEON provides 128-bit vector registers, plus 64-bit D-registers for narrower operations. Available vector types on ARM:

128-bit (standard) | 64-bit (NEON D-registers) | Not Supported
f32x4 | i8x8, u8x8 | f32x8, f32x16
f64x2 | i16x4, u16x4 | f64x4
i32x4, u32x4 | i32x2 | i32x8, i32x16
i16x8, u16x8 | | i16x16
i8x16, u8x16 | | i8x32, u8x32

64-bit vector types are ARM-only. Using them on x86 produces a compile error. 256-bit and 512-bit types are x86-only and produce a compile error on ARM. This is intentional -- Eä does not silently fall back to scalar code.

Unavailable Intrinsics

The following intrinsics are x86-only and produce a compile error on ARM:

Intrinsic | Reason
movemask(v) | No ARM equivalent for extracting lane sign bits to a bitmask
gather(ptr, indices) | No hardware gather support in NEON
scatter(ptr, indices, values) | AVX-512 only
maddubs_i16(a, b) | x86 PMADDUBSW instruction, no NEON equivalent
maddubs_i32(a, b) | x86 specific
round_f32x8_i32x8(a) | AVX2 (256-bit); use round_f32x4_i32x4 on ARM
pack_sat_i32x8(a, b) | AVX2 (256-bit); use pack_sat_i32x4 on ARM
pack_sat_i16x16(a, b) | AVX2 (256-bit); use pack_sat_i16x8 on ARM

All other intrinsics (loads, stores, math, reductions, splat, select, shuffle, conversions) work on ARM.

ARM-Specific Intrinsics

Dot Product (ARMv8.2-A, --dotprod)

Intrinsic | Signature | Description
vdot_i32(a, b) | (i8x16, i8x16) -> i32x4 | Signed dot product, groups of 4 per lane. Maps to NEON sdot.

I8MM Matrix Multiply (ARMv8.6-A, --i8mm)

Intrinsic | Signature | Description
smmla_i32(acc, a, b) | (i32x4, i8x16, i8x16) -> i32x4 | Signed x signed 2x8 x 8x2 matrix multiply-accumulate
ummla_i32(acc, a, b) | (i32x4, u8x16, u8x16) -> i32x4 | Unsigned x unsigned matrix multiply-accumulate
usmmla_i32(acc, a, b) | (i32x4, u8x16, i8x16) -> i32x4 | Unsigned x signed matrix multiply-accumulate

The accumulator is explicit. First call uses splat(0) for zero-init. Available on Cortex-A78+, Apple M1+.

Absolute Difference (base NEON)

Intrinsic | Signature | Description
abs_diff(a, b) | (T, T) -> T | Element-wise absolute difference. Returns the absolute value of a - b.

Supported types: i8x16, u8x16, i16x8, u16x8, i32x4, u32x4. Maps to NEON sabd/uabd (one instruction). No x86 equivalent -- use max(a .- b, b .- a) explicitly on x86.

Widening Multiply (base NEON)

Intrinsic | Signature | Description
wmul_i16(a, b) | (i8x8, i8x8) -> i16x8 | Signed widening multiply
wmul_u16(a, b) | (u8x8, u8x8) -> u16x8 | Unsigned widening multiply
wmul_i32(a, b) | (i16x4, i16x4) -> i32x4 | Signed widening multiply
wmul_u32(a, b) | (u16x4, u16x4) -> u32x4 | Unsigned widening multiply

Input types are 64-bit NEON vectors (D-registers). Output is 128-bit. Maps to NEON smull/umull. No x86 equivalent.

The --dotprod and --i8mm flags enable their respective extensions. Using an intrinsic without its flag produces a compile error with a hint to add it.

Cross-Platform Intrinsics

These intrinsics work on both x86 and ARM with identical semantics:

Intrinsic | Signature | x86 instruction | ARM instruction
shuffle_bytes(table, idx) | (u8x16, u8x16) -> u8x16 | SSSE3 pshufb | NEON tbl
sat_add(a, b) | (T, T) -> T | SSE2 padds/paddus | NEON sqadd/uqadd
sat_sub(a, b) | (T, T) -> T | SSE2 psubs/psubus | NEON sqsub/uqsub

sat_add/sat_sub support i8x16, u8x16, i16x8, u16x8. Signed vs unsigned saturation is determined by the element type. No feature flags required (base SSE2 and NEON).

Cross-Compilation

Compile for ARM from an x86 host:

ea kernel.ea --lib --target-triple=aarch64-unknown-linux-gnu

For intrinsics requiring hardware extensions, add the appropriate flag:

ea kernel.ea --lib --target-triple=aarch64-unknown-linux-gnu --dotprod

When compiling natively on an ARM machine, no target triple is needed; only the extension flags (--dotprod, --i8mm) are required for their respective intrinsics:

ea kernel.ea --lib
ea kernel.ea --lib --dotprod   # for vdot_i32
ea kernel.ea --lib --i8mm     # for smmla_i32, ummla_i32, usmmla_i32

The --avx512 flag is rejected on ARM targets with a compile error. The --i8mm and --dotprod flags are rejected on x86 targets.
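A build script can assemble these command lines and catch the rejected flag/target combinations before invoking the compiler. This is a sketch under assumptions: `ea_command` is a hypothetical helper, it only knows the flags named in this section, and the compiler's own error reporting remains the source of truth.

```python
import platform

ARM_ONLY_FLAGS = {"--dotprod", "--i8mm"}   # rejected on x86 targets
X86_ONLY_FLAGS = {"--avx512"}              # rejected on ARM targets


def ea_command(source, target_triple=None, flags=()):
    """Return the argv list for building `source` as a shared library."""
    # The architecture is the first component of the triple; without a
    # triple the build is native, so fall back to the host architecture.
    arch = target_triple.split("-")[0] if target_triple else platform.machine()
    is_arm = arch in ("aarch64", "arm64")
    for flag in flags:
        if is_arm and flag in X86_ONLY_FLAGS:
            raise ValueError(f"{flag} is not valid for ARM targets")
        if not is_arm and flag in ARM_ONLY_FLAGS:
            raise ValueError(f"{flag} is not valid for x86 targets")
    command = ["ea", source, "--lib"]
    if target_triple:
        command.append(f"--target-triple={target_triple}")
    command.extend(flags)
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)`.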

Writing Portable Kernels

Strategy 1: Use 128-bit types everywhere

Use f32x4, i32x4, etc. These work on both x86 (SSE) and ARM (NEON).

kernel scale(data: *mut f32, factor: f32) range(n) step(4) {
    let v: f32x4 = load(data, i);
    let f: f32x4 = splat(factor);
    store(data, i, v .* f);
}

This sacrifices throughput on x86 (which could use f32x8 with AVX2) but runs everywhere.

Strategy 2: Separate kernel files

Write platform-specific kernels in separate files and load the right one at runtime:

# kernel_x86.ea  -- uses f32x8 for AVX2
# kernel_arm.ea  -- uses f32x4 for NEON
import platform

import ea

# Pick the kernel file that matches the host architecture.
if platform.machine() == "aarch64":
    k = ea.load("kernel_arm.ea")
else:
    k = ea.load("kernel_x86.ea")

Both files export the same function signatures, so the calling code does not change.

Strategy 3: Use the kernel construct

The kernel construct with step(N) lets you pick the vector width per file while keeping the loop logic identical. Write two .ea files with different step sizes and vector types, but the same exported function name and parameters.

128-bit Pack and Round on NEON

For 128-bit NEON equivalents of the x86 AVX2 pack/round intrinsics, use the cross-platform 128-bit variants:

  • round_f32x4_i32x4: round-to-nearest-even f32x4 to i32x4. x86: cvtps2dq. ARM: fcvtns.
  • pack_sat_i32x4: saturating narrow (i32x4, i32x4) -> i16x8. x86: packssdw. ARM: sqxtn.
  • pack_sat_i16x8: saturating narrow (i16x8, i16x8) -> i8x16. x86: packsswb. ARM: sqxtn.

The 256-bit variants (round_f32x8_i32x8, pack_sat_i32x8, pack_sat_i16x16) are x86-only and produce a compile error on ARM.
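When testing a quantization kernel it helps to have a scalar model of what one lane goes through. The sketch below is a Python reference, not the Eä implementation: the function names are hypothetical, and it only models inputs that fit in i32 after rounding (out-of-range behavior of the hardware conversions is not modeled).

```python
def round_to_nearest_even(x):
    """Model of the rounding step: ties go to the even integer.

    Python's built-in round() on a float with no ndigits argument
    already rounds half to even and returns an int.
    """
    return round(x)


def saturate(value, bits):
    """Clamp an integer to the signed range of the given bit width."""
    low = -(1 << (bits - 1))
    high = (1 << (bits - 1)) - 1
    return max(low, min(high, value))


def quantize_lane(x):
    """f32 -> i32 (round) -> i16 (saturating narrow) -> i8 (saturating narrow)."""
    as_i32 = round_to_nearest_even(x)
    as_i16 = saturate(as_i32, 16)
    return saturate(as_i16, 8)
```

Comparing a kernel's output against `quantize_lane` on values such as 2.5, 3.5, and anything beyond the int8 range (which saturates to 127 or -128) covers the tie-breaking and saturation cases.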

ea bench

Builds an .ea kernel and a C harness, runs the harness pinned to one core (Linux), captures JSONL measurements, wraps them with environment metadata, and (when a baseline exists) diffs against it.

Usage

ea bench <manifest.toml> [--target=CPU] [--avx512|--fp16|--i8mm|--dotprod]
                         [--opt-level=N] [--update-baseline] [--no-diff]
                         [--out PATH]

A benchmark is a triple: an .ea kernel (--lib-compatible), a C harness that links against the kernel's shared library, and a manifest TOML that names them.

Manifest schema

*.bench.toml files use this flat schema:

key | type | required | meaning
name | string | yes | Benchmark name (also the kernel's lib name)
kernel | string (path) | yes | Path to .ea, relative to the manifest
harness | string (path) | yes | Path to the C harness
baseline | string (path) | yes | Path to the committed baseline JSON
arch | array of string | yes | "x86_64" and/or "aarch64": platforms where the kernel applies
ea_flags | array of string | no | Extra flags for the kernel build (e.g. ["--fp16"])
cc_flags | array of string | no | Extra flags for the harness build (default ["-O2"])

Unknown keys are an error. Paths resolve relative to the manifest's parent directory; absolute paths pass through unchanged.
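The schema rules can be expressed as a small validator. This is a Python sketch of the rules as described here, not the code ea bench runs: `validate_manifest` is a hypothetical name, and it takes an already-parsed dict (for example from a TOML parser) plus the manifest's path.

```python
from pathlib import Path

REQUIRED_KEYS = {"name": str, "kernel": str, "harness": str,
                 "baseline": str, "arch": list}
OPTIONAL_KEYS = {"ea_flags": list, "cc_flags": list}
KNOWN_ARCHS = {"x86_64", "aarch64"}


def validate_manifest(manifest, manifest_path):
    """Check keys and types, apply defaults, and resolve relative paths."""
    unknown = set(manifest) - set(REQUIRED_KEYS) - set(OPTIONAL_KEYS)
    if unknown:
        raise ValueError(f"unknown keys: {sorted(unknown)}")
    for key, expected in REQUIRED_KEYS.items():
        if key not in manifest:
            raise ValueError(f"missing required key: {key}")
    for key, expected in {**REQUIRED_KEYS, **OPTIONAL_KEYS}.items():
        if key in manifest and not isinstance(manifest[key], expected):
            raise ValueError(f"{key} must be of type {expected.__name__}")
    if not manifest["arch"] or not set(manifest["arch"]) <= KNOWN_ARCHS:
        raise ValueError("arch must list x86_64 and/or aarch64")

    resolved = dict(manifest)
    resolved.setdefault("ea_flags", [])
    resolved.setdefault("cc_flags", ["-O2"])
    # Joining with an absolute path keeps the absolute path unchanged,
    # which matches the pass-through rule above.
    parent = Path(manifest_path).parent
    for key in ("kernel", "harness", "baseline"):
        resolved[key] = str(parent / manifest[key])
    return resolved
```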

Harness contract

A harness is a normal C program that links against the kernel's shared library. It must:

  • Write JSONL measurements to stdout, one per line. Required keys: kernel (string), median_ns (integer). Optional: p10_ns, p90_ns, n_inner, n_runs. Any additional keys are passed through into the result JSON.
  • Send everything else to stderr. Banners, verify-OK / verify-FAIL messages, debug output, sink values. ea bench relays harness stderr with a [harness] prefix to its own stderr.
  • Exit 0 on success, non-zero on fatal error. ea bench propagates a non-zero exit.

Example measurement line:

{"kernel":"softmax_poly","median_ns":12345,"p10_ns":12200,"p90_ns":12600,"n_inner":200,"n_runs":10}

See benchmarks/v1.11.0/exp_poly_f32_harness.c for a complete template: deterministic LCG-filled input, warmup, median of N runs of M inner calls, volatile sink to defeat dead-code elimination, verify against a reference implementation, JSONL summary at the end.
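To check a harness's stdout against the contract before wiring it into ea bench, a few lines of Python are enough. This is a sketch with a hypothetical function name; skipping blank lines is an assumption, not documented ea bench behavior.

```python
import json


def parse_measurements(stdout_text):
    """Parse harness stdout as JSONL and enforce the required keys."""
    measurements = []
    for line_number, line in enumerate(stdout_text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        record = json.loads(line)
        if not isinstance(record.get("kernel"), str):
            raise ValueError(f"line {line_number}: 'kernel' must be a string")
        median = record.get("median_ns")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(median, int) or isinstance(median, bool):
            raise ValueError(f"line {line_number}: 'median_ns' must be an integer")
        measurements.append(record)  # extra keys are kept as-is
    return measurements
```

Feeding it the example measurement line above returns a one-element list whose record has `median_ns` equal to 12345.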

Output schema

ea bench emits a single JSON object (to stdout, or --out PATH):

{
  "schema_version": 1,
  "name": "exp_poly_f32",
  "eacompute_version": "1.13.0",
  "git_sha": "f2ca320",
  "timestamp": "2026-05-14T10:23:00Z",
  "env": {
    "os": "linux", "arch": "x86_64", "host_cpu": "znver4",
    "target_cpu": "native", "target_features": "+avx512f,+avx512vl,+avx512bw",
    "opt_level": 3, "pinned": true
  },
  "measurements": [ /* harness JSONL lines, re-emitted */ ]
}

field | source
schema_version | always 1 for this release
name | manifest name
eacompute_version | CARGO_PKG_VERSION of the running compiler
git_sha | git rev-parse --short HEAD at run time; null if not in a git tree
timestamp | ISO 8601 UTC
env.host_cpu | LLVM TargetMachine::get_host_cpu_name()
env.target_cpu, target_features, opt_level | resolved from CLI flags
env.pinned | true if taskset was used (Linux only)

Diff & baselines

If manifest.baseline exists, ea bench reads it and prints a per-kernel delta to stderr. The current regression threshold is 10%, warn-only in v1.13.0 — regressions print WARNING: but the process still exits 0.

exp_poly_f32 (x86_64, native, opt=3):
  exp_only_libm   12345 ns  (baseline 12200 ns, +1.2%)
  exp_only_poly    4321 ns  (baseline  4350 ns, -0.7%)
  softmax_libm    98765 ns  (baseline 96000 ns, +2.9%)
  softmax_poly    32100 ns  (baseline 31500 ns, +1.9%)
WARNING: 0 regressions exceed 10% threshold (warn-only in v1.13.0).

If the baseline's env.arch, env.target_features, or env.opt_level don't match the current run, the diff is skipped and stderr says baseline mismatch: .... Use --update-baseline to refresh.

If the baseline file doesn't exist (first run, or a freshly-added benchmark), stderr says no baseline yet — run with --update-baseline to create one.
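The delta and mismatch rules can be reproduced offline from two result files. The following Python sketch works on result objects shaped like the output schema; the function name is hypothetical, and skipping kernels that are absent from the baseline is an assumption rather than documented behavior.

```python
def diff_against_baseline(current, baseline, threshold_pct=10.0):
    """Return per-kernel deltas, or None when the environments do not match."""
    for key in ("arch", "target_features", "opt_level"):
        if current["env"][key] != baseline["env"][key]:
            return None  # corresponds to the "baseline mismatch" case

    baseline_ns = {m["kernel"]: m["median_ns"] for m in baseline["measurements"]}
    rows = []
    for m in current["measurements"]:
        base = baseline_ns.get(m["kernel"])
        if base is None:
            continue  # assumption: kernels without a baseline entry are skipped
        delta_pct = (m["median_ns"] - base) / base * 100.0
        rows.append({
            "kernel": m["kernel"],
            "median_ns": m["median_ns"],
            "baseline_ns": base,
            "delta_pct": round(delta_pct, 1),
            "regression": delta_pct > threshold_pct,
        })
    return rows
```

With the sample numbers above, 12345 ns against a 12200 ns baseline gives +1.2% and 4321 ns against 4350 ns gives -0.7%, neither of which crosses the 10% threshold.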

Adding a new benchmark

  1. Write a kernel foo_bench.ea whose export func ... functions are the measurement targets.
  2. Write a C harness foo_harness.c following the harness contract above.
  3. Write foo.bench.toml next to the kernel and harness.
  4. Capture the baseline:
    ea bench foo.bench.toml --update-baseline
    
  5. Commit kernel, harness, manifest, and baseline together.

Platform notes

  • Pinning is Linux-only. ea bench uses taskset -c 0 when available. Mac / Windows runs report pinned: false in the result JSON, and measurements will be noisier.
  • Cross-platform manifests use arch = ["x86_64", "aarch64"], and only when the kernel and harness compile and run identically on both platforms. Single-platform benchmarks use the matching single-element list.
  • The cc binary can be overridden via the CC environment variable.
  • Committed baselines reflect the maintainer's development host. CI runner measurements will diverge from these baselines — sometimes by 20%+ on libm-backed kernels — because of CPU model, microarchitecture, and shared-tenancy variance. In v1.13.0 the regression gate is warn-only specifically to collect this signal across one release before setting an enforced threshold. WARNING: N regressions exceed 10% threshold lines in CI artifacts are not (yet) actionable failures.

Eä vs NumPy

When does writing a kernel in Eä actually beat NumPy? The answer comes down to one thing: arithmetic intensity -- how many operations you perform per byte loaded from memory.

The memory bandwidth wall

Modern CPUs can process arithmetic far faster than they can load data from DRAM. A typical dual-channel DDR4 system sustains roughly 30-40 GB/s (a single DDR4-3200 channel peaks at 25.6 GB/s). NumPy's ufuncs are already compiled C with SIMD -- for simple operations, they saturate this bandwidth just like hand-written SIMD would.

Rule of thumb: if your operation does fewer than 2 arithmetic ops per element loaded, it is bandwidth-bound. Eä will match NumPy but not beat it.

Bandwidth-bound: Eä matches, no win

Array scaling

export func scale(src: *f32, dst: *mut f32, factor: f32, n: i32) {
    let f: f32x8 = splat(factor)
    let mut i: i32 = 0
    // Assumes n is a multiple of 8; otherwise finish with a scalar tail loop.
    while i < n {
        let v: f32x8 = load(src, i)
        store(dst, i, v .* f)
        i = i + 8
    }
}

NumPy equivalent:

dst = src * factor  # NumPy: one SIMD multiply per element, same speed

One multiply per element loaded. Both Eä and NumPy hit ~35 GB/s on typical hardware. No winner.

Simple element-wise ops

Any operation that loads an element, does one thing, and stores the result is bandwidth-bound:

  • dst[i] = src[i] + offset
  • dst[i] = abs(src[i])
  • dst[i] = src_a[i] + src_b[i]

NumPy already handles these at memory bandwidth speed. Writing an Eä kernel gains you nothing.

Compute-bound: Eä wins

Fused scale + bias + clamp

NumPy must make three separate passes over memory:

dst = np.clip(src * scale + bias, 0.0, 1.0)  # 3 temporaries, 3 passes

Each pass loads the full array, computes one operation, and writes a temporary. For a 100 MB array, that is 600 MB of memory traffic.

Eä fuses everything into a single pass:

export func fused_scale_bias_clamp(src: *f32, dst: *mut f32, scale: f32, bias: f32, n: i32) {
    let s: f32x8 = splat(scale)
    let b: f32x8 = splat(bias)
    let zero: f32x8 = splat(0.0)
    let one: f32x8 = splat(1.0)
    let mut i: i32 = 0
    while i < n {
        let v: f32x8 = load(src, i)
        let result: f32x8 = fma(v, s, b)
        let clamped: f32x8 = min(max(result, zero), one)
        store(dst, i, clamped)
        i = i + 8
    }
}

One load, one FMA, one max, one min, one store. The data stays in registers the entire time. The fused kernel is typically 3-5x faster than the NumPy version on large arrays: for the 100 MB example it moves about 200 MB (one read, one write) instead of 600 MB, and it allocates no temporaries.
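The traffic figures follow from a simple count: every pass reads one array-sized input and writes one array-sized output. A small Python sketch of that arithmetic, with a hypothetical helper name and ignoring cache effects:

```python
def memory_traffic_mb(array_mb, passes):
    """Bytes moved, in MB, when each pass reads and writes one full array."""
    return passes * 2 * array_mb


unfused = memory_traffic_mb(100, passes=3)  # multiply, add, clip as separate passes
fused = memory_traffic_mb(100, passes=1)    # one pass over the data
traffic_ratio = unfused / fused
```

Here `unfused` is 600, `fused` is 200, and `traffic_ratio` is 3.0, so memory traffic alone accounts for about a 3x difference on a bandwidth-bound machine.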

Stencil operations

Convolutions, Sobel filters, and any operation that reads multiple neighboring elements per output pixel have high arithmetic intensity. A 3x3 Sobel kernel reads 9 values and performs 9 multiplications plus 8 additions per output -- well above the compute-bound threshold.

See Image Processing for stencil patterns.

Custom reductions with branching

NumPy has no compact way to express per-element branching in a single vectorized pass. Operations like "accumulate values, but skip negatives and double values above a threshold" require Python-level loops or awkward np.where chains that allocate a temporary at every step.

Eä gives you SIMD comparisons and masked operations in a single loop body, keeping the pipeline full.

Dot products and FMA chains

Any reduction that multiplies and accumulates benefits from FMA (fused multiply-add) and register-level accumulation:

acc = fma(a, b, acc)  // a * b + acc in one instruction

NumPy's np.dot is fast for large matrices (it calls BLAS), but for custom reductions, per-row operations, or non-standard accumulation patterns, Eä's explicit FMA beats NumPy's element-wise approach.

See ML Preprocessing for dot product and similarity patterns.

Decision checklist

Before writing an Eä kernel, ask:

  1. How many ops per element? If just one (scale, offset, abs), stay with NumPy.
  2. Does NumPy need multiple passes? If your expression chains 3+ operations, Eä's fusion wins.
  3. Is there a stencil or neighbor access? High arithmetic intensity -- Eä wins.
  4. Is there branching logic per element? NumPy needs np.where chains and temporaries for this -- Eä wins.
  5. Is it a standard BLAS operation? Use NumPy/SciPy -- they call optimized libraries.

Real-world packages

These packages demonstrate Eä beating NumPy on compute-bound workloads:

  • easobel -- Sobel edge detection (stencil, ~9 ops/pixel)
  • eastat -- CSV parsing (branching SIMD scan)
  • eavec -- Vector similarity search (FMA-heavy dot products)

Image Processing

Image processing is one of Eä's strongest use cases. Stencil operations (convolution, edge detection, blur) read multiple neighbors per output pixel, giving high arithmetic intensity that keeps the CPU compute-bound rather than memory-bound.

Sobel edge detection

The Sobel filter computes horizontal and vertical gradients using 3x3 stencils. Each output pixel reads 9 input values and performs 9 multiplications plus additions -- well above the threshold where Eä beats NumPy.

The pattern: for each pixel, load the 3x3 neighborhood, multiply by Sobel coefficients, and sum:

export func sobel_x(
    src: *f32, dst: *mut f32,
    width: i32, height: i32
) {
    let neg1: f32x8 = splat(-1.0)
    let pos1: f32x8 = splat(1.0)
    let neg2: f32x8 = splat(-2.0)
    let pos2: f32x8 = splat(2.0)
    let mut y: i32 = 1
    while y < height - 1 {
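        let mut x: i32 = 0
        // Each iteration reads columns x .. x+9, so stop while x+9 is still inside the row.
        // Columns after the last full vector need a scalar pass (not shown).
        while x < width - 9 {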
            let row_above: i32 = (y - 1) * width + x
            let row_center: i32 = y * width + x
            let row_below: i32 = (y + 1) * width + x

            let tl: f32x8 = load(src, row_above)
            let tr: f32x8 = load(src, row_above + 2)
            let ml: f32x8 = load(src, row_center)
            let mr: f32x8 = load(src, row_center + 2)
            let bl: f32x8 = load(src, row_below)
            let br: f32x8 = load(src, row_below + 2)

            let gx: f32x8 = (tr .* pos1) .+ (mr .* pos2) .+ (br .* pos1)
                .+ (tl .* neg1) .+ (ml .* neg2) .+ (bl .* neg1)
            store(dst, row_center + 1, gx)
            x = x + 8
        }
        y = y + 1
    }
}

For a production-ready Sobel implementation with both Gx/Gy gradients and magnitude computation, see easobel.
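A scalar reference is useful for verifying a SIMD stencil on small images. The sketch below is a plain-Python model of the horizontal Sobel gradient on a row-major single-channel image. It is not taken from easobel; the function name is hypothetical, and border pixels are left at 0.0 by choice.

```python
def sobel_x_reference(src, width, height):
    """Horizontal Sobel gradient; src is a flat row-major list of floats."""
    dst = [0.0] * (width * height)
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            above = (y - 1) * width + x
            center = y * width + x
            below = (y + 1) * width + x
            # Right column weighted +1, +2, +1; left column weighted -1, -2, -1.
            right = src[above + 1] + 2.0 * src[center + 1] + src[below + 1]
            left = src[above - 1] + 2.0 * src[center - 1] + src[below - 1]
            dst[center] = right - left
    return dst
```

On a horizontal ramp where each pixel equals its column index, every interior output is 8.0; on a constant image every output is 0.0. Both make quick sanity checks for a kernel.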

Pixel pipeline: u8 to f32 and back

Images from disk arrive as u8 (0-255). SIMD math works best on f32. The typical pattern:

  1. Load u8 pixels
  2. Widen to f32 (0.0 to 255.0)
  3. Process in f32 (normalize, filter, blend)
  4. Narrow back to u8

Widening: u8 to f32

export func normalize_u8_to_f32(src: *u8, dst: *mut f32, n: i32) {
    let scale: f32x4 = splat(0.00392156862)
    let mut i: i32 = 0
    while i < n {
        let pixels: f32x4 = widen_u8_f32x4(src, i)
        let normalized: f32x4 = pixels .* scale
        store(dst, i, normalized)
        i = i + 4
    }
}

widen_u8_f32x4(ptr, offset) loads 4 bytes from ptr + offset, zero-extends each to 32 bits, and converts to float. The result is an f32x4 with values in 0.0 to 255.0. Multiply by 1/255 to get the 0.0-1.0 range.

Narrowing: f32 to u8

export func f32_to_u8(src: *f32, dst: *mut u8, n: i32) {
    let s255: f32x4 = splat(255.0)
    let zero: f32x4 = splat(0.0)
    let mut i: i32 = 0
    while i < n {
        let v: f32x4 = load(src, i)
        let clamped: f32x4 = min(max(v, zero), s255)
        narrow_f32x4_i8(dst, i, clamped)
        i = i + 4
    }
}

narrow_f32x4_i8(ptr, offset, vec) converts 4 floats to integers, saturates to 0-255, and stores 4 bytes. Always clamp before narrowing to avoid overflow.

Saturating arithmetic on u8 pixels

For operations that stay in u8 (brightness adjustment, blending), use sat_add and sat_sub instead of widening to f32. These clamp to 0-255 in a single instruction on both ARM (NEON) and x86 (SSE2):

export func brighten(src: *u8, dst: *mut u8, boost: u8, n: i32) {
    let b: u8x16 = splat(boost)
    let mut i: i32 = 0
    while i < n {
        let pixels: u8x16 = load_u8x16(src, i)
        let bright: u8x16 = sat_add(pixels, b)
        store(dst, i, bright)
        i = i + 16
    }
}

No widening, no clamping, no f32 intermediates. One load, one saturating add, one store. This also works with i8x16 (signed), i16x8, and u16x8.

Putting it together

A full pixel pipeline (load u8, process in f32, store u8) processes 4 pixels per iteration on both x86 and ARM. Use f32x4 for the pipeline to keep it portable -- f32x8 works only on x86 with AVX2.

Why image stencils are compute-bound

A 3x3 convolution on a single-channel image performs 9 multiplications and 8 additions per output pixel, but only produces 1 output value. That is 17 arithmetic operations per output float -- far above the ~2 ops/element threshold where Eä's operation fusion matters.

For multi-channel images (RGB, RGBA), the arithmetic intensity is even higher because you process 3-4 channels per pixel position.

Compare this to simple brightness adjustment (pixel * 1.1), which is 1 op per element -- bandwidth-bound, and NumPy handles it just as fast. See Eä vs NumPy for more on this distinction.

Frame differencing with abs_diff (ARM)

On ARM, abs_diff computes per-pixel absolute difference in a single NEON instruction. Useful for motion detection and video anomaly detection:

export func frame_diff(a: *u8, b: *u8, dst: *mut u8, n: i32) {
    let mut i: i32 = 0
    while i < n {
        let fa: u8x16 = load_u8x16(a, i)
        let fb: u8x16 = load_u8x16(b, i)
        let diff: u8x16 = abs_diff(fa, fb)
        store(dst, i, diff)
        i = i + 16
    }
}

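abs_diff supports i8x16, u8x16, i16x8, u16x8, i32x4, u32x4. ARM-only -- on x86, compute max(a, b) .- min(a, b) explicitly. (max(a .- b, b .- a) fails on unsigned types, where one of the two differences wraps around whenever a and b differ.)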
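A scalar model makes the unsigned case concrete. This Python sketch treats one u8 lane and assumes that lane-wise subtraction wraps modulo 256, as the underlying SIMD subtract instructions do; the function names are hypothetical.

```python
def wrapping_sub_u8(a, b):
    """u8 subtraction that wraps modulo 256, like a SIMD byte subtract."""
    return (a - b) & 0xFF


def abs_diff_u8(a, b):
    """Absolute difference via max/min: the subtraction never goes negative."""
    return wrapping_sub_u8(max(a, b), min(a, b))


def max_of_differences_u8(a, b):
    """max(a - b, b - a) with wrapping subtraction, shown for comparison."""
    return max(wrapping_sub_u8(a, b), wrapping_sub_u8(b, a))
```

For a = 5 and b = 3, `abs_diff_u8` returns 2 while `max_of_differences_u8` returns 254, because 3 - 5 wraps to 254 and wins the max.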

Tips

  • Border handling: the examples above skip border pixels (starting at y=1, ending at height-1). For production kernels, handle borders separately with scalar code or clamped indexing.
  • Separable filters: Gaussian blur and similar filters can be split into horizontal and vertical passes, reducing a 3x3 stencil from 9 to 6 operations. Each pass is still compute-bound.
  • ARM portability: use f32x4 and i32x4 for kernels that need to run on both x86 and ARM. The 128-bit types work on both architectures.
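The separable-filter tip can be verified for the Sobel kernel itself: the 3x3 horizontal Sobel matrix is the outer product of a vertical smoothing vector and a horizontal difference vector. A short Python sketch, with illustrative names:

```python
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]
SMOOTH = [1, 2, 1]    # applied down each column (vertical pass)
DIFF = [-1, 0, 1]     # applied along each row (horizontal pass)


def outer_product(column, row):
    """Build the 2-D kernel that a vertical pass followed by a horizontal pass applies."""
    return [[c * r for r in row] for c in column]


def multiplies_per_pixel(size, separable):
    """Multiplications per output pixel for a size x size stencil."""
    return 2 * size if separable else size * size
```

`outer_product(SMOOTH, DIFF)` reproduces `SOBEL_X`, and the multiply count drops from 9 to 6 for a 3x3 stencil.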

Text Processing

Text processing benefits from SIMD when you can skip large regions of uninteresting bytes. The key pattern is chunk-skip: load 32 bytes at a time, check if any byte matches a target character, and skip the entire chunk if none match.

The chunk-skip pattern

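Most bytes in a text file are not the character you are looking for. A newline scanner over a 1 MB file might find 10,000 newlines among 1,000,000 bytes, so 99% of bytes are not newlines. With one newline per 100 bytes, about two thirds of the 31,250 32-byte chunks contain no match and can be skipped with a single comparison and branch; rarer targets skip far more.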

export func count_newlines(data: *u8, n: i32) -> i32 {
    let newline: u8x32 = splat_u8x32(10)
    let mut count: i32 = 0
    let mut i: i32 = 0
    while i < n - 31 {
        let chunk: u8x32 = load_u8x32(data, i)
        let matches: u8x32 = chunk .== newline
        let mask: i32 = movemask(matches)
        if mask != 0 {
            let mut j: i32 = 0
            while j < 32 {
                if load_u8(data, i + j) == 10 {
                    count = count + 1
                }
                j = j + 1
            }
        }
        i = i + 32
    }
    let mut k: i32 = i
    while k < n {
        if load_u8(data, k) == 10 {
            count = count + 1
        }
        k = k + 1
    }
    count
}

The structure:

  1. Load 32 bytes as u8x32
  2. Compare with splat_u8x32(target) using .== to get a match vector
  3. movemask() collapses the vector comparison to a single i32 bitmask
  4. If mask is 0: no matches in this chunk, skip ahead 32 bytes (fast path)
  5. If mask is nonzero: scan the 32 bytes individually (slow path, but rare)
  6. Scalar tail: handle remaining bytes that do not fill a full 32-byte chunk

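The fast path processes 32 bytes in about 3 instructions. How many chunks take it depends on how rare the target byte is: for a target that appears once every few hundred to few thousand bytes, 90-99% of chunks take the fast path.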
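The control flow is easy to model in Python to see how often the fast path is taken on a given input. This sketch mirrors the structure of the kernel (full chunks, then a scalar tail) but is not a performance model; the function name is hypothetical.

```python
def count_byte_chunked(data, target, chunk=32):
    """Count occurrences of `target` and report how many chunks were skipped."""
    count = 0
    skipped_chunks = 0
    i = 0
    while i + chunk <= len(data):
        block = data[i:i + chunk]
        if target in block:            # slow path: the chunk has at least one match
            count += block.count(target)
        else:                          # fast path: nothing to do for this chunk
            skipped_chunks += 1
        i += chunk
    count += data[i:].count(target)    # scalar tail for the last len(data) % chunk bytes
    return count, skipped_chunks


# 100 lines of 99 characters plus a newline: one newline per 100 bytes.
sample = (b"x" * 99 + b"\n") * 100
newlines, skipped = count_byte_chunked(sample, ord("\n"))
```

For this sample, `newlines` is 100 and `skipped` is 213 out of 312 full chunks, so roughly two thirds of the chunks take the fast path.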

Why not extract individual bit positions?

Eä has element-wise vector shift operators (.<<, .>>), but movemask returns a scalar i32 bitmask. Extracting individual bit positions from a scalar requires scalar bitwise operations (&, >>) which work on i32 values. When a chunk contains matches, fall back to byte-by-byte scanning within that 32-byte window. This is still fast because:

  • Hot chunks are rare (most text is not your target character)
  • The 32-byte scan is a tight loop that fits in L1 cache
  • The overall speedup comes from skipping cold chunks, not from optimizing hot ones

ARM portability

movemask is an x86-only intrinsic (it maps to vpmovmskb). On ARM/NEON, there is no equivalent single instruction. For portable kernels, write separate x86 and ARM versions:

  • x86: use u8x32 + movemask as shown above
  • ARM: use u8x16 with scalar fallback, or structure the algorithm to avoid needing a bitmask

See ARM / NEON for details on architecture-specific intrinsics.

CSV parsing

CSV parsing combines the chunk-skip pattern with state tracking. You need to find commas and newlines while respecting quoted fields -- a comma inside quotes is not a delimiter.

The approach:

  1. Scan for quote characters (") using chunk-skip to track quote state
  2. Scan for delimiters (, and \n) using chunk-skip, filtering by quote state
  3. Parse numeric fields from the delimited regions

This is compute-bound because each byte potentially involves comparison against multiple target characters plus state logic -- exactly the kind of branching work that NumPy cannot vectorize.
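The state tracking in steps 1 and 2 is easiest to see in scalar form. The following Python sketch is a byte-at-a-time reference for finding delimiters outside quoted fields. It is not the eastat implementation: the function name is hypothetical, and it treats every quote byte as a toggle (which also handles doubled quotes inside a quoted field).

```python
QUOTE = ord('"')
COMMA = ord(",")
NEWLINE = ord("\n")


def find_delimiters(data):
    """Return the offsets of commas and newlines that are not inside quotes."""
    in_quotes = False
    positions = []
    for offset, byte in enumerate(data):
        if byte == QUOTE:
            in_quotes = not in_quotes          # step 1: track quote state
        elif not in_quotes and byte in (COMMA, NEWLINE):
            positions.append(offset)           # step 2: delimiter outside quotes
    return positions
```

For the input `b'a,"b,c",d\n'` it returns `[1, 7, 9]`: the comma at offset 4 sits inside quotes and is not reported. A SIMD version can be validated against this reference on small inputs.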

For a complete CSV statistics package built on this pattern, see eastat.

Tips

  • Buffer alignment: SIMD loads from unaligned addresses work but may be slower on older hardware. If you control the buffer, align to 32 bytes.
  • Tail handling: always include a scalar loop for the last n % 32 bytes. Forgetting the tail is the most common bug in SIMD text processing.
  • Multi-character search: to find any of several characters (e.g., <, >, & for HTML), do multiple .== comparisons and combine with .| before the movemask.

ML Preprocessing

ML preprocessing pipelines often apply the same sequence of operations to every element in large arrays: normalize, scale, compute similarities. These are natural fits for Eä kernels because they fuse multiple operations into single-pass loops.

Normalize: zero-mean, unit-variance

The standard normalization (x - mean) / std requires two operations per element. NumPy computes them as separate passes:

normalized = (data - mean) / std  # 2 passes, 2 temporaries

Eä fuses this into one pass using FMA. Rewrite (x - mean) / std as x * (1/std) + (-mean/std):

export func normalize(
    src: *f32, dst: *mut f32,
    inv_std: f32, neg_mean_div_std: f32,
    n: i32
) {
    let s: f32x8 = splat(inv_std)
    let b: f32x8 = splat(neg_mean_div_std)
    let mut i: i32 = 0
    while i < n {
        let v: f32x8 = load(src, i)
        let result: f32x8 = fma(v, s, b)
        store(dst, i, result)
        i = i + 8
    }
}

The caller precomputes inv_std = 1.0 / std and neg_mean_div_std = -mean / std in Python. The kernel then does a single FMA per 8 elements -- one instruction that multiplies and adds simultaneously.
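The algebraic rewrite can be checked numerically before trusting it in a kernel. A small Python sketch with hypothetical function names follows; note that plain Python arithmetic rounds after the multiply, whereas a hardware FMA rounds once, so the results agree to within floating-point tolerance rather than bit for bit.

```python
import math


def normalize_direct(x, mean, std):
    """The textbook form: (x - mean) / std."""
    return (x - mean) / std


def normalize_fused(x, inv_std, neg_mean_div_std):
    """The multiply-add form used by the kernel: x * (1/std) + (-mean/std)."""
    return x * inv_std + neg_mean_div_std


mean, std = 3.0, 2.0
inv_std = 1.0 / std               # precomputed by the caller
neg_mean_div_std = -mean / std    # precomputed by the caller
matches = all(
    math.isclose(normalize_fused(x, inv_std, neg_mean_div_std),
                 normalize_direct(x, mean, std),
                 rel_tol=1e-12, abs_tol=1e-12)
    for x in (-1.0, 0.0, 3.0, 7.5, 1000.25)
)
```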

Dot product: dual-accumulator pattern

A naive dot product uses one accumulator:

let mut acc: f32x8 = splat(0.0)
while i < n {
    let a: f32x8 = load(x, i)
    let b: f32x8 = load(y, i)
    acc = fma(a, b, acc)
    i = i + 8
}

This leaves performance on the table. FMA has a latency of 4-5 cycles but a throughput of 1 per cycle. With one accumulator, each FMA waits for the previous one to finish.

Dual accumulators hide this latency by interleaving independent FMA chains:

export func dot_product(x: *f32, y: *f32, out: *mut f32, n: i32) {
    let mut acc0: f32x8 = splat(0.0)
    let mut acc1: f32x8 = splat(0.0)
    let mut i: i32 = 0
    while i < n - 15 {
        let a0: f32x8 = load(x, i)
        let b0: f32x8 = load(y, i)
        acc0 = fma(a0, b0, acc0)
        let a1: f32x8 = load(x, i + 8)
        let b1: f32x8 = load(y, i + 8)
        acc1 = fma(a1, b1, acc1)
        i = i + 16
    }
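    // At most one more 8-wide step, when 8 to 15 elements remain.
    while i < n - 7 {
        let a: f32x8 = load(x, i)
        let b: f32x8 = load(y, i)
        acc0 = fma(a, b, acc0)
        i = i + 8
    }
    let combined: f32x8 = acc0 .+ acc1
    let mut result: f32 = horizontal_sum(combined)
    // Scalar tail for the last n % 8 elements.
    while i < n {
        result = result + x[i] * y[i]
        i = i + 1
    }
    // Write the single scalar result.
    out[0] = result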
}

The CPU can execute fma(a0, b0, acc0) and fma(a1, b1, acc1) in parallel because they use different accumulators. This typically doubles throughput on modern CPUs.
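A back-of-the-envelope model shows why extra accumulators help and when they stop helping. This Python sketch is a deliberate simplification (it ignores loads, loop overhead, and memory bandwidth), and the latency and issue-rate numbers passed in are illustrative inputs, not measurements.

```python
def fma_cycles(num_fmas, latency, issue_per_cycle, accumulators):
    """Estimate cycles for a reduction made of dependent FMA chains.

    Each accumulator forms one dependency chain that can start a new FMA
    only every `latency` cycles, so the sustained rate is capped both by
    the number of chains and by the issue rate of the FMA units.
    """
    sustained_rate = min(issue_per_cycle, accumulators / latency)
    return num_fmas / sustained_rate


one_chain = fma_cycles(1000, latency=4, issue_per_cycle=1, accumulators=1)
two_chains = fma_cycles(1000, latency=4, issue_per_cycle=1, accumulators=2)
four_chains = fma_cycles(1000, latency=4, issue_per_cycle=1, accumulators=4)
```

With a 4-cycle latency and one FMA issued per cycle, the estimates are 4000, 2000, and 1000 cycles: two accumulators halve the time, and in this model four would be needed to reach the issue-rate limit.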

Cosine similarity

Cosine similarity needs three reductions in one pass: dot(a, b), norm(a), and norm(b). Computing these separately means three passes over the data. Eä fuses all three:

export func cosine_similarity(a: *f32, b: *f32, out: *mut f32, n: i32) {
    let mut dot_acc: f32x8 = splat(0.0)
    let mut norm_a_acc: f32x8 = splat(0.0)
    let mut norm_b_acc: f32x8 = splat(0.0)
    let mut i: i32 = 0
    // Assumes n is a multiple of 8; otherwise finish with a scalar tail loop.
    while i < n {
        let va: f32x8 = load(a, i)
        let vb: f32x8 = load(b, i)
        dot_acc = fma(va, vb, dot_acc)
        norm_a_acc = fma(va, va, norm_a_acc)
        norm_b_acc = fma(vb, vb, norm_b_acc)
        i = i + 8
    }
    let dot_val: f32 = horizontal_sum(dot_acc)
    let norm_a_val: f32 = horizontal_sum(norm_a_acc)
    let norm_b_val: f32 = horizontal_sum(norm_b_acc)
    let result: f32 = dot_val / sqrt(norm_a_val * norm_b_val)
    // Write the single scalar result.
    out[0] = result
}

Three FMAs per loop iteration, all operating on the same loaded vectors. The data passes through cache once. NumPy would need np.dot(a, b), np.linalg.norm(a), and np.linalg.norm(b) -- three separate passes.
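A scalar reference with the same single-pass structure is handy for testing the kernel on small vectors. This is a plain-Python sketch with a hypothetical function name; it accumulates in double precision, so expect small differences from an f32 kernel.

```python
import math


def cosine_similarity_reference(a, b):
    """Accumulate dot(a, b), |a|^2 and |b|^2 in one pass over the data."""
    dot = 0.0
    norm_a_sq = 0.0
    norm_b_sq = 0.0
    for x, y in zip(a, b):
        dot += x * y
        norm_a_sq += x * x
        norm_b_sq += y * y
    return dot / math.sqrt(norm_a_sq * norm_b_sq)
```

Orthogonal vectors such as `[1.0, 0.0]` and `[0.0, 1.0]` give 0.0, and parallel vectors such as `[1.0, 2.0, 3.0]` and `[2.0, 4.0, 6.0]` give 1.0.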

Int8 quantized inference (ARM)

For quantized ML models using int8 weights, ARM provides dedicated matrix multiply instructions. On ARM with --i8mm (ARMv8.6-A, Cortex-A78+, Apple M1+):

export func matmul_i8_block(
    acc: *mut i32, activations: *i8, weights: *i8, n: i32
) {
    let mut a: i32x4 = splat(0)
    let mut i: i32 = 0
    while i < n {
        let act: i8x16 = load(activations, i)
        let wgt: i8x16 = load(weights, i)
        a = smmla_i32(a, act, wgt)
        i = i + 16
    }
    store(acc, 0, a)
}

smmla_i32 performs a 2x8 x 8x2 signed matrix multiply-accumulate in one instruction. Also available: ummla_i32 (unsigned x unsigned) and usmmla_i32 (unsigned activations x signed weights, the most common ML pattern).
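To check the data layout a kernel feeds into these intrinsics, a scalar model of the multiply-accumulate helps. The sketch below follows the layout of the Arm SMMLA instruction: both byte vectors hold two rows of eight values, and the four i32 lanes form a 2x2 row-major result. That `smmla_i32` uses exactly this layout is an assumption based on the instruction it is named after, and the function name is hypothetical.

```python
def smmla_reference(acc, a, b):
    """acc: 4 ints (2x2, row-major); a, b: 16 signed bytes as 2 rows of 8.

    result[i][j] = acc[i][j] + sum over k of a[i][k] * b[j][k]
    """
    out = list(acc)
    for i in range(2):
        for j in range(2):
            out[2 * i + j] += sum(a[8 * i + k] * b[8 * j + k] for k in range(8))
    return out


a = [1] * 8 + [2] * 8      # row 0 is all ones, row 1 is all twos
b = [1] * 8 + [3] * 8      # row 0 is all ones, row 1 is all threes
result = smmla_reference([0, 0, 0, 0], a, b)
```

Here `result` is `[8, 24, 16, 48]`: each lane is the dot product of one row of `a` with one row of `b`.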

For older ARM chips without I8MM, use vdot_i32 (requires --dotprod, ARMv8.2-A) for 4-way dot products, or wmul_i16/wmul_i32 for widening multiplies.

Fused Quantization Pipeline

Convert 32 float activations to int8 in 7 instructions, then feed directly to maddubs_i32:

kernel quantize_dot(activations: *const f32, weights: *const u8, out: *mut i32, inv_scale: f32, n: i32) {
    let f0: f32x8 = load(activations, i * 32);
    let f1: f32x8 = load(activations, i * 32 + 8);
    let f2: f32x8 = load(activations, i * 32 + 16);
    let f3: f32x8 = load(activations, i * 32 + 24);

    let scale: f32x8 = splat(inv_scale);
    let i0: i32x8 = round_f32x8_i32x8(f0 .* scale);
    let i1: i32x8 = round_f32x8_i32x8(f1 .* scale);
    let i2: i32x8 = round_f32x8_i32x8(f2 .* scale);
    let i3: i32x8 = round_f32x8_i32x8(f3 .* scale);

    let s01: i16x16 = pack_sat_i32x8(i0, i1);
    let s23: i16x16 = pack_sat_i32x8(i2, i3);
    let quant: i8x32 = pack_sat_i16x16(s01, s23);

    let w: u8x32 = load(weights, i * 32);
    let dot: i32x8 = maddubs_i32(w, quant);
    store(out, i * 8, dot);
}

Pipeline: 4x round + 2x pack_i32 + 1x pack_i16 = 7 instructions for 32 floats to 32 int8. Then 1x maddubs_i32 = 8 total.

Fused Quantization Pipeline (ARM)

The 256-bit intrinsics above are x86-only. On ARM, use the 128-bit cross-platform variants to process 16 floats at a time:

kernel quantize_dot_arm(activations: *const f32, weights: *const i8, out: *mut i32, inv_scale: f32, n: i32) {
    let f0: f32x4 = load(activations, i * 16);
    let f1: f32x4 = load(activations, i * 16 + 4);
    let f2: f32x4 = load(activations, i * 16 + 8);
    let f3: f32x4 = load(activations, i * 16 + 12);

    let scale: f32x4 = splat(inv_scale);
    let i0: i32x4 = round_f32x4_i32x4(f0 .* scale);
    let i1: i32x4 = round_f32x4_i32x4(f1 .* scale);
    let i2: i32x4 = round_f32x4_i32x4(f2 .* scale);
    let i3: i32x4 = round_f32x4_i32x4(f3 .* scale);

    let s01: i16x8 = pack_sat_i32x4(i0, i1);
    let s23: i16x8 = pack_sat_i32x4(i2, i3);
    let quant: i8x16 = pack_sat_i16x8(s01, s23);

    let w: i8x16 = load(weights, i * 16);
    let dot: i32x4 = vdot_i32(quant, w);
    store(out, i * 4, dot);
}

Compile with --dotprod for vdot_i32. The round_f32x4_i32x4, pack_sat_i32x4, and pack_sat_i16x8 intrinsics are cross-platform and require no extra flags.

Batch operations

For ML workloads, you often apply the same operation to many rows. The Python side handles the loop over rows, calling the Eä kernel for each:

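import ea
import numpy as np

lib = ea.load("similarity.ea")
result = np.empty(1, dtype=np.float32)  # one-element output buffer
scores = np.empty((num_queries, num_rows), dtype=np.float32)  # num_rows: database size (illustrative name)
for i in range(num_queries):
    for j in range(num_rows):
        lib.cosine_similarity(query[i], database[j], result, dim)
        scores[i, j] = result[0]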

The kernel handles the inner loop (over vector dimensions) with SIMD. The outer loop (over queries or rows) stays in Python. This is the right split -- the inner loop is where SIMD matters.

Real-world package

eavec implements vector similarity search using the dual-accumulator FMA pattern. It computes cosine similarity across a database of vectors, returning the top-k most similar results.

Native f16 Inference

LLM weights ship as f16. KV cache rows are f16. Attention reads f16 every step. Until v1.11.0, every one of those reads paid for a cvt_f16_f32 into an f32 register before any compute could happen — the storage was narrow, but the SIMD math was wide. On Pi 5 (Cortex-A76 with FEAT_FP16 silicon) that round-trip is now optional. Compile with --fp16 and f16x8 arithmetic lowers straight to NEON's fadd v.8h / fmul v.8h / fmla v.8h — half the register pressure, no cvt, no widening.

This is opt-in. Existing code that uses cvt_f16_f32 / cvt_f32_f16 keeps working unchanged — they're stable on any AArch64. --fp16 is the "compute stays in half-precision the whole way" mode for hot paths that load f16, multiply f16, and write f16 back.

Where the round-trip hurts

In an LLM attention kernel, every token's Q ⋅ Kᵀ inner product loads a row of cached keys. Storage is f16 — that's the whole point of a half KV cache. Compute is f32. The original portable shape looks like this:

// Before: f32 round-trip on every f16 load.
// Storage is f16, compute is f32, every load pays the cvt cost.
export func rmsnorm_via_f32(x: *i16, scale: f32, out: *mut i16) {
    let v_h: i16x4 = load(x, 0)
    let v: f32x4 = cvt_f16_f32(v_h)
    let sq: f32x4 = v .* v
    let sum: f32 = reduce_add(sq)
    let n: f32 = 4.0
    let eps: f32 = 0.000001
    let denom: f32 = sqrt(sum / n + eps)
    let inv: f32 = 1.0 / denom
    let inv_v: f32x4 = splat(inv * scale)
    let r: f32x4 = v .* inv_v
    let r_h: i16x4 = cvt_f32_f16(r)
    store(out, 0, r_h)
}

(Half-precision values are addressed as i16 here because the cvt-based path predates the f16 type — cvt_f16_f32(i16x4) is the documented bit-level entry point.)

That kernel works on any AArch64. But every cvt_f16_f32 is real work — a fcvtl v.4s, v.4h instruction, plus the data spends the whole compute pass in 128-bit f32 registers, halving how many lanes fit and doubling register pressure on FMA chains.
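On the Python side, the i16 view of half-precision data is just a reinterpretation of the same two bytes. The standard library's struct module supports IEEE 754 binary16 through the "e" format, which makes the bit-level relationship easy to inspect; the helper names below are illustrative.

```python
import struct


def f16_bits(value):
    """Encode a float as binary16 and reinterpret the two bytes as a signed i16."""
    return struct.unpack("<h", struct.pack("<e", value))[0]


def f16_from_bits(bits):
    """Reinterpret a signed i16 bit pattern as a binary16 float."""
    return struct.unpack("<e", struct.pack("<h", bits))[0]


one_bits = f16_bits(1.0)         # 0x3C00
minus_two_bits = f16_bits(-2.0)  # 0xC000, which is negative when read as i16
tenth = f16_from_bits(f16_bits(0.1))
```

`one_bits` is 15360 (0x3C00), `minus_two_bits` is -16384, and `tenth` differs slightly from 0.1 because binary16 keeps only 11 significant bits.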

What --fp16 changes

With the new f16 scalar / f16x4 / f16x8 types and --fp16 enabled, the same shape stays in half-precision end to end:

// Native f16 — no f32 round-trip on the per-element path.
// Storage is f16; compute is f16 except for the single scalar sqrt
// (no useful f16 scalar sqrt on Cortex-A76; the round-trip is one
// operation, not N).
export func rmsnorm_f16(x: *f16, scale: f16, out: *mut f16) {
    let v: f16x8 = load(x, 0)
    let sq: f16x8 = v .* v
    let sum: f16 = reduce_add(sq)

    // Scalar branch: sum/N + eps, sqrt, reciprocal. Done in f32 because
    // there is no single-lane f16 sqrt instruction on Cortex-A76.
    let n: f16 = to_f16(8.0)
    let mean_h: f16 = sum / n
    let mean_f: f32 = to_f32(mean_h)
    let eps: f32 = 0.000001
    let inv_f: f32 = 1.0 / sqrt(mean_f + eps)
    let inv_h: f16 = to_f16(inv_f)

    // Back to f16x8 for the per-element multiply.
    let inv_v: f16x8 = splat(inv_h)
    let scale_v: f16x8 = splat(scale)
    let r: f16x8 = v .* inv_v .* scale_v
    store(out, 0, r)
}

Build with:

ea rmsnorm.ea --lib --fp16 --target-triple=aarch64-unknown-linux-gnu

The vector multiplies and the reduction now lower to fmul v.8h and faddv h, v.8h directly — the LLVM IR has <8 x half> everywhere on the per-lane path, and fpext only appears around the scalar sqrt. Eight lanes fit in a single Q register instead of eight lanes spread across two f32 registers, so chained FMAs hit twice the per-cycle throughput on the A76's pipelines.
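To check a kernel like rmsnorm_f16 from the host, it can help to have a slow reference that rounds at the same points. The following is a minimal Python sketch, assuming that IEEE half rounding through the struct module's "e" format is a fair stand-in for the hardware. The helper names (to_f16, to_f32, rmsnorm_f16_reference) are illustrative and not part of Eä's API, and the sequential sum may differ from the hardware reduction in the last bit.

```python
import math
import struct


def to_f16(x: float) -> float:
    """Round to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_f32(x: float) -> float:
    """Round to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def rmsnorm_f16_reference(x, scale):
    """Host-side model of rmsnorm_f16 for exactly 8 inputs (assumed shape)."""
    assert len(x) == 8
    v = [to_f16(e) for e in x]
    sq = [to_f16(a * a) for a in v]            # per-lane f16 multiply

    # Sequential f16 sum. The hardware reduction may add lanes in a
    # different order, so the last bit can differ from this model.
    total = 0.0
    for e in sq:
        total = to_f16(total + e)
    mean_h = to_f16(total / 8.0)

    # Scalar branch in f32, mirroring the kernel's single round-trip.
    s = to_f32(to_f32(mean_h) + to_f32(0.000001))
    inv_f = to_f32(1.0 / to_f32(math.sqrt(s)))
    inv_h = to_f16(inv_f)

    # Back to f16 for the per-element multiplies.
    scale_h = to_f16(scale)
    return [to_f16(to_f16(a * inv_h) * scale_h) for a in v]


ones = rmsnorm_f16_reference([1.0] * 8, 1.0)
halves = rmsnorm_f16_reference([2.0] * 8, 0.5)
```

Comparing the kernel's output against a model like this with a small tolerance, rather than exact equality, leaves room for the reduction-order difference noted in the comments.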

Attention dot-product

The same idea applies to the inner product over the KV row. The version that predates --fp16 converted on every load; the native form fuses without the cvt:

// Attention dot product over an f16 KV cache row.
// Before --fp16: each f16x8 load became cvt_f16_f32 to an f32x8 register
// before any compute. With --fp16, the cvt is gone — fma runs directly
// on <8 x half> registers.
export func attn_dot_f16(q: *f16, k: *f16, n: i32) -> f16 {
    let mut acc: f16x8 = splat(to_f16(0.0))
    let mut i: i32 = 0
    while i < n {
        let qv: f16x8 = load(q, i)
        let kv: f16x8 = load(k, i)
        acc = fma(qv, kv, acc)
        i = i + 8
    }
    return reduce_add(acc)
}

Each loop iteration is two ld1 v.8h loads and one fmla v.8h, with no cvt. The load-to-FMA chain stays in 128-bit Q registers the whole way.

When to reach for it

Use native f16 when:

  • Storage is f16 and compute is dominated by element-wise multiply-add (KV cache reads in attention, weight-times-activation in MLP rows, RMSNorm scaling, RoPE rotation).
  • The target hardware actually has FEAT_FP16 (Cortex-A76 / A78 / X1+, Apple M-series, Neoverse V1+). On hardware without it, --fp16 errors before codegen — there's no silent fallback.
  • You measure first. The win comes from register pressure and SIMD width, not from the cvt instruction itself; bandwidth-bound shapes that already saturate memory won't speed up.

Keep the f32 round-trip when:

  • A single scalar sqrt / divide / reciprocal lives on the critical path. The cvt is one instruction; f32's better-conditioned scalar math is worth more than the cvt costs. The rmsnorm_f16 example above does exactly this for the per-batch sqrt.
  • The kernel needs to ship to non-FEAT_FP16 hardware. The cvt_f16_f32 / cvt_f32_f16 path is the portable form and does not go away in v1.11.0 — it works with or without --fp16.

Performance

Quantitative numbers ship with Phase 6 of the v1.11.0 audit (post-merge). The architectural story is straightforward: the cvt is gone from the load-to-compute path and the lane count per register doubles from 4 to 8, so attention / RMSNorm / RoPE hot paths fall to roughly the same per-instruction shape as a hand-written vfmlal_low_f16 chain. Olorin's gemma4 inference uses this path under --fp16. See the spec for the design rationale.

See also

  • End-to-end test fixture: tests/data/rmsnorm_f16.ea
  • ARM FP16 test suite: tests/phase14_arm_fp16.rs

Fast Transcendentals: exp_poly_f32

Eä's exp() intrinsic calls libm via llvm.exp.v*f32. No current ISA offers a hardware vector exp for LLVM to lower to: not AVX2, not AVX-512, not NEON, not SVE. So exp(f32x8) always lowers to a loop of eight sequential expf calls. The SIMD pattern in the source is cosmetic: throughput stays at one element per scalar libm call.

In a softmax loop or a tanh-GELU activation, that scalarization is the entire kernel. exp_poly_f32(f32xN) -> f32xN (new in v1.11.0) replaces it with a polynomial that stays in SIMD registers — seven to eight FMAs per lane, no libm call, no scalarization.

The throughput win depends on the baseline:

  • Modern x86 with a fast expf in glibc (AMD Zen 4 / glibc 2.42 on our reference benchmark): 2.93× isolated, 2.60× inside softmax.
  • Pi 5 (ARM Cortex-A76, glibc expf, no libmvec) measured inside a real GELU kernel in Olorin: 2.23× end-to-end on gemma4_gelu, consistent across shapes from 64 to 12288 lanes.
  • Older libm or scalar-only environments without a vectorized expf: the gap widens — the spec's original "~10×" headline holds against the slowest baselines but does not match every modern glibc.

The win is real on every baseline measured; the magnitude is environment-sensitive (glibc 2.42's expf is faster than the spec assumed, which compresses the headline).

The contract

exp_poly_f32 is not a drop-in replacement for exp(). It trades two things for the throughput:

  1. Bounded input range. It is defined on [-50, 50] per lane. Outside that range the polynomial diverges — no NaN or Inf guarantees, no clamping. The caller clamps if their inputs may exceed.
  2. Bounded accuracy. Relative error is ≤ 2⁻¹⁸ (~3.8e-6) inside the safe range. That's enough for softmax (normalize-by-sum absorbs the error) and for tanh-GELU activations. It is not enough for anything that needs full f32 precision — keep exp() for those.

The name encodes the trade. Reading exp_poly_f32 at a call site, you know it's a polynomial on f32 lanes — same explicitness style as pack_sat_i32x8, widen_u8_i32x4, cvt_f16_f32.
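To make the trade concrete, here is a minimal Python sketch of how a range-reduced polynomial exp can work. It is an illustration under stated assumptions, not Eä's implementation: the reduction x = k·ln2 + r, the degree-7 Taylor coefficients, and the name exp_poly are all chosen for this example, and it runs in double precision rather than f32.

```python
import math

LN2 = math.log(2.0)


def exp_poly(x: float) -> float:
    """Range-reduced polynomial exp (illustrative, not Eä's coefficients)."""
    k = round(x / LN2)            # x = k*ln2 + r with |r| <= ln2/2
    r = x - k * LN2
    # Degree-7 Taylor polynomial for e^r in Horner form: 7 multiply-adds.
    p = 1.0 / 5040.0
    for c in (1.0 / 720.0, 1.0 / 120.0, 1.0 / 24.0, 1.0 / 6.0, 0.5, 1.0, 1.0):
        p = p * r + c
    return math.ldexp(p, k)       # multiply by 2^k


# Measure the worst relative error on a grid over the contract range.
grid = [i / 100.0 for i in range(-5000, 5001)]
worst = max(abs(exp_poly(x) - math.exp(x)) / math.exp(x) for x in grid)
print(f"worst relative error on [-50, 50]: {worst:.3e}")
```

The Horner loop is the part that maps onto an FMA chain per lane. A production polynomial would typically use minimax coefficients and f32 arithmetic, which is where a bound like 2⁻¹⁸ comes from.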

Softmax

The canonical use case is numerically-stable softmax over a small fixed window. The maximum subtract keeps inputs in a tight range, exp is the dominant cost, and the final normalize absorbs the polynomial's relative error.

// Stable softmax over 8 lanes using exp_poly_f32.
// reduce_max keeps inputs in a tight range so we never approach the
// [-50, 50] contract edge — the test fixture uses values 1..8 and the
// shifted range is [-7, 0].
export func softmax(x: *f32, out: *mut f32) {
    let v: f32x8 = load(x, 0)
    let mx: f32 = reduce_max(v)
    let mxv: f32x8 = splat(mx)
    let shifted: f32x8 = v .- mxv
    let ev: f32x8 = exp_poly_f32(shifted)
    let s: f32 = reduce_add(ev)
    let inv: f32 = 1.0 / s
    let invv: f32x8 = splat(inv)
    let r: f32x8 = ev .* invv
    store(out, 0, r)
}

The full integration test for this shape (test_exp_poly_f32_softmax_integration in tests/phase14_exp_poly.rs) compares against expf-based reference softmax with relative tolerance 1e-3 (~2⁻¹⁰) — the polynomial's 2⁻¹⁸ error compounds through reduce_add, but the normalize-by-sum still collapses it to a tolerance that's well below softmax's typical numerical envelope.
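The way normalization absorbs the per-lane error can be checked with a small host-side model. This is a sketch in Python, assuming a worst-case pattern where lanes alternate between +2⁻¹⁸ and -2⁻¹⁸ relative error; softmax_reference and the perturbation scheme are illustrative and are not taken from the test suite.

```python
import math

EPS = 2.0 ** -18   # per-lane relative error bound quoted for exp_poly_f32


def softmax_reference(x, rel_err=None):
    """Stable softmax; rel_err optionally perturbs each exp() lane."""
    mx = max(x)
    shifted = [v - mx for v in x]
    ev = [math.exp(v) for v in shifted]
    if rel_err is not None:
        ev = [e * (1.0 + d) for e, d in zip(ev, rel_err)]
    s = sum(ev)
    return shifted, [e / s for e in ev]


x = [float(i) for i in range(1, 9)]          # fixture-style inputs 1..8
shifted, exact = softmax_reference(x)

# Alternate +EPS / -EPS across lanes as a pessimistic error pattern.
signs = [EPS if i % 2 == 0 else -EPS for i in range(8)]
_, approx = softmax_reference(x, signs)
worst = max(abs(a - e) / e for a, e in zip(approx, exact))
```

Under this model the output error stays within roughly twice the per-lane bound, far inside the 1e-3 tolerance the integration test uses.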

tanh-GELU via tanh_approx_f32

The GELU activation gelu(x) ≈ 0.5 · x · (1 + tanh(c · (x + 0.044715 · x³))) with c = sqrt(2/π) is the canonical use of a fast vector tanh. tanh_approx_f32(f32xN) -> f32xN (new in v1.14.0) lowers to a rational P(x²) · x / Q(x²) minimax-tuned for f32, branchless and bounded to ~3e-7 absolute error across the body, with internal clamping at ±9 handling saturation:

// GELU(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
export func gelu_tanh(input: *f32, output: *mut f32) {
    let x: f32x8 = load(input, 0)

    // inner = sqrt(2/pi) * (x + 0.044715 * x^3)
    let c0: f32x8 = splat(0.7978845608)
    let c1: f32x8 = splat(0.044715)
    let xsq: f32x8 = x .* x
    let xcube: f32x8 = xsq .* x
    let inner: f32x8 = c0 .* fma(c1, xcube, x)

    // tanh_approx_f32 clamps internally; no defensive bound needed.
    let t: f32x8 = tanh_approx_f32(inner)

    // gelu = 0.5 * x * (1 + t)
    let half: f32x8 = splat(0.5)
    let one: f32x8 = splat(1.0)
    let result: f32x8 = half .* x .* (one .+ t)
    store(output, 0, result)
}

This replaces the earlier (exp_poly_f32(2x) - 1) / (exp_poly_f32(2x) + 1) algebraic-identity workaround. The workaround suffers catastrophic cancellation in the numerator for small |x| (e^{2·1e-3} - 1 loses ~3 digits of precision in f32), degrading to ~10⁻³ relative error near zero where GELU is most sensitive. The dedicated intrinsic skips both that pitfall and the second arithmetic chain, while costing one fdiv instead of one fdiv plus the exp_poly_f32 polynomial.
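The size of that cancellation can be estimated with a few lines of arithmetic. The sketch below is a Python model, assuming the only error source is a 2⁻¹⁸ relative error on the exponential; tanh_via_exp is a hypothetical helper written for this example.

```python
import math

EPS = 2.0 ** -18   # relative error bound quoted for exp_poly_f32


def tanh_via_exp(x: float, rel_err: float = 0.0) -> float:
    """tanh through (e^{2x} - 1) / (e^{2x} + 1) with a perturbed exp."""
    e = math.exp(2.0 * x) * (1.0 + rel_err)
    return (e - 1.0) / (e + 1.0)


x = 1e-3
exact = math.tanh(x)
approx = tanh_via_exp(x, EPS)
rel = abs(approx - exact) / exact
amplification = rel / EPS      # roughly 1 / (2x) for small |x|
print(f"relative error {rel:.2e}, amplified {amplification:.0f}x")
```

For small |x| the numerator e^{2x} - 1 is about 2x, so an error of relative size ε on the exponential becomes roughly ε / (2x) on the result. At x = 1e-3 that is a factor of about 500, which is how a 2⁻¹⁸ input error lands in the 10⁻³ range.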

Safely clamping in-range

exp_poly_f32's contract is bounded on [-50, 50] per lane, so the safest defensive pattern for unbounded inputs is the max-then-min clamp shown below:

let hi: f32x8 = splat(50.0)
let lo: f32x8 = splat(-50.0)
let safe: f32x8 = min(max(unbounded, lo), hi)
let e: f32x8 = exp_poly_f32(safe)

Both min and max lower to single NEON fminnm / fmaxnm (or x86 vminps / vmaxps) instructions, so the clamp adds two instructions of overhead to a polynomial that's already 7-8 FMAs per lane.

If you can prove statically that your input range is bounded — softmax after a reduce_max subtract, for instance — skip the clamp. The guarantee is mathematical, not enforced; the polynomial does not check.

When to keep exp()

exp_poly_f32 is the right tool for SIMD hot paths where 2⁻¹⁸ relative error is acceptable. Examples: softmax, GELU, attention scoring, exponential decay in beam search.

Keep exp() (which calls libm) for:

  • Scalar code, or vector code where the SIMD scalarization isn't visible (rare loops, one-shot computations).
  • Code that needs full f32 precision — physics, statistics, log-sum-exp inside an accuracy-critical reduction.
  • Inputs that may exceed [-50, 50] and a libm-correct NaN/Inf is required.

See also

  • Test fixtures (accuracy, range, softmax integration): tests/phase14_exp_poly.rs

NEON Gather Workaround

gather(p: *f32, idx: i32xN) -> f32xN lowers to AVX2's vgatherdps on x86 — one instruction, four (or eight) indexed loads, one result vector. On AArch64 NEON there is no such instruction. Pi 5's Cortex-A76 has no SVE / SVE2 either, so there's no ld1w {z0.s}, p0/z, [x1, z1.s, sxtw] fallback. A real gather is hardware-impossible on this target.

Pre-v1.11.0, Eä errored on gather() for ARM with a message that said "use a scalar loop on ARM" — true but unhelpful, and the de-facto workaround everyone arrived at (a stack-buffer round-trip through memory) was strictly worse than the canonical scalar-compose pattern already used by llama.cpp's IQ kernels.

v1.11.0 ships two new vector-construction intrinsics — f32x4_from_scalars and f32x8_from_scalars — and rewrites the ARM gather() error to point at them by name plus docs/idioms/neon-gather.md. The pattern that used to be "figure it out yourself" is now a one-line compose with a documented idiom.

The IQ3 motivation

Olorin's IQ3_S / IQ3_XXS dequant kernels need exactly this. Each block holds packed 3-bit indices into a 256-entry lookup table of dequantized f32 values. The hot loop reads four indices, fetches four LUT entries, and writes them out as an f32x4 vector.

On x86 the kernel is one line:

// AVX2 gather: works on x86 but errors on ARM with a pointer to f32x{4,8}_from_scalars.
export func lut_dequant_x86(lut: *f32, idx: i32x4, out: *mut f32) {
    let v: f32x4 = gather(lut, idx)
    store(out, 0, v)
}

That kernel hard-errors on AArch64 at codegen time with:

gather has no NEON equivalent on ARM. Use scalar load_u32 + f32x4_from_scalars (or f32x8_from_scalars) to compose the result explicitly. See docs/idioms/neon-gather.md for the canonical pattern.

The compose pattern

The workaround is to read the four (or eight) values one at a time and build the vector explicitly. The new intrinsic does the building:

// IQ3 LUT dequant pattern: gather four entries from a lookup table.
export func iq3_lut_dequant_lane(lut: *f32, indices: *i32, out: *mut f32) {
    let i0: i32 = indices[0]
    let i1: i32 = indices[1]
    let i2: i32 = indices[2]
    let i3: i32 = indices[3]
    let v0: f32 = lut[i0]
    let v1: f32 = lut[i1]
    let v2: f32 = lut[i2]
    let v3: f32 = lut[i3]
    let v: f32x4 = f32x4_from_scalars(v0, v1, v2, v3)
    store(out, 0, v)
}

The same shape works for an 8-wide gather over an i32x8 index vector via f32x8_from_scalars(...) — same idea, eight scalar loads instead of four.

For the real per-row loop, the four scalar reads sit visibly inside an ordinary while loop:

// IQ3 LUT dequant row: walk an index array, fetch 4 LUT entries at a time,
// store into the output stream. The four scalar loads sit visibly in the
// source — the programmer sees the cost.
export func iq3_dequant_row(lut: *f32, indices: *i32, out: *mut f32, n: i32) {
    let mut i: i32 = 0
    while i < n {
        let v0: f32 = lut[indices[i]]
        let v1: f32 = lut[indices[i + 1]]
        let v2: f32 = lut[indices[i + 2]]
        let v3: f32 = lut[indices[i + 3]]
        let v: f32x4 = f32x4_from_scalars(v0, v1, v2, v3)
        store(out, i, v)
        i = i + 4
    }
}

Build with --target-triple=aarch64-unknown-linux-gnu. LLVM's NEON lowering folds the insertelement chain inside f32x4_from_scalars into a sequence of ins v.s[i] instructions — the same code GCC and Clang produce for vsetq_lane_f32 chains. No stack buffer, no load-and-shuffle dance.
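A host-side model makes it easy to confirm that the compose pattern returns what a hardware gather would. The following Python sketch is only a reference for testing; gather, from_scalars and dequant_row are stand-in names, and it assumes (as the kernel above appears to) that n is a multiple of 4.

```python
def gather(table, idx):
    """What a hardware gather returns: one table entry per index lane."""
    return [table[i] for i in idx]


def from_scalars(*lanes):
    """Stand-in for f32x4_from_scalars / f32x8_from_scalars."""
    return list(lanes)


def dequant_row(lut, indices, n):
    """Model of iq3_dequant_row; assumes n is a multiple of 4."""
    out = [0.0] * n
    i = 0
    while i < n:
        # Four explicit scalar table reads, composed into one vector.
        v = from_scalars(
            lut[indices[i]],
            lut[indices[i + 1]],
            lut[indices[i + 2]],
            lut[indices[i + 3]],
        )
        out[i:i + 4] = v
        i += 4
    return out


lut = [k * 0.5 for k in range(256)]          # toy 256-entry table
indices = [3, 0, 255, 7, 1, 1, 2, 200]
row = dequant_row(lut, indices, len(indices))
```

If a caller cannot guarantee a multiple of 4, a scalar tail loop after the main loop is one way to handle the remainder.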

Why no silent fallback?

Eä's design rule is "the programmer sees the cost." A silent scalar fallback for gather() on ARM would hide a 4-8× slowdown behind syntax that looks like SIMD. The compose pattern keeps the cost explicit: four scalar reads appear as four scalar reads in the source. A profiler shows four ldr s0, [x1, x2, lsl #2] instructions, not a deceptively single-line gather that hides them. A reviewer reading the diff sees the loads. Future you reading the source in six months sees the loads. There is no hidden performance cliff.

This is the same philosophy that gates --fp16 behind a flag and errors on f16x8 arithmetic without it: explicit cost beats convenient syntax that masks slowdown.

When SVE2 lands

If your ARM target is Apple M4, Graviton 3+, or Snapdragon X, SVE2 does have a real gather (ld1w with a scatter-gather addressing mode). Eä's SVE2 codegen is deferred to a future release. Until then, the compose pattern is the universal AArch64 workaround: it works on Pi 5 today, and it stays correct on M4 / Graviton even after a hardware-gather lowering lands (it just gets superseded by the better path on that target).

See also

  • Terse reference idiom: docs/idioms/neon-gather.md
  • Test coverage: tests/phase14_arm_neon.rs (test_f32x4_from_scalars, test_f32x8_from_scalars, test_gather_on_arm_points_to_compose)

Beating NumPy's BLAS at Constant-Q Transform with 80 Lines of Eä

A SIMD kernel compiler, a DFT nobody asked for, and an honest benchmark.


The Problem

Audio spectrum visualizers typically use FFT. FFT gives you linearly-spaced frequency bins — great for math, terrible for music. Human pitch perception is logarithmic: the distance from C2 to C3 (one octave) matters as much as C5 to C6. But FFT allocates equal resolution across the entire spectrum, wasting detail in the bass and over-resolving the treble.

The Constant-Q Transform (CQT) solves this. Each frequency bin gets its own window length — long windows for low frequencies (more cycles needed to resolve pitch), short windows for high frequencies. The result: 84 bins spanning 7 octaves, one per semitone, C2 through B8.

The catch: FFT can't do CQT. You need a DFT-per-bin, which is O(n*k) instead of O(n log n). The standard NumPy approach is to build a complex kernel matrix and let BLAS handle it with a single matrix-vector multiply.

We wanted to see if a hand-written SIMD kernel in Eä could beat that.

What is Eä

Eä is a compute kernel compiler. You write .ea files in a C-like language with explicit SIMD vector types, compile them to native shared libraries, and call them from Python with NumPy arrays. No C toolchain, no Cython, no JIT warmup.

export func scale(src: *f32, dst: *mut f32, factor: f32, n: i32) {
    let s: f32x8 = splat(factor)
    let mut i: i32 = 0
    while i < n {
        let v: f32x8 = load(src, i)
        store(dst, i, v .* s)
        i = i + 8
    }
}

Eä doesn't have sin or cos intrinsics. This is a deliberate design choice — trig is a policy decision (how many polynomial terms? what accuracy? what range?) that the language refuses to hide behind a simple-looking function call. If you need trig, you precompute tables in Python or write an explicit FMA polynomial chain.

This turns out to be the key insight for CQT.

The Design

A CQT with 84 bins (7 octaves, 12 semitones each) starting at C2 (65.4 Hz) at 44.1 kHz sample rate has these properties:

Bin   Note   Frequency   Window Length
0     C2     65.4 Hz     11,339 samples
33    A4     440 Hz      1,684 samples
83    B8     7,903 Hz    94 samples

The quality factor Q = 16.8 (constant across all bins — that's what "constant-Q" means). Total work per frame: 200,476 FMA operations across all bins.

The approach: precompute cos/sin twiddle factor tables in Python (one-time cost, 1.6 MB), then let Eä handle the per-frame FMA loop. We bake the Hann window directly into the twiddle factors during precomputation:

# Python: precompute once
for k in range(n_bins):
    n_k = int(lengths[k])
    i = np.arange(n_k, dtype=np.float64)
    window = 0.5 * (1 - np.cos(2 * np.pi * i / n_k))
    angle = 2 * np.pi * freqs[k] * i / SAMPLE_RATE
    cos_parts.append((window * np.cos(angle)).astype(np.float32))
    sin_parts.append((window * -np.sin(angle)).astype(np.float32))

This means the Eä kernel needs zero trig and zero windowing logic. The inner loop is pure FMA.
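The kernel later indexes the twiddle tables through a per-bin offsets array that the precompute snippet does not show. One plausible layout, sketched here in Python as an assumption, is to concatenate the per-bin tables and use an exclusive prefix sum of the window lengths as the offsets; build_offsets is a hypothetical helper name.

```python
from itertools import accumulate


def build_offsets(lengths):
    """Exclusive prefix sum: bin k's twiddles start at offsets[k]."""
    offsets = list(accumulate(lengths, initial=0))[:-1]
    total = sum(lengths)          # number of entries in each flat table
    return offsets, total


# Toy example with three bins of 5, 3 and 2 samples.
offsets, total = build_offsets([5, 3, 2])
print(offsets, total)             # [0, 5, 8] 10
```

With this layout the flat cos and sin tables each hold sum(lengths) float32 values, and bin k occupies the slice [offsets[k], offsets[k] + lengths[k]).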

The Kernel

Version 1: Direct DFT with Dual Accumulators

The first kernel processes one frequency bin at a time. For each bin, it reads the audio segment, multiplies against precomputed cos and sin twiddle factors, accumulates, and computes the magnitude. Smoothing (exponential decay for the "falling bars" effect) is fused into the output — no separate kernel call, no intermediate array.

Two independent FMA accumulator chains (r0/r1, i0/i1) hide pipeline latency on superscalar CPUs. This is the same technique used in high-performance reduction kernels.

export func cqt_fused(
    audio: *f32,
    cos_table: *f32,
    sin_table: *f32,
    offsets: *i32,      // per-bin start offset into twiddle tables
    lengths: *i32,      // per-bin window length
    prev_mags: *f32,
    out_mags: *mut f32,
    alpha: f32,         // decay factor
    n_bins: i32,
    max_window_len: i32
) {
    let mut k: i32 = 0
    while k < n_bins {
        let off: i32 = offsets[k]
        let win_len: i32 = lengths[k]
        let audio_start: i32 = max_window_len - win_len

        let mut r0: f32x8 = splat(0.0)
        let mut r1: f32x8 = splat(0.0)
        let mut i0: f32x8 = splat(0.0)
        let mut i1: f32x8 = splat(0.0)

        let mut i: i32 = 0
        while i + 16 <= win_len {
            prefetch(cos_table, off + i + 64)
            prefetch(sin_table, off + i + 64)

            let s0: f32x8 = load(audio, audio_start + i)
            let s1: f32x8 = load(audio, audio_start + i + 8)
            let c0: f32x8 = load(cos_table, off + i)
            let c1: f32x8 = load(cos_table, off + i + 8)
            let sn0: f32x8 = load(sin_table, off + i)
            let sn1: f32x8 = load(sin_table, off + i + 8)

            r0 = fma(s0, c0, r0)
            r1 = fma(s1, c1, r1)
            i0 = fma(s0, sn0, i0)
            i1 = fma(s1, sn1, i1)

            i = i + 16
        }

        // 8-element and scalar tails omitted for brevity

        let real: f32 = reduce_add(r0 .+ r1)
        let imag: f32 = reduce_add(i0 .+ i1)
        let mag: f32 = sqrt(real * real + imag * imag)

        // Fused smooth decay
        let decayed: f32 = prev_mags[k] * alpha
        if mag > decayed {
            out_mags[k] = mag
        } else {
            out_mags[k] = decayed
        }

        k = k + 1
    }
}

The generated assembly for the hot loop is clean — the Eä compiler folds twiddle loads directly into vfmadd231ps memory operands, so the audio loads (vmovups) are the only explicit loads:

.LBB0_4:
    prefetcht0  (%rsi,%r12,4)      ; prefetch cos_table
    prefetcht0  (%rdx,%r12,4)      ; prefetch sin_table
    vmovups     (%rdi,%r12,4), %ymm5   ; load audio[i..i+7]
    vmovups     (%rdi,%r12,4), %ymm6   ; load audio[i+8..i+15]
    vfmadd231ps (%rsi,%r12,4), %ymm5, %ymm3  ; r0 += audio * cos
    vfmadd231ps (%rsi,%rbp,4), %ymm6, %ymm4  ; r1 += audio * cos
    vfmadd231ps (%rdx,%r12,4), %ymm5, %ymm2  ; i0 += audio * sin
    vfmadd231ps (%rdx,%rbp,4), %ymm6, %ymm1  ; i1 += audio * sin

Four FMAs per iteration, each operating on 8 floats: 32 multiply-adds, or 64 floating-point operations, per loop iteration.
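When checking the SIMD kernel, a slow scalar model of the same arithmetic is a convenient oracle. The sketch below is plain Python written for this purpose, not code from the project; cqt_fused_reference is an illustrative name, and it follows the kernel's conventions of right-aligning each window in the audio buffer and fusing the max-with-decay smoothing.

```python
import math


def cqt_fused_reference(audio, cos_table, sin_table, offsets, lengths,
                        prev_mags, alpha, max_window_len):
    """Scalar model of cqt_fused: per-bin DFT, magnitude, fused decay."""
    out = []
    for k in range(len(lengths)):
        off, win_len = offsets[k], lengths[k]
        start = max_window_len - win_len          # window is right-aligned
        real = sum(audio[start + i] * cos_table[off + i] for i in range(win_len))
        imag = sum(audio[start + i] * sin_table[off + i] for i in range(win_len))
        mag = math.sqrt(real * real + imag * imag)
        out.append(max(mag, prev_mags[k] * alpha))  # smooth decay
    return out


# Toy input: two bins with window lengths 4 and 2.
audio = [1.0, 2.0, 3.0, 4.0]
cos_table = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
sin_table = [0.0, 0.0, 0.0, 0.0, 1.0, -1.0]
mags = cqt_fused_reference(audio, cos_table, sin_table,
                           offsets=[0, 4], lengths=[4, 2],
                           prev_mags=[30.0, 2.0], alpha=0.5,
                           max_window_len=4)
```

In the toy input the first bin is dominated by the decayed previous value, and the second by the freshly computed magnitude, so both branches of the smoothing are exercised.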

First Benchmark: The Lie

Our first benchmark compared this against a Python for-loop CQT:

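def numpy_cqt(audio, freqs, max_window_len):
    # Assumes `import numpy as np` and module-level N_BINS, Q, SAMPLE_RATE.
    magnitudes = np.empty(N_BINS, dtype=np.float32)
    for k in range(N_BINS):
        n_k = int(np.ceil(Q * SAMPLE_RATE / freqs[k]))
        start = max_window_len - n_k   # right-align the window, as the kernel does
        i = np.arange(n_k)             # sample index within the window
        segment = audio[start:start + n_k]
        window = 0.5 * (1 - np.cos(2 * np.pi * i / n_k))
        kernel = window * np.exp(-2j * np.pi * freqs[k] * i / SAMPLE_RATE)
        magnitudes[k] = np.abs(np.sum(segment * kernel))
    return magnitudes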

Result: 153x faster. We almost shipped this number.

The problem: this "NumPy CQT" is a Python for-loop with per-bin allocations. Nobody would write production CQT this way. The honest NumPy approach is to precompute a (84, 11339) complex kernel matrix and do a single np.dot:

# Precompute once: (n_bins, max_window_len) complex64 matrix
kernel_matrix = np.zeros((N_BINS, max_window_len), dtype=np.complex64)
for k in range(N_BINS):
    # ... fill in windowed complex exponentials, zero-pad shorter bins

# Per frame: single BLAS call
magnitudes = np.abs(kernel_matrix @ audio)

This calls directly into OpenBLAS/MKL — decades of hand-tuned assembly for matrix-vector multiply.

Honest Benchmark v1

Methodology:

  • All competitors precompute everything they can (tables, windows, indices)
  • Only per-frame work is timed: transform + magnitude + smoothing
  • Same smoothing (max(current, previous * alpha)) applied to all

With N=500 iterations and inadequate warmup:

Implementation   Time
Eä CQT           0.098 ms
NumPy matmul     0.085 ms

NumPy was winning. The 153x headline collapsed to Eä being 1.1x slower.

Fixing the Benchmark

The N=500 result was noisy. WSL2 scheduling and insufficient warmup made the numbers unreliable. We moved to a min-of-trials methodology: 10 independent trials of 2,000 iterations each, reporting both min (best-case, no OS interference) and median (real-world scheduling).
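The min-of-trials methodology is simple to express as a harness. The following Python sketch uses the trial, iteration and warmup counts from the text as defaults; the function name bench and its exact structure are assumptions for illustration, not the project's benchmark script.

```python
import statistics
import time


def bench(fn, trials=10, iters=2000, warmup=200):
    """Return (min, median) seconds per call across independent trials."""
    per_call = []
    for _ in range(trials):
        for _ in range(warmup):          # warm caches before timing
            fn()
        t0 = time.perf_counter()
        for _ in range(iters):
            fn()
        per_call.append((time.perf_counter() - t0) / iters)
    return min(per_call), statistics.median(per_call)


# Example with a stand-in workload and reduced counts.
best, typical = bench(lambda: sum(range(100)), trials=3, iters=50, warmup=5)
print(f"min {best * 1e6:.2f} us, median {typical * 1e6:.2f} us")
```

The min approximates the run with the least OS interference, while the median reflects what a caller sees under normal scheduling, which is why both are reported.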

With proper warmup (200+ iterations before timing):

Implementation   Min     Median
Eä CQT           50 us   55 us
NumPy matmul     95 us   190 us

Eä was already 1.9x faster at best, 3.5x at median. The initial "1.1x slower" was just noise.

Optimization Attempts

Quad Accumulators

Theory: 4 independent FMA chains (8 accumulator registers) should hide more pipeline latency than 2 chains.

Result: slower (100 us vs 83 us). With 8 YMM accumulator registers + 12 data registers for loads, we exceed x86-64's 16 YMM register limit. The compiler spills to stack memory, killing the very latency hiding we wanted.

Interleaved Loads

Theory: load cos, immediately FMA, then load sin and FMA. Give the memory subsystem more time to fetch by interleaving loads with compute.

Result: slower (191 us vs 53 us). The original pattern — batch all loads, then batch all FMAs — is better for out-of-order execution. The CPU's reorder buffer can schedule loads far ahead of their consumers when they're grouped together.

Merged Twiddle Table

Theory: interleave cos and sin data for each bin into a single contiguous memory region [cos_bin0 | sin_bin0 | cos_bin1 | ...] to halve cache line traffic.

Result: no change (53.8 us vs 53.1 us). The hardware prefetcher already handles two sequential streams efficiently. The second stream (sin_table) gets prefetched in parallel with the first (cos_table) because the access pattern is identical — sequential scan with the same stride.

In-Place Smoothing

Instead of writing to out_mags and copying back to prev_mags in Python, the kernel reads and writes the same buffer:

export func cqt_inplace(
    audio: *f32,
    cos_table: *f32,
    sin_table: *f32,
    offsets: *i32,
    lengths: *i32,
    mags: *mut f32,    // read previous, write smoothed result
    alpha: f32,
    n_bins: i32,
    max_window_len: i32
)

Result: ~5% faster (47.8 us vs 50.2 us). Modest win from eliminating a 336-byte numpy copy per frame plus one Python function call.

Final Numbers

Methodology: 10 trials of 2,000 iterations each, 200-iteration warmup per trial, min-of-trials reported. All precomputation excluded. All competitors include smoothing. NumPy uses pre-allocated output buffers (out= parameter) to avoid allocation overhead.

Implementation             Min       Median
Eä CQT (fused SIMD)        ~50 us    ~55 us
NumPy CQT (BLAS matmul)    ~95 us    ~190 us
NumPy FFT (wrong result)   ~106 us   ~118 us

Eä vs BLAS: 1.9x faster (min), 3.5x (median)
Eä vs FFT: 2.1x faster, and FFT gives the wrong answer
Throughput: 7.8 GFLOP/s

Why Eä Wins

1. Eä reads 4.8x less data. The BLAS matmul uses a dense (84, 11339) complex64 matrix: 84 × 11,339 × 8 bytes ≈ 7.6 MB per frame, including all the zero-padded regions where shorter bins have no data. Eä's variable-length inner loops skip zeros entirely, touching only 200,476 real elements in each of the cos and sin tables (1.6 MB in total).

2. Real vs complex arithmetic. BLAS operates on complex64 (pairs of float32). Every BLAS FMA processes a complex multiply-add — 4 real multiplies and 2 real adds per element. Eä works directly in float32, doing 2 real FMAs per element (one for the cos component, one for sin).

3. Fused pipeline. Eä does window + transform + magnitude + smoothing in one function call. NumPy requires 4 separate calls: matmul, abs, maximum, copyto. Each call crosses the Python/C boundary, allocates or writes to buffers, and makes a separate pass over the output data.

4. Lower scheduling jitter. One ctypes call has less interrupt surface than four NumPy calls. This explains the median gap (3.5x) being larger than the min gap (1.9x) — Eä's single kernel call is less likely to be interrupted mid-computation.

Why Eä Doesn't Win More

The dual-accumulator inner loop is already close to optimal for AVX2. The generated assembly has 4 vfmadd231ps instructions per iteration with memory operands — the compiler is folding loads into FMAs. There's no instruction waste.

The bottleneck is memory bandwidth. At 50 us for 1.6 MB of twiddle table reads, we're pulling ~32 GB/s — respectable for L3 cache, but below the theoretical memory bandwidth. The per-bin overhead (loading offsets/lengths, setting up accumulators, horizontal reduction) fragments what could be one continuous stream into 84 separate sweeps.
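The data-volume and bandwidth figures in this part of the write-up follow from a few multiplications, which can be rechecked in Python. The inputs below are the numbers quoted in the text; decimal megabytes and gigabytes are assumed, and the variable names are illustrative.

```python
N_BINS = 84
MAX_WINDOW = 11_339
TOTAL_TAPS = 200_476        # sum of per-bin window lengths, from the text
FRAME_TIME_S = 50e-6        # ~50 us per frame

dense_bytes = N_BINS * MAX_WINDOW * 8      # complex64 is 8 bytes per entry
sparse_bytes = TOTAL_TAPS * 2 * 4          # cos + sin tables, float32
ratio = dense_bytes / sparse_bytes
bandwidth_gb_s = sparse_bytes / FRAME_TIME_S / 1e9

print(f"dense  {dense_bytes / 1e6:.1f} MB")
print(f"sparse {sparse_bytes / 1e6:.1f} MB")
print(f"ratio  {ratio:.2f}x, bandwidth {bandwidth_gb_s:.1f} GB/s")
```

Under these assumptions the dense matrix comes to about 7.6 MB, the twiddle tables to about 1.6 MB, and the ratio to roughly 4.75x, in line with the "4.8x less data" and "~32 GB/s" figures.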

BLAS doesn't have this problem. Its matmul is one uninterrupted sweep through a dense matrix, which modern CPUs and prefetchers are specifically optimized for. BLAS trades compute (touching zeros) for access pattern regularity — and on some runs, when the OS cooperates perfectly, that trade nearly pays off.

The Code

Three files. The full kernel (cqt.ea, 80 lines), the Python visualizer, and the benchmark. The kernel uses the language as intended: explicit types, explicit vector widths, explicit memory access. No hidden costs.

import ea
import numpy as np

kernel = ea.load("cqt.ea")

# Precompute tables once
freqs = F_MIN * 2 ** (np.arange(84) / 12)
lengths = np.ceil(Q * SAMPLE_RATE / freqs).astype(np.int32)

# ... build cos/sin twiddle tables with baked-in Hann window ...

# Per frame: one call, ~50 us
kernel.cqt_fused(
    audio_buffer, cos_table, sin_table,
    offsets, lengths, prev_mags, out_mags,
    decay_alpha, n_bins, max_window_len,
)

The full source is on GitHub.

What We Learned

The first benchmark said 153x. That was a lie — comparing optimized SIMD against a Python for-loop. The second benchmark said 1.1x slower. That was noise — insufficient warmup on a noisy WSL2 scheduler. The real number is 1.9x faster, earned by reading less data, using simpler arithmetic, and fusing the pipeline.

Three optimization attempts failed (quad accumulators, interleaved loads, merged tables) and one gave a modest 5% win (in-place smoothing). The lesson: when the compiler is already generating clean assembly, micro-optimizations rarely help. The wins came from the algorithm design — variable-length windows, baked-in windowing, fused smoothing — not from squeezing the inner loop.

Eä didn't win by being a faster compiler. It won by making it easy to write the right algorithm. A kernel that skips zeros, fuses four operations, and makes one function call will beat a kernel that touches every element of a zero-padded matrix and makes four function calls — even when the latter is backed by BLAS.

Searching Encrypted Logs Without Decrypting Them (Much)

Or: what happens when you let a SIMD compiler loose on a crypto problem.


The Problem That Shouldn't Exist

You have 100 GB of encrypted log files. Something is on fire. You need to find every line containing "ERROR".

Here are your options:

Option A: The "correct" way. Decrypt the entire file to /tmp. Grep it. Delete the decrypted copy. Hope nobody read it while it was sitting there. Also hope your 50 GB /tmp partition can handle it. Also hope you remembered to shred instead of rm.

Option B: The galaxy brain way. Use Fully Homomorphic Encryption to search directly on ciphertext. Wait approximately until the heat death of the universe. Receive your answer. Realize the fire already consumed the building.

Option C: What we actually built. Decrypt 4 KB at a time into a tiny buffer. Search it. Zero the buffer. Move on. The plaintext exists for about as long as a TikTok attention span.

We went with Option C.

The "Fusion" Trick

Here's the thing about ChaCha20 — it's a stream cipher. You XOR a keystream with your plaintext, byte by byte. Decryption is the same operation: XOR the keystream with ciphertext, get plaintext back.

So why not XOR and search at the same time?

Standard pipeline:
  decrypt(ciphertext) → plaintext_file → grep(plaintext_file) → results
  Memory: 100 GB plaintext on disk. Two passes over data.

Fused pipeline:
  for each 4 KB chunk:
    decrypt → search → extract lines → zero buffer
  Memory: 4 KB. One pass. Plaintext never on disk.

This is called loop fusion — cramming multiple operations into a single loop so you only read the data once. It's the same trick that makes a * b + c with FMA faster than doing a * b then + c separately. Except here we're fusing cryptographic decryption with string search.
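The shape of the fused loop can be sketched in a few lines of Python. This is a structural illustration only, under loud assumptions: the keystream is a SHA-256 counter-mode stand-in and is NOT ChaCha20, the function names are hypothetical, and Python cannot guarantee that temporaries are wiped the way a native kernel's buffer can be. A partial line is carried across chunk boundaries, which is one possible way to handle lines that straddle two chunks.

```python
import hashlib


def keystream(key, nonce, offset, length):
    """Stand-in keystream (SHA-256 in counter mode). NOT ChaCha20."""
    out = bytearray()
    block, skip = divmod(offset, 32)
    while len(out) < skip + length:
        out += hashlib.sha256(key + nonce + block.to_bytes(8, "little")).digest()
        block += 1
    return bytes(out[skip:skip + length])


def encrypt(plaintext, key, nonce):
    """XOR with the keystream; decryption is the same operation."""
    ks = keystream(key, nonce, 0, len(plaintext))
    return bytes(p ^ k for p, k in zip(plaintext, ks))


def fused_search(ciphertext, needle, key, nonce, chunk=4096):
    """Decrypt one chunk at a time, search it, then zero the buffer."""
    matches = []
    carry = b""                       # partial line from the previous chunk
    buf = bytearray(chunk)
    for off in range(0, len(ciphertext), chunk):
        piece = ciphertext[off:off + chunk]
        n = len(piece)
        ks = keystream(key, nonce, off, n)
        buf[:n] = bytes(c ^ k for c, k in zip(piece, ks))   # decrypt
        lines = (carry + bytes(buf[:n])).split(b"\n")
        carry = lines.pop()           # last piece may be an unfinished line
        matches.extend(line for line in lines if needle in line)
        buf[:n] = bytes(n)            # zero the plaintext window
    if needle in carry:
        matches.append(carry)
    return matches


key, nonce = bytes(range(32)), bytes(12)
ct = encrypt(b"INFO ok\nERROR disk full\nFATAL crash\n", key, nonce)
found = fused_search(ct, b"ERROR", key, nonce, chunk=8)
```

The tiny chunk size in the example forces lines to span chunks, which exercises the carry logic. Only the matched lines leave the loop, and the full plaintext is never assembled in one place.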

And because we're writing in Eä, a language where you control the SIMD instructions directly, the "search" part isn't some scalar byte-by-byte comparison. It's the same algorithm glibc uses in memmem:

// For each 16-byte chunk of decrypted plaintext:
let bits: i32 = movemask(chunk .== splat(first_byte))
if bits == 0 {
    skip  // no match possible in these 16 bytes
}
// else: verify the candidate positions

vpcmpeqb + vpmovmskb. Compare 16 bytes at once, get a bitmask of hits. If the bitmask is zero, skip. That's it. Most chunks contain zero instances of the letter 'E', so most chunks are skipped entirely.

Multi-Needle: The Actually Useful Part

Searching for one string is cute. Searching for ["ERROR", "FATAL", "PANIC"] without decrypting three times — that's useful.

The v2 kernel takes multiple needles and ORs their bitmasks:

bits = 0
for each unique first-byte:
    bits = bits | movemask(chunk .== splat(first_byte))
if bits == 0:
    skip  // none of the needles start with any byte in this chunk

One decryption pass. Three patterns. All the matched lines extracted with \n-boundary detection using the same SIMD trick.
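The first-byte filter is easy to model on the host. The Python sketch below emulates the compare-and-movemask step one lane at a time and then verifies candidates; movemask_eq and find_candidates are illustrative names, and the 16-byte width mirrors the chunk size in the pseudocode rather than any detail of the real kernel.

```python
def movemask_eq(chunk: bytes, byte: int) -> int:
    """Emulate compare-equal + movemask: bit i is set if chunk[i] == byte."""
    bits = 0
    for lane, b in enumerate(chunk):
        if b == byte:
            bits |= 1 << lane
    return bits


def find_candidates(data: bytes, needles, width: int = 16):
    """Return (position, needle_id) for every verified match."""
    first_bytes = {n[0] for n in needles}
    hits = []
    for base in range(0, len(data), width):
        chunk = data[base:base + width]
        bits = 0
        for fb in first_bytes:
            bits |= movemask_eq(chunk, fb)   # OR the per-needle masks
        if bits == 0:
            continue                         # no needle can start here
        lane = 0
        while bits:
            if bits & 1:
                pos = base + lane
                for nid, n in enumerate(needles):
                    if data.startswith(n, pos):   # verify the candidate
                        hits.append((pos, nid))
            bits >>= 1
            lane += 1
    return hits


data = b"INFO ok\nERROR disk full\nFATAL crash\n"
hits = find_candidates(data, [b"ERROR", b"FATAL"])
```

The 'F' in "INFO" sets a bit in the first chunk's mask, so it becomes a candidate, but verification rejects it. That is the general pattern: the mask is a cheap filter, and only surviving positions pay for a full comparison.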

The result:

from eachacha import encrypt, search

key = bytes(range(32))
nonce = bytes(12)
ct = encrypt(b"INFO ok\nERROR disk full\nFATAL crash\n", key, nonce)

result = search(ct, [b"ERROR", b"FATAL"], key, nonce)
for i in range(result.match_count):
    print(f"[{result.needle_ids[i]}] {result.lines[i]}")

# [0] b'ERROR disk full'
# [1] b'FATAL crash'

Three lines of Python. The search() function does the rest — picks the right kernel, allocates buffers, packs needles, calls the SIMD kernel, returns structured results.

The Numbers

AMD EPYC 9354P, 2 vCPUs, 64 MB test data:

Single needle:

Implementation                        GB/s
Eä fused decrypt+search               1.28
Decrypt then C memmem (two passes)    0.96
C memmem on plaintext (no decrypt)    2.22

Three needles:

Implementation                        GB/s
Eä v2 multi-needle (one pass)         0.52
Eä v1 called 3 times (three passes)   0.41
C memmem x3 on plaintext              0.78

The fused approach is 1.34× faster than decrypt-then-search for single needle, and 1.28× faster than three separate searches for multi-needle. Not because our search is faster — it's because we only decrypt once.

And yes, 0.52 GB/s on encrypted data is 67% of what C memmem achieves on plaintext. We're paying the ChaCha20 tax, but the fusion rebate is generous.

What We're NOT Claiming

Let's be honest about the security model, because someone on Hacker News will definitely yell at me if I'm not:

What we guarantee                                What we don't claim
Plaintext never on disk                          "Zero RAM exposure"
4 KB buffer, zeroed per window                   Equivalent to FHE
Only match offsets + lines leave the kernel      Side-channel resistance
400-million-fold reduction in exposure surface   That this is a good idea for nuclear launch codes

This is a practical middle ground between "decrypt everything to a temp file" (what everyone does today) and "compute on ciphertext" (what nobody can afford to do today). It's not cryptographically novel. It's an engineering choice: minimize exposure, maximize throughput, accept the 4 KB window.

Also, no special hardware required. No SGX enclaves. No trust assumptions about chip manufacturers. Just software and a CPU that can XOR things fast.

The Autoresearch Detour

We tried to make it faster. The Eä project has an automated optimization loop (autoresearch) that iteratively modifies kernels and benchmarks them:

Iteration 1: Remove buffer zeroing → 0% improvement (zeroing isn't the bottleneck)
Iteration 2: unroll(10) on round loops → parse error (oops)
Iteration 3: 2-block ILP instead of 4-block → 24% SLOWER
Iteration 4: Inline rotation functions → 5% slower (LLVM already inlines them)
Iteration 5: Timeout (the kernel is 866 lines, agent couldn't finish)

Total improvement from 5 iterations: 0%.

The kernel is compute-bound. ChaCha20's 20 rounds of quarter-round operations dominate. The search and line extraction are fast — the bottleneck is the math. Which is honestly a good place to be: it means we're not leaving performance on the table.

2,098 Lines

That's the total across all four kernels:

Kernel                  Lines   What it does
chacha20.ea             272     Encrypt (1.78 GB/s)
chacha20_fused.ea       384     Encrypt + stats in one pass
chacha20_search.ea      576     Single-needle encrypted search
chacha20_search_v2.ea   866     Multi-needle + context lines

For comparison, OpenSSL's ChaCha20 implementation is somewhere north of 100,000 lines of C and assembly. Ours passes the same RFC 7539 test vectors and cross-verifies byte-for-byte with OpenSSL.

I'm not saying we're better than OpenSSL. OpenSSL in native C with AVX-512 would destroy us. But we're 272 lines that a human can read in 10 minutes, and we still hit 1.78 GB/s. That's the trade-off.

Try It

pip install eachacha
from eachacha import encrypt, search

key = bytes(range(32))
nonce = bytes(12)
ct = encrypt(b"your secret logs here", key, nonce)
result = search(ct, b"secret", key, nonce)
print(result.offsets)  # [5]

Or build from source for native CPU optimization:

pip install ea-compiler
git clone https://github.com/petlukk/eachacha
cd eachacha && ./build.sh && pip install -e .

Benchmarks measured on 2 vCPUs. Your numbers will be different, probably better. The ratios hold. Code, benchmarks, and 109 tests on GitHub.

Eä is open-source. The compiler is ~12,000 lines of Rust. Docs. GitHub.

How 30 Lines of Eä Beat NumPy by 6×

And why your framework is probably slower than a for-loop.


The Pitch

Here's a fused multiply-add: out[i] = a[i] * b[i] + c[i]. Sixteen million times.

NumPy does it in 46 milliseconds. Eä does it in 7 milliseconds. That's 6.6× faster.

Here's the kicker: the Eä kernel is 30 lines. The NumPy version is one line. And the one-liner loses because it's actually two operations pretending to be one.

The NumPy Version

out = a * b + c

Simple. Elegant. Two full scans of 64 MB of data.

NumPy computes a * b first, writes 64 MB to a temporary array, then reads it back to add c. Counting each array read or write as 64 MB, that's 384 MB of memory traffic (four reads, two writes) for 256 MB of actual data. Every element of the intermediate product gets stored and then loaded again.

On a modern CPU at ~35 GB/s memory bandwidth, that's a floor of about 11 milliseconds just for memory. NumPy hits 46 ms because it also has Python dispatch overhead, array allocation, and, crucially, it can't fuse the multiply and add into a single instruction.

The Eä Version

export func fma_f32x8(
    a: *restrict f32,
    b: *restrict f32,
    c: *restrict f32,
    out: *mut f32,
    n: i32
) {
    let mut acc0: f32x8 = splat(0.0)
    let mut acc1: f32x8 = splat(0.0)
    let mut i: i32 = 0
    while i + 16 <= n {
        let va0: f32x8 = load(a, i)
        let vb0: f32x8 = load(b, i)
        let vc0: f32x8 = load(c, i)
        store(out, i, fma(va0, vb0, vc0))

        let va1: f32x8 = load(a, i + 8)
        let vb1: f32x8 = load(b, i + 8)
        let vc1: f32x8 = load(c, i + 8)
        store(out, i + 8, fma(va1, vb1, vc1))

        i = i + 16
    }
    while i < n {
        out[i] = a[i] * b[i] + c[i]
        i = i + 1
    }
}

One pass. Load a, b, c, fuse multiply-add in a single vfmadd instruction, store result. Each element is touched once. Total memory traffic: 256 MB (same 4 arrays), but only one scan instead of two.

The *restrict tells LLVM the arrays don't alias, enabling aggressive optimization. The f32x8 processes 8 floats per instruction on AVX2. The 2× unroll (processing 16 elements per iteration) hides memory latency.

That's it. No magic. Just not doing twice the work.
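The "twice the work" claim is easy to check by counting array accesses. The sketch below is a plain Python model, not a benchmark: two_pass, fused, and traffic_mb are hypothetical helper names, and the count ignores caches and write-allocate traffic.

```python
def two_pass(a, b, c):
    """out = a * b + c as two separate array operations with a temporary."""
    loads = stores = 0
    tmp = [0.0] * len(a)
    for i in range(len(a)):          # pass 1: tmp = a * b
        tmp[i] = a[i] * b[i]
        loads += 2
        stores += 1
    out = [0.0] * len(a)
    for i in range(len(a)):          # pass 2: out = tmp + c
        out[i] = tmp[i] + c[i]
        loads += 2
        stores += 1
    return out, loads, stores


def fused(a, b, c):
    """Same result in one pass, with no temporary array."""
    loads = stores = 0
    out = [0.0] * len(a)
    for i in range(len(a)):
        # Python rounds the product before adding; a hardware FMA rounds once.
        out[i] = a[i] * b[i] + c[i]
        loads += 3
        stores += 1
    return out, loads, stores


def traffic_mb(n_elements, accesses_per_element, bytes_per_element=4):
    """Bytes moved, in MB, for a given number of accesses per element."""
    return n_elements * accesses_per_element * bytes_per_element / 1e6


print(traffic_mb(16_000_000, 6))   # two passes: 384.0
print(traffic_mb(16_000_000, 4))   # fused:      256.0
```

Six accesses per element against four: the fused loop moves two thirds of the bytes and never allocates the 64 MB temporary.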

"But I'll Just Use Ray"

We benchmarked that too. For science.

Method                   Time        Throughput   vs NumPy
NumPy (multiply + add)   46,000 µs   5.6 GB/s     baseline
Eä (1 thread)            6,900 µs    37.0 GB/s    6.6×
Eä (2 threads)           6,500 µs    39.1 GB/s    7.0×
Dask (2 chunks)          56,000 µs   4.6 GB/s     0.8×
Ray (2 workers)          89,000 µs   2.9 GB/s     0.5×

Ray is twice as slow as NumPy. For FMA. On two cores. On the same machine.

Why? Serialization. Ray pickles your 64 MB arrays, sends them to worker processes, unpickles them, runs NumPy (which does two passes), pickles the results, and sends them back. The actual compute takes 46 ms. The overhead takes another 43 ms.

Dask is better — it chunks lazily and uses NumPy under the hood — but it still can't fuse operations or control SIMD width. It's paying for an abstraction layer over code that's already fast enough.

The uncomfortable truth: for single-machine numerical work, a for-loop that touches memory once will beat a distributed framework that touches memory twice.

The Dot Product Story

We weren't always winning. Our first benchmark had Eä's dot product at 0.27× of BLAS. Embarrassing.

// The naive version — DO NOT ship this
export func dot_naive(a: *f32, b: *f32, n: i32) -> f32 {
    let mut acc: f32 = 0.0
    let mut i: i32 = 0
    while i < n {
        acc = acc + a[i] * b[i]
        i = i + 1
    }
    return acc
}

Single scalar accumulator. Each iteration depends on the previous one. The CPU stalls waiting for the addition to complete before it can start the next multiply. Classic dependency chain bottleneck.

BLAS uses the trick every HPC programmer knows: multiple independent accumulators.

// The optimized version — ships in autoresearch/kernels/dot_product/
export func dot_f32x8(a: *restrict f32, b: *restrict f32, len: i32) -> f32 {
    let mut acc0: f32x8 = splat(0.0)
    let mut acc1: f32x8 = splat(0.0)
    let mut i: i32 = 0
    while i + 32 <= len {
        acc0 = fma(load(a, i),      load(b, i),      acc0)
        acc1 = fma(load(a, i + 8),  load(b, i + 8),  acc1)
        acc0 = fma(load(a, i + 16), load(b, i + 16), acc0)
        acc1 = fma(load(a, i + 24), load(b, i + 24), acc1)
        i = i + 32
    }
    // ... tail handling ...
    return reduce_add_fast(acc0 .+ acc1)
}

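Two f32x8 accumulators (acc0, acc1) with 4× unroll. That's two independent dependency chains of 8 lanes each, so 16 partial sums accumulate in parallel and an FMA into acc1 can issue while the previous FMA into acc0 is still completing. The result: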

Method                           Time        GB/s   vs BLAS
NumPy BLAS sdot                  3,535 µs    35.9   baseline
Eä naive (scalar)                13,222 µs   9.7    0.27×
Eä f32x4 (1 acc)                 4,474 µs    28.6   0.79×
Eä f32x8 (dual acc, 4× unroll)   3,500 µs    36.6   1.01×

From 0.27× to 1.01× — by changing the loop structure, not the algorithm. Both versions compute the same dot product. The fast one just asks the CPU to think about 32 elements instead of 1.
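Here is the accumulator structure modeled in plain Python, including one possible way to handle the tail. It is a sketch, not the shipped kernel: dot_dual_acc is a hypothetical name, each 8-element list stands in for one f32x8 register, and the 4× unroll is reduced to one step per accumulator for brevity.

```python
LANES = 8   # one f32x8 register holds 8 partial sums


def dot_dual_acc(a, b):
    """Dot product with two independent 8-lane accumulators plus a scalar tail."""
    acc0 = [0.0] * LANES
    acc1 = [0.0] * LANES
    n = len(a)
    i = 0
    while i + 2 * LANES <= n:
        for lane in range(LANES):
            # acc0 and acc1 never read each other, so the two updates
            # form separate dependency chains.
            acc0[lane] += a[i + lane] * b[i + lane]
            acc1[lane] += a[i + LANES + lane] * b[i + LANES + lane]
        i += 2 * LANES
    total = sum(acc0) + sum(acc1)     # horizontal reduction at the end
    while i < n:                      # scalar tail for the leftover elements
        total += a[i] * b[i]
        i += 1
    return total


a = list(range(37))
b = [2] * 37
print(dot_dual_acc(a, b))   # 1332.0
```

One caveat worth knowing: splitting the sum across accumulators changes the order of the floating-point additions, so the result can differ from the naive loop in the last few bits. For integer-valued inputs like the ones above it is exact.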

What You're Actually Seeing

This isn't "Eä is fast." Eä is a thin wrapper around LLVM. The FMA kernel compiles to about 12 assembly instructions in the inner loop. Eä's job is to make writing those 12 instructions feel like writing 30 lines of readable code instead of 200 lines of intrinsics.

The real insight is about NumPy's cost model:

Operation            NumPy                             Eä
a * b + c            2 passes over data                1 pass (fused FMA)
Temporary arrays     1 allocation (64 MB)              0 allocations
SIMD width           "whatever the compiler picks"     explicit f32x8
Memory round-trips   2 (multiply result → RAM → add)   0

NumPy is fast per operation. But real workloads aren't one operation — they're pipelines. And every np. call is a full scan of your data.

When Eä Doesn't Win

Bandwidth-bound operations with a single arithmetic op per element. Scaling an array:

out = src * 2.5

NumPy: 5,631 µs. Eä: 6,138 µs. NumPy wins (by 8%).

Both saturate memory bandwidth at ~22 GB/s. There's nothing to fuse. One multiply per element loaded. No amount of SIMD cleverness makes DRAM faster.

This is the honest dividing line: if your inner loop does fewer than ~4 operations per element, NumPy is fast enough. If it does more, you're leaving performance on the table.

Getting Started

pip install ea-compiler
import ea

kernel = ea.load("fma.ea")

import numpy as np
a = np.random.rand(16_000_000).astype(np.float32)
b = np.random.rand(16_000_000).astype(np.float32)
c = np.random.rand(16_000_000).astype(np.float32)
out = np.zeros(16_000_000, dtype=np.float32)

kernel.fma_f32x8(a, b, c, out)  # 7 ms instead of 46 ms

No Cython. No Numba. No C compiler. No JIT warmup. Write a .ea file, call ea.load(), pass NumPy arrays.

The compiler handles SIMD, the binding handles types, and the hardware handles the rest.


All benchmarks measured on a 2-core machine. Your numbers will be different, probably better (more cores = more bandwidth). The ratios hold. Source code and methodology on GitHub.

Eä is open-source (Apache 2.0). The compiler is ~12,000 lines of Rust with 475+ tests. Documentation. GitHub.

I Wrote a SIMD Compiler in 12K Lines of Rust

And then an LLM optimized the kernels better than I could.


The Elevator Pitch

Eä is a compiler for SIMD kernels. You write a small .ea file, run one command, and call it from Python like a normal function. But it runs at native vectorized speed.

import ea
kernel = ea.load("fma.ea")
result = kernel.fma_f32x8(a, b, c, out)  # 6.6× faster than NumPy

No ctypes. No header files. No build system. The compiler generates the shared library and the Python wrapper. Also Rust, C++, PyTorch, and CMake bindings. It targets x86-64 (AVX2/AVX-512) and AArch64 (NEON).

The whole compiler is 12,000 lines of Rust. 475 tests. One person. I'm not a compiler engineer. But I had a very specific problem, and it turns out that's enough.

Why

I had a problem that kept repeating. I'd write something in Python, profile it, find a hot loop, and think: "this needs to be fast." And I knew that fast meant C. Fast meant SIMD. I don't have deep experience with either, but I knew that's where the performance lives.

So I'd fumble through some C code (with an LLM helping me write it, which honestly made it worse because now the code worked but I didn't fully understand why), fight with ctypes, spend an afternoon debugging pointer arithmetic that the AI got right the first time but I broke while "cleaning up," and eventually get a 5× speedup. Then next week, different project, same dance.

I didn't mind the hard part. Figuring out what the kernel should do, thinking about memory access patterns, deciding on vector widths. That's the interesting problem. What I minded was the plumbing. The header files. The build system. The ctypes declarations. The dtype validation. All of it boilerplate, all of it error-prone, none of it the actual work.

So I thought: what if a compiler could handle the plumbing? You write the kernel in a simple language, something that looks like the pseudocode you'd sketch on a whiteboard, and the compiler handles everything else. Compile to a shared library. Auto-generate the Python wrapper. One command. No Makefile.

The "no glue code" part turned out to be the product. Not the SIMD. Not the compiler. The fact that you go from .ea file to working Python function in one command, with types checked, lengths inferred, and output buffers allocated. That's what makes people actually use it instead of just reading about it.

I didn't know how to build a compiler. But I had the idea, and I wanted to see if it would work.

The First Attempt (10K Lines of Pain)

I should tell you about the compiler I wrote before this one.

It also targeted LLVM. It also generated SIMD code. The codegen was 10,000 lines in a single file. The parser was hand-written but handled way too many features. There were no hard limits on file size, no style rules, no test discipline. I kept adding things (generics, a module system, type inference) because why not?

The codebase became unmaintainable in about three months. I couldn't change anything without breaking something else. The codegen file had functions that called functions that called functions eight levels deep, and half of them were handling edge cases for features nobody used.

I threw it away.

The lesson: a SIMD kernel compiler doesn't need generics. It doesn't need modules. It doesn't need type inference. It needs to compile load, store, fma, and splat correctly, generate clean bindings, and stay small enough that one person can hold it in their head.

The Hard Rules

For the second attempt, I wrote the rules before I wrote the code:

  1. No file exceeds 500 lines. Split before you hit the limit.
  2. Every feature proven by end-to-end test. If it's not tested, it doesn't exist.
  3. No fake functions. If hardware doesn't support an operation, the compiler errors. No silent fallbacks.
  4. No premature features. Don't build what isn't needed yet.
  5. Delete, don't comment. Dead code gets removed.

The 500-line rule was the hardest to follow and the most valuable. It forced me to split the type checker into 7 files, the codegen into 10 files, and the parser into 4 files. Each file does one thing. When I need to change how store is type-checked, I open intrinsics_memory.rs (309 lines) and the answer is right there. No grepping through 10K lines of spaghetti.

Was it frustrating? Constantly. I'd be in the middle of adding a feature, hit 480 lines, and have to stop and refactor before I could finish. But every time I did, the code got better. The refactor always revealed something. A responsibility that should have been split earlier. A function that was doing two things.

The "no premature features" rule was the other hard one. I kept wanting to add generics. Or a module system. Or traits. And every time, I'd ask myself: does this serve the goal of compiling SIMD kernels to callable shared libraries? The answer was always no. Eä is monomorphic by design. You write f32x8, you get f32x8. No hidden specialization, no surprise codegen, no combinatorial explosion of type instances.

It's not a limitation. It's the point.

The Architecture

.ea → Lexer (logos) → Parser → Desugar → Type Check → Codegen (LLVM 18) → .o / .so
                                                                          → .ea.json → ea bind

The key insight: the desugarer is the most important pass.

Eä has a kernel construct that looks like this:

export kernel vscale(data: *f32, out: *mut f32, factor: f32)
    over i in n step 8
    tail scalar { out[i] = data[i] * factor }
{
    store(out, i, load(data, i) .* splat(factor))
}

The desugarer turns this into a plain function with a while-loop and a tail loop. After desugaring, there are no kernels in the AST. Just functions with loops. The type checker and codegen never see kernel. They only see func.

This means every downstream pass is simpler. The type checker doesn't need special kernel logic. The codegen doesn't need to handle iteration. The desugarer handles all of it: injecting the n parameter, generating the loop variable, building the main loop with step, building the tail loop with the chosen strategy.

The desugar pass is 340 lines. It eliminates an entire class of complexity from the remaining 11,660 lines.
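To picture what comes out of the desugarer, here is the vscale kernel from above written as the plain function it turns into, modeled in Python. This is a sketch of the shape only: the injected n parameter, the step-8 main loop, and the scalar tail come from the description above, while the exact loop condition and the Python rendering are assumptions.

```python
STEP = 8   # "step 8" from the kernel header


def vscale(data, out, factor, n):
    """Python model of the desugared kernel: main loop, then scalar tail."""
    i = 0
    # Main loop: runs only while a full STEP-wide block is available.
    # The slice assignment stands in for store(out, i, load(data, i) .* splat(factor)).
    while i + STEP <= n:
        out[i:i + STEP] = [x * factor for x in data[i:i + STEP]]
        i += STEP
    # Tail loop: the "tail scalar { ... }" body, one element at a time.
    while i < n:
        out[i] = data[i] * factor
        i += 1


data = [float(x) for x in range(19)]
out = [0.0] * 19
vscale(data, out, 2.0, 19)   # two full blocks of 8, then 3 tail elements
```

After this rewrite there is nothing kernel-specific left for the type checker or codegen to handle: it is a function with two while-loops.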

The Binding Generators

This is the part people don't expect. The compiler generates .ea.json metadata describing each exported function's signature. Then ea bind reads the JSON and generates idiomatic wrappers:

ea bind kernel.ea --python --rust --cpp --pytorch --cmake

The Python generator does something clever: length collapsing. If your kernel takes (data: *f32, n: i32), the generated Python function takes just data and fills n from data.size automatically. Output parameters marked with out get auto-allocated. The generated code checks dtypes, casts pointers, and handles all the ctypes plumbing that used to take me an afternoon.

Five binding generators, each 200-460 lines. No serde. The JSON parser is hand-written (65 lines). The generated code is clean enough that you can read it, modify it, and learn from it.
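A rough idea of what length collapsing and output allocation look like from the caller's side, as a stdlib-only Python sketch. Nothing here is the generated code: _native_scale is a stand-in for the compiled kernel (the real wrapper goes through ctypes), scale is a hypothetical wrapper name, and array("f") replaces a NumPy float32 array so the example runs without dependencies.

```python
from array import array


def _native_scale(src, dst, factor, n):
    """Stand-in for the compiled kernel's C-style signature (src, dst, factor, n)."""
    for i in range(n):
        dst[i] = src[i] * factor


def scale(src, factor):
    """Wrapper in the style described above: n is collapsed, dst is allocated."""
    # dtype check: only float32 buffers are accepted.
    if not isinstance(src, array) or src.typecode != "f":
        raise TypeError("src must be a float32 array (typecode 'f')")
    n = len(src)                      # length collapsing: n comes from the data
    dst = array("f", [0.0]) * n       # output parameter allocated by the wrapper
    _native_scale(src, dst, float(factor), n)
    return dst


print(scale(array("f", [1.0, 2.0, 3.0]), 2.0).tolist())   # [2.0, 4.0, 6.0]
```

The caller passes two arguments instead of four, and the two values that are easiest to get wrong (the length and the output buffer size) are no longer the caller's problem.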

Error Messages (The Quiet Win)

I spent more time on error messages than on codegen.

The compiler has "did you mean?" suggestions with Levenshtein distance, dot-operator hints ("cannot use '+' on vectors, use '.+'"), let mut suggestions, type conversion hints, multi-character underlines showing the full expression span, and clear messages that never leak internal compiler state.

kernel.ea:5:12  error[type]: cannot use '+' on vectors. Use '.+' for element-wise vector operations
        return a + b
               ^^^^^

Nobody notices good error messages. But everyone notices bad ones. The difference between a user who gives up and a user who fixes their code is often one helpful error message.
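For readers who haven't built one, a "did you mean?" suggestion is a small amount of code. The sketch below is a generic Python version, not the compiler's Rust implementation: did_you_mean is a hypothetical name and the cutoff of 2 edits is an assumption.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def did_you_mean(name, known, max_distance=2):
    """Return the closest known name, or None if nothing is close enough."""
    best = min(known, key=lambda k: levenshtein(name, k), default=None)
    if best is not None and levenshtein(name, best) <= max_distance:
        return best
    return None


intrinsics = ["load", "store", "fma", "splat"]
print(did_you_mean("splt", intrinsics))    # splat
print(did_you_mean("xyzzy", intrinsics))   # None
```

The cutoff matters as much as the distance: suggesting "fma" for "xyzzy" would be worse than suggesting nothing.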

The Autoresearch System (The Fun Part)

This is where it gets interesting.

Inspired by Andrej Karpathy's autoresearch concept, I built an automated optimization loop: an LLM reads the kernel source, the benchmark results, and the history of what's been tried, then proposes a modified kernel. The system compiles it, benchmarks it across multiple data sizes (to catch cache-fitting illusions), verifies correctness against a C reference, and accepts or rejects the change. Then it iterates.
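Stripped of the LLM and the compiler, the loop is a guarded hill climb. The Python sketch below shows only that control flow, with everything pluggable: autoresearch, propose, is_correct, and benchmark are hypothetical names, and scoring on the largest size is an assumption based on how the benchmarks are described further down.

```python
def autoresearch(baseline, propose, is_correct, benchmark, sizes, iterations):
    """Keep a candidate only if it is correct and faster at the largest size."""
    score_size = max(sizes)
    best = baseline
    best_time = benchmark(best, score_size)
    history = []                                  # fed back into propose()
    for _ in range(iterations):
        candidate = propose(best, history)
        if not is_correct(candidate):             # correctness gate comes first
            history.append((candidate, "rejected: wrong output"))
            continue
        # Measure every size so small-input wins that vanish at scale are
        # visible, but decide on the largest one.
        times = {n: benchmark(candidate, n) for n in sizes}
        if times[score_size] < best_time:
            best, best_time = candidate, times[score_size]
            history.append((candidate, "accepted"))
        else:
            history.append((candidate, "rejected: not faster"))
    return best, history


# Toy run: "kernels" are unroll factors, the benchmark is a made-up cost model.
candidates = iter([2, 3, 4, 8])
best, history = autoresearch(
    baseline=1,
    propose=lambda current, past: next(candidates),
    is_correct=lambda k: k != 3,
    benchmark=lambda k, n: n / k if k <= 4 else float(n),
    sizes=[1_000, 1_000_000],
    iterations=4,
)
print(best)   # 4
```

The order of the checks is the important design choice: a candidate that produces wrong output is never timed, so a fast-but-broken kernel can't win.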

This is where I had the most fun building Eä.

The first time I ran it on the FMA kernel, it found a 10% improvement in 30 iterations. I thought the kernel was already as good as it could get. The LLM found that 12× unrolling with stream stores beat 4× unrolling with regular stores at DRAM scale. I wouldn't have tried that. It sounds like overkill.

Then I let it run on the matrix multiplication kernel. 56% improvement. It switched from ijk to ikj loop order with 8× k-unrolling. I've heard of loop tiling. I couldn't have told you when to apply it. The LLM didn't need to "know." It just tried it and the benchmark said yes.
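For anyone who hasn't seen the ijk versus ikj distinction, here are both loop orders in plain Python. This is only an illustration of the reordering, not the Eä kernel: the function names are made up, the matrices are square lists of lists, and the 8× k-unrolling is left out.

```python
def matmul_ijk(A, B, n):
    """Textbook order: the inner loop walks down a column of B."""
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0
            for k in range(n):
                s += A[i][k] * B[k][j]    # B[k][j]: a different row each step
            C[i][j] = s
    return C


def matmul_ikj(A, B, n):
    """Reordered: the inner loop walks along one row of B and one row of C."""
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        row_c = C[i]
        for k in range(n):
            aik = A[i][k]                 # fixed for the whole inner loop
            row_b = B[k]
            for j in range(n):
                row_c[j] += aik * row_b[j]    # consecutive elements of both rows
    return C


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_ijk(A, B, 2))   # [[19, 22], [43, 50]]
print(matmul_ikj(A, B, 2))   # [[19, 22], [43, 50]]
```

Same arithmetic, same result. In a row-major layout the ikj inner loop touches memory sequentially, which is also the access pattern that vector loads and stores want.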

The thing that surprised me most: you think you have an optimal kernel. You let the LLM iterate 5 times and it finds 20% improvement. Okay, fine, maybe it wasn't optimal. So you let it iterate 50 times on the already-improved kernel. And it still finds improvements. The search space for kernel optimization is bigger than your intuition.

27 benchmark kernels, all scored on largest-size (real-world) data with GB/s bandwidth metrics. The system includes bottleneck classification that tells the LLM whether a kernel is DRAM-bound (don't bother with compute tricks), compute-bound (try wider SIMD, more accumulators), or mixed. The biggest wins:

Kernel         Improvement   What Changed
Bitonic sort   97%           Replaced O(n²) Shellsort with sorting network
Matmul         56%           k×8 unroll, cache-friendly access
Conv2d 3×3     47%           4× column unroll, prefetch, restrict
Edge detect    41%           f32x4 to f32x8 upgrade

The humbling part: I'm not an optimization expert. But it turns out you don't need to be one. You need a benchmark harness, a correctness check, and a system that's willing to try things you wouldn't think of.

The Numbers

Source:          12,000 lines of Rust
Tests:           475 end-to-end
Test method:     compile Eä → link with C → run binary → compare stdout
CI:              x86-64, AArch64 (native), Windows
LLVM backend:    18.1 via inkwell 0.8
Binding targets: Python, Rust, C++, PyTorch, CMake

Performance on a real workload (16M float32 elements):

FMA (fused multiply-add):  6.6× faster than NumPy    37.0 GB/s
Dot product:               matches BLAS               36.6 GB/s
SAXPY:                     2.1× faster than NumPy     35.2 GB/s

That's ~37 GB/s on a system with ~40 GB/s memory bandwidth. Near the hardware limit. There's not much room left, which means the code is doing roughly as little unnecessary work as possible.

What I'd Do Differently

Start with the binding generator. I built the compiler first and added bindings later. But the bindings are what make Eä useful. If I'd started by designing the ideal Python API and worked backward to the compiler, some early decisions would have been different.

Add ea inspect earlier. The instruction analysis tool that shows you vector/scalar instruction counts, FMA operations, load/store ratio, and performance hints. I added it late, but it would have caught optimization issues months earlier.

Write fewer features, sooner. Eä has kernel, foreach, for, while, structs, output annotations, conditional compilation, static assertions, and 30+ intrinsics. Most users need kernel, load, store, fma, and splat. I should have shipped a useful subset earlier and iterated based on real usage.

The Real Story

I'm not a compiler engineer. I don't have a CS degree. I'm the kind of person who has ideas and wants to see if they work.

What changed is the tooling. I built Eä with the help of AI models. Claude for the heavy lifting, my own judgment for the architecture and design decisions. The hard rules came from me (learned the painful way from the first attempt). The implementation speed came from having a capable coding assistant.

A year ago, this would have required a team. Now it's 12,000 lines and one person.

The interesting question isn't "can you build this?" anymore. It's "what are you going to build next?"

My advice: if you have an idea that feels too ambitious for one person, the calculus has changed. Try it. Set hard rules so the codebase stays manageable. Write tests for everything. And don't be afraid to throw away your first attempt.

I threw away 10K lines of bad compiler and started over. Best decision I made.


Eä is open-source under Apache 2.0. GitHub · Documentation · pip install ea-compiler