Getting Started
Eä is a compute kernel compiler. You write small, focused compute functions, compile them to native code (.so/.dll), and call them from Python, Rust, C++, or PyTorch via C ABI.
No runtime, no GC, no standard library. Just kernels.
Quick start
pip install ea-compiler
Write a kernel (scale.ea):
export func scale(src: *f32, dst: *mut f32, factor: f32, n: i32) {
    let mut i: i32 = 0
    while i < n {
        dst[i] = src[i] * factor
        i = i + 1
    }
}
Call it from Python:
import ea
import numpy as np
kernel = ea.load("scale.ea")
src = np.random.randn(1_000_000).astype(np.float32)
dst = np.empty_like(src)
kernel.scale(src, dst, factor=2.0)
ea.load() compiles your kernel to a native shared library, caches it, and gives you a callable Python function. You do not need Rust or LLVM installed, and there is no separate build step.
Next steps
- Installation — platform-specific setup
- Your First Kernel — write a SIMD kernel step by step
- Why Eä? — design philosophy and when to use Eä
Installation
From pip (recommended)
pip install ea-compiler
This gives you the ea compiler and the Python ea.load() API. No other dependencies needed (besides NumPy).
Works on:
- Linux x86_64
- Linux aarch64 (ARM)
- Windows x86_64
Verify installation
import ea
print(ea.__version__) # e.g., "1.7.0"
print(ea.compiler_version()) # same, from the bundled binary
Building from source
For development or unsupported platforms, see the eacompute README for instructions on building the compiler from source. This requires Rust and LLVM 18.
Your First Kernel
Let's write a kernel that scales an array by a constant factor, then upgrade it to use SIMD.
Step 1: Scalar version
Create scale.ea:
export func scale(src: *f32, dst: *mut f32, factor: f32, n: i32) {
    let mut i: i32 = 0
    while i < n {
        dst[i] = src[i] * factor
        i = i + 1
    }
}
Key things to notice:
- export makes the function callable from Python (C ABI)
- *f32 is an immutable pointer to float32: your input array
- *mut f32 is a mutable pointer: your output array
- n: i32 is the array length, which the caller provides
- All types are explicit. No inference, no ambiguity.
Step 2: Call from Python
import ea
import numpy as np
kernel = ea.load("scale.ea")
src = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
dst = np.empty_like(src)
kernel.scale(src, dst, factor=2.0)
print(dst) # [2. 4. 6. 8.]
ea.load() compiles scale.ea the first time, then caches the result in __eacache__/. Subsequent calls load the compiled library from the cache instead of recompiling.
Step 3: SIMD version
Now let's process 8 floats at a time:
export func scale_simd(src: *f32, dst: *mut f32, factor: f32, n: i32) {
    let s: f32x8 = splat(factor)
    let mut i: i32 = 0
    while i < n {
        let v: f32x8 = load(src, i)
        store(dst, i, v .* s)
        i = i + 8
    }
}
What changed:
- f32x8: a vector of 8 floats (256-bit, uses AVX2 on x86)
- splat(factor): broadcasts the scalar to all 8 lanes
- load(src, i): loads 8 consecutive floats starting at index i
- v .* s: element-wise multiply (the dot prefix . means "vector operation")
- store(dst, i, ...): writes 8 floats back
- We increment by 8, not 1
Important: This only works when n is a multiple of 8. Otherwise the last iteration loads and stores past the end of the arrays. For arbitrary lengths, see Kernels for tail-handling strategies.
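One way a caller can respect the multiple-of-8 requirement is to split the length into a SIMD-friendly prefix and a short remainder, then hand only the prefix to scale_simd. The sketch below is plain Python; the helper name split_for_simd is hypothetical and not part of the ea package.

```python
def split_for_simd(n: int, step: int = 8) -> tuple[int, int]:
    """Split n elements into a SIMD-friendly prefix and a scalar remainder.

    Hypothetical helper for illustration; not part of the ea package.
    """
    if n < 0 or step <= 0:
        raise ValueError("n must be >= 0 and step must be > 0")
    tail = n % step   # elements left over after the full chunks
    main = n - tail   # largest multiple of step that fits in n
    return main, tail


main, tail = split_for_simd(1_000_003)
print(main, tail)  # 1000000 3
```

The first `main` elements can go through the SIMD function, and the last `tail` elements through the scalar one.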
Step 4: Compare performance
import ea
import numpy as np
import time
# Assumes scale.ea defines both scale (Step 1) and scale_simd (Step 3).
kernel = ea.load("scale.ea")
src = np.random.randn(10_000_000).astype(np.float32)  # length is a multiple of 8
dst = np.empty_like(src)
# Eä scalar
start = time.perf_counter()
for _ in range(100):
    kernel.scale(src, dst, factor=2.0)
ea_scalar = (time.perf_counter() - start) / 100
# Eä SIMD
start = time.perf_counter()
for _ in range(100):
    kernel.scale_simd(src, dst, factor=2.0)
ea_simd = (time.perf_counter() - start) / 100
# NumPy
start = time.perf_counter()
for _ in range(100):
    np.multiply(src, 2.0, out=dst)
numpy_time = (time.perf_counter() - start) / 100
print(f"Eä scalar: {ea_scalar*1000:.2f} ms")
print(f"Eä SIMD: {ea_simd*1000:.2f} ms")
print(f"NumPy: {numpy_time*1000:.2f} ms")
For this simple operation, all three will be similar — it's bandwidth-bound (one operation per element loaded). Eä shines on compute-bound workloads where there are multiple operations per element. See the Cookbook for real-world comparisons.
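To see why the three timings converge, it can help to estimate the time needed just to move the data. The sketch below is a back-of-the-envelope calculation; the 20 GB/s bandwidth figure is an assumed, illustrative number rather than a measurement, and the helper name bandwidth_floor_ms is hypothetical.

```python
def bandwidth_floor_ms(n_elements: int, bytes_per_element: int,
                       streams: int, bandwidth_gb_s: float) -> float:
    """Lower bound on runtime in ms if memory traffic were the only cost."""
    total_bytes = n_elements * bytes_per_element * streams
    seconds = total_bytes / (bandwidth_gb_s * 1e9)
    return seconds * 1e3


# scale reads src and writes dst: two streams of 4-byte floats.
# 20 GB/s is an assumption for illustration only.
floor = bandwidth_floor_ms(10_000_000, 4, streams=2, bandwidth_gb_s=20.0)
print(f"{floor:.1f} ms")  # 4.0 ms
```

If all three measured times sit near such a floor for your machine, the multiply itself is not the bottleneck and wider SIMD cannot help much.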
Why Eä
What Eä is
Eä is a compute kernel compiler. You write small, focused numerical routines in Eä's explicit syntax, compile them to native shared libraries (.so on Linux, .dll on Windows), and call them from Python, Rust, C++, or PyTorch via C ABI. Eä is not a general-purpose programming language. It has no standard library, no garbage collector, no runtime. It compiles kernels.
The problem
Python is slow for tight numerical loops. When you need to run a custom algorithm over millions of elements -- a stencil, a reduction with custom logic, a chain of fused multiply-adds -- you hit a wall. The usual options are:
- NumPy: fast for operations it supports, but custom multi-step algorithms require chaining calls that allocate intermediates and re-scan memory.
- Numba: JIT-compiles Python, but debugging is difficult, compilation is unpredictable, and SIMD vectorization is implicit (you hope the compiler figures it out).
- Cython: requires learning a hybrid language, managing build systems, and still gives you limited control over vectorization.
- Writing C extensions: full control, but high friction. Manual Python/C bridging, header files, build scripts.
Eä targets the gap: you want native-speed kernels with explicit SIMD, but you do not want to maintain a C build system.
The philosophy
Eä is built on one principle: explicit over implicit.
- If you write f32x8, you get 8-wide SIMD. The compiler will not silently fall back to scalar code.
- If hardware does not support an operation (e.g., scatter without AVX-512), the compiler errors. It does not emit slow scalar code behind your back.
- All types are explicit. No type inference. You see exactly what is happening.
- All memory is caller-provided. Eä kernels never allocate. Pointers come from the host language.
There are no hidden performance cliffs. The code you write is the code that runs.
When to use Eä
Eä excels at compute-bound workloads where you do significant work per element loaded from memory:
- Stencil operations: convolutions, blurs, edge detection -- each output element reads from multiple inputs.
- Fused multiply-add chains: polynomial evaluation, IIR filters, dot products.
- Custom reductions: computing statistics, finding patterns, accumulating results with non-trivial logic.
- Particle simulations: N-body interactions, force calculations.
- Image processing pipelines: per-pixel math with multiple operations fused into one pass.
The common thread: you load data, do many arithmetic operations on it, then store results. The CPU spends most of its time computing, not waiting for memory.
When NOT to use Eä
Eä is the wrong tool for:
- Bandwidth-bound workloads: if you are just adding two arrays element-wise, NumPy already saturates memory bandwidth. Eä cannot make memory faster.
- General programming: no strings, no file I/O, no networking, no data structures beyond structs.
- Prototyping: write your algorithm in Python first, profile it, then port the hot loop to Eä.
- GPU workloads: Eä targets CPUs (x86-64 with AVX2/AVX-512, AArch64 with NEON).
A good rule of thumb: if your inner loop does fewer than 4 arithmetic operations per element loaded, NumPy is probably fast enough.
The compilation model
kernel.ea --> ea --lib --> kernel.so + kernel.ea.json
                                       |
                                ea bind --python
                                       |
                          kernel.py (generated wrapper)
One .ea file is one compilation unit. There are no imports, no modules. If you need to compose kernels, you do it at the C level -- each kernel is an independent shared library with a C ABI entry point.
The generated .ea.json metadata file describes the function signatures. The ea bind command reads it to generate idiomatic wrappers for your target language.
Quick taste
Here is a complete Eä kernel that scales an array of floats using 8-wide SIMD:
export kernel vscale(data: *f32, out: *mut f32, factor: f32)
over i in n step 8
tail scalar {
out[i] = data[i] * factor
}
{
let vf: f32x8 = splat(factor)
store(out, i, load(data, i) .* vf)
}
Compile and use from Python:
ea kernel.ea --lib
ea bind kernel.ea --python
import numpy as np
from kernel import vscale
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0], dtype=np.float32)
result = vscale(data, factor=3.0)
# result: [3.0, 6.0, 9.0, 12.0, 15.0]
The main body processes 8 elements at a time with SIMD. The tail scalar block handles any remainder elements one at a time. The n parameter (the array length) is automatically injected into the function signature from the over i in n clause.
Language Basics
This page covers Eä's scalar language features. For SIMD vector types and operations, see SIMD.
Scalar types
Eä has fixed-size numeric types. No type inference -- you always write the type explicitly.
| Type | Description |
|---|---|
| i8, i16, i32, i64 | Signed integers |
| u8, u16, u32, u64 | Unsigned integers |
| f32, f64 | Floating point |
| bool | Boolean (true / false) |
Variables
All variables must have an explicit type annotation. Variables are immutable by default.
let x: i32 = 5
let y: f32 = 3.14
let flag: bool = true
To make a variable mutable, use mut:
let mut counter: i32 = 0
counter = counter + 1
Constants
Compile-time constants use const:
const PI: f32 = 3.14159
const BATCH_SIZE: i32 = 256
const EPSILON: f64 = 1e-9
Constants can be used in static_assert for compile-time checks:
const STEP: i32 = 8
static_assert(STEP > 0, "step must be positive")
Arithmetic and comparison
Standard arithmetic operators work on all numeric types:
let a: i32 = 10 + 3 // 13
let b: i32 = 10 - 3 // 7
let c: i32 = 10 * 3 // 30
let d: i32 = 10 / 3 // 3 (integer division)
let e: i32 = 10 % 3 // 1 (remainder)
Comparison operators return bool:
let lt: bool = a < b
let gt: bool = a > b
let le: bool = a <= b
let ge: bool = a >= b
let eq: bool = a == b
let ne: bool = a != b
Logical operators
Eä uses words, not symbols, for logical operations:
let both: bool = a > 0 and b > 0
let either: bool = a > 0 or b > 0
let neither: bool = not (a > 0 or b > 0)
Control flow
if / else if / else
if x > 0 {
println(1)
} else if x == 0 {
println(0)
} else {
println(-1)
}
while loops
let mut i: i32 = 0
while i < n {
out[i] = data[i] * 2
i = i + 1
}
for loops
Counted loops with an explicit step:
for i in 0..n step 1 {
out[i] = data[i] * 2
}
The step is required. The range 0..n is half-open: it includes 0 but excludes n.
foreach loops
A simpler counted loop when the step is always 1:
foreach (i in 0..n) {
out[i] = data[i] * 2
}
Loop unrolling
Wrap a loop in unroll(N) to unroll it at compile time:
unroll(4) {
foreach (j in 0..4) {
out[base + j] = data[base + j] * factor
}
}
Functions
Functions are declared with func. All parameter and return types are explicit:
func square(x: f32) -> f32 {
return x * x
}
func add(a: i32, b: i32) -> i32 {
return a + b
}
Functions without a return type return nothing (void):
func fill(out: *mut i32, n: i32, val: i32) {
foreach (i in 0..n) {
out[i] = val
}
}
Exported functions
To make a function callable from C/Python/Rust, prefix it with export:
export func dot_product(a: *f32, b: *f32, n: i32) -> f32 {
let mut sum: f32 = 0.0
foreach (i in 0..n) {
sum = sum + a[i] * b[i]
}
return sum
}
Only export func (and export kernel) produce symbols visible from outside. Non-exported functions are internal helpers.
Pointers
Pointers are how kernels receive data from the host language. There are four pointer variants:
| Syntax | Meaning |
|---|---|
| *T | Read-only pointer |
| *mut T | Mutable pointer (can write through it) |
| *restrict T | Read-only, no aliasing (enables optimizations) |
| *restrict mut T | Mutable, no aliasing |
Pointer indexing
Read from a pointer with bracket indexing:
let val: f32 = data[i] // data is *f32
Write through a mutable pointer:
out[i] = val // out is *mut f32
Type casts
Explicit casts convert between numeric types:
let x: i32 = 42
let f: f32 = to_f32(x) // 42.0
let d: f64 = to_f64(x) // 42.0
let back: i32 = to_i32(f) // 42
let wide: i64 = to_i64(x) // 42
There are no implicit conversions. Mixing types without a cast is a compile error.
println
println is the only output primitive. It exists for debugging:
println(42)
println(3.14)
println(true)
println("hello")
It accepts integers, floats, bools, and string literals. It does not support format strings.
What does not exist
Eä is deliberately minimal. The following features do not exist and are not planned:
- No generics or templates
- No traits or interfaces
- No modules or imports
- No heap allocation
- No strings (except literal arguments to println)
- No semicolons (statements are newline-separated)
- No closures or lambdas
- No enums or pattern matching
- No exceptions or error handling
One file, one compilation unit. Compose at the C level.
SIMD
SIMD (Single Instruction, Multiple Data) lets you process multiple values in a single CPU instruction. Eä gives you direct control over SIMD vectors -- what you write is what the CPU executes.
Vector types
A vector type holds a fixed number of elements of the same scalar type. The name encodes both the element type and the lane count:
| Type | Elements | Bits | x86 requirement | ARM requirement |
|---|---|---|---|---|
| f32x4 | 4 x f32 | 128 | SSE | NEON |
| i32x4 | 4 x i32 | 128 | SSE | NEON |
| u8x16 | 16 x u8 | 128 | SSE | NEON |
| i8x16 | 16 x i8 | 128 | SSE | NEON |
| i16x8 | 8 x i16 | 128 | SSE | NEON |
| f64x2 | 2 x f64 | 128 | SSE2 | NEON |
| f32x8 | 8 x f32 | 256 | AVX2 | not available |
| i32x8 | 8 x i32 | 256 | AVX2 | not available |
| u8x32 | 32 x u8 | 256 | AVX2 | not available |
| f64x4 | 4 x f64 | 256 | AVX2 | not available |
| f32x16 | 16 x f32 | 512 | AVX-512 | not available |
If you use f32x8 on a machine without AVX2, or on ARM, the compiler will error. No silent scalar fallback.
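The naming rule in the table (element type, then "x", then lane count) can be checked mechanically: the vector width in bits is the element width times the lane count. The sketch below is a plain-Python illustration; describe_vector_type is a hypothetical helper, not something the ea compiler or Python package provides.

```python
import re

# <kind><element bits>x<lanes>, e.g. f32x8 or u8x16
_VECTOR_NAME = re.compile(r"^([iuf])(\d+)x(\d+)$")


def describe_vector_type(name: str) -> tuple[str, int, int]:
    """Return (element type, lane count, total bits) for a vector type name."""
    m = _VECTOR_NAME.match(name)
    if m is None:
        raise ValueError(f"not a vector type name: {name!r}")
    kind, elem_bits, lanes = m.group(1), int(m.group(2)), int(m.group(3))
    return f"{kind}{elem_bits}", lanes, elem_bits * lanes


print(describe_vector_type("f32x8"))  # ('f32', 8, 256)
print(describe_vector_type("u8x16"))  # ('u8', 16, 128)
```

A result of 256 bits corresponds to the AVX2 rows above and 128 bits to the SSE / NEON rows. The helper only parses names; it does not know which combinations the compiler actually accepts.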
To enable 512-bit vectors, compile with --avx512:
ea kernel.ea --lib --avx512
Dot operators
Element-wise vector operations use dot-prefixed operators. This distinguishes them from scalar operations and makes SIMD explicit in the source:
| Operator | Meaning |
|---|---|
| .+ | Element-wise add |
| .- | Element-wise subtract |
| .* | Element-wise multiply |
| ./ | Element-wise divide |
| .> | Element-wise greater than (returns bool vector) |
| .< | Element-wise less than |
| .>= | Element-wise greater or equal |
| .<= | Element-wise less or equal |
| .== | Element-wise equal |
| .!= | Element-wise not equal |
| .& | Element-wise bitwise AND |
| .\| | Element-wise bitwise OR |
| .^ | Element-wise bitwise XOR |
| .<< | Element-wise shift left |
| .>> | Element-wise shift right (logical for unsigned, arithmetic for signed) |
Example:
let a: f32x4 = load(data, i)
let b: f32x4 = load(data, i + 4)
let sum: f32x4 = a .+ b
let product: f32x4 = a .* b
let mask: f32x4 = a .> b
Creating vectors
splat
Broadcast a scalar value to all lanes:
let factor: f32 = 2.5
let vf: f32x8 = splat(factor)
// vf = [2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5]
Loading from memory
Load a vector from a pointer at an element offset:
let v: f32x8 = load(data, i) // loads data[i..i+8]
The offset i is in elements, not bytes. load(data, 8) loads elements 8 through 15.
There are also typed load intrinsics when you want to be explicit about the element type:
let v: f32x8 = load_f32(data, i)
let v: f32x4 = load_f32x4(data, i)
Storing to memory
Write a vector to a mutable pointer:
store(out, i, result) // writes result to out[i..i+8]
Element access
Read individual lanes from a vector by index:
let v: f32x4 = load(data, 0)
let first: f32 = v[0]
let second: f32 = v[1]
Conditional selection
select picks lanes from two vectors based on a mask:
let mask: f32x4 = a .> b
let result: f32x4 = select(mask, a, b)
// where a > b, take a; otherwise take b
This compiles to a single blend instruction. There is no branching.
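The lane-wise behavior of a comparison followed by select can be modeled with Python lists. This is only a sketch of the results per lane; it says nothing about how the mask is represented in a register, and the two function names are stand-ins chosen for the illustration.

```python
def greater_than(a, b):
    """Model of a .> b: one boolean per lane."""
    return [x > y for x, y in zip(a, b)]


def select(mask, a, b):
    """Model of select: take a[k] where mask[k] is true, otherwise b[k]."""
    return [x if m else y for m, x, y in zip(mask, a, b)]


a = [1.0, 5.0, 3.0, 8.0]
b = [4.0, 2.0, 3.0, 9.0]
mask = greater_than(a, b)   # [False, True, False, False]
print(select(mask, a, b))   # [4.0, 5.0, 3.0, 9.0]
```

Every lane is computed unconditionally and the mask only chooses which result to keep, which is why no branch is needed.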
Scalar vs SIMD comparison
Here is the same operation -- scaling an array -- written both ways.
Scalar (processes one element per iteration):
export func scale(data: *f32, out: *mut f32, factor: f32, n: i32) {
foreach (i in 0..n) {
out[i] = data[i] * factor
}
}
SIMD (processes 8 elements per iteration):
export kernel vscale(data: *f32, out: *mut f32, factor: f32)
over i in n step 8
tail scalar {
out[i] = data[i] * factor
}
{
let vf: f32x8 = splat(factor)
store(out, i, load(data, i) .* vf)
}
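The SIMD version does 8 multiplications in a single instruction. On a workload that is compute-bound (multiple operations per element), this can translate to a speedup approaching the lane count, up to 8x here. On bandwidth-bound work such as this single multiply, the gain is much smaller.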
Hardware targeting
By default, Eä targets AVX2 on x86-64 and NEON on AArch64. You can change this:
# Default (AVX2 on x86)
ea kernel.ea --lib
# Enable AVX-512
ea kernel.ea --lib --avx512
# Target a specific CPU
ea kernel.ea --lib --target=skylake
# Cross-compile for ARM
ea kernel.ea --lib --target-triple=aarch64-unknown-linux-gnu
The compiler rejects code that requires features the target does not have. If you use f32x8 and target ARM, you get a compile error, not slow code.
Kernels
The kernel construct is Eä's main abstraction for writing vectorized loops with automatic tail handling. It is syntactic sugar -- the compiler transforms it into a plain function with a while-loop before any further compilation.
Basic syntax
export kernel name(params)
over i in n step S
tail strategy { tail_body }
{
main_body
}
- name: the kernel's name, becomes the C ABI symbol
- params: function parameters (pointers, scalars)
- over i in n: the loop variable i iterates from 0 to n
- step S: how many elements the main body processes per iteration
- tail strategy: how to handle remainder elements when n is not a multiple of S
- main_body: the code that runs for each full chunk of S elements
The range variable (here n) is automatically injected as an i32 parameter into the function signature. You do not declare it in the parameter list.
How it desugars
A kernel like this:
export kernel scale(data: *f32, out: *mut f32, factor: f32)
over i in n step 4
tail scalar { out[i] = data[i] * factor }
{
out[i] = data[i] * factor
out[i + 1] = data[i + 1] * factor
out[i + 2] = data[i + 2] * factor
out[i + 3] = data[i + 3] * factor
}
becomes equivalent to:
export func scale(data: *f32, out: *mut f32, factor: f32, n: i32) {
    let mut i: i32 = 0
    while i + 4 <= n {
        out[i] = data[i] * factor
        out[i + 1] = data[i + 1] * factor
        out[i + 2] = data[i + 2] * factor
        out[i + 3] = data[i + 3] * factor
        i = i + 4
    }
    // tail: process remainder one at a time
    while i < n {
        out[i] = data[i] * factor
        i = i + 1
    }
}
The generated C signature is void scale(float*, float*, float, int) -- the n parameter appears last.
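To make that signature concrete, here is a hedged sketch of how a host could describe it with Python's ctypes. No compiled library is loaded: a Python function with the same behavior stands in for the kernel so the example runs on its own. The name ScaleFn and the library path in the comment are illustrative assumptions.

```python
import ctypes

# Prototype matching: void scale(float*, float*, float, int)
ScaleFn = ctypes.CFUNCTYPE(
    None,                            # void return
    ctypes.POINTER(ctypes.c_float),  # data
    ctypes.POINTER(ctypes.c_float),  # out
    ctypes.c_float,                  # factor
    ctypes.c_int,                    # n, injected by the kernel construct
)


def _scale_reference(data, out, factor, n):
    # Python stand-in with the same behavior as the desugared kernel.
    for i in range(n):
        out[i] = data[i] * factor


# With a real build you would look the symbol up in the shared library
# instead, e.g. ctypes.CDLL("./kernel.so").scale (path is illustrative).
scale = ScaleFn(_scale_reference)

src = (ctypes.c_float * 5)(1.0, 2.0, 3.0, 4.0, 5.0)
dst = (ctypes.c_float * 5)()
scale(src, dst, 2.0, len(src))
print(list(dst))  # [2.0, 4.0, 6.0, 8.0, 10.0]
```

The point of the sketch is the argument order: the length is passed explicitly and comes last.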
Tail strategies
The tail handles remainder elements when n is not evenly divisible by the step.
tail scalar
Process remainder elements one at a time. The tail body runs in a loop with step 1:
export kernel vscale(data: *f32, out: *mut f32, factor: f32)
over i in n step 8
tail scalar {
out[i] = data[i] * factor
}
{
let vf: f32x8 = splat(factor)
store(out, i, load(data, i) .* vf)
}
The main body uses 8-wide SIMD. The tail body is scalar code that handles 0 to 7 leftover elements. This is the most common tail strategy.
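The split between the main body and the scalar tail can be modeled in plain Python. This is a behavioral sketch only, mirroring the desugared loop structure; vscale_model is a hypothetical name and the slice assignment merely stands in for one 8-wide SIMD operation.

```python
def vscale_model(data, factor, step=8):
    """Model of a step-8 kernel with a scalar tail. Returns (out, tail_count)."""
    n = len(data)
    out = [0.0] * n
    i = 0
    # Main body: one full chunk of `step` elements per iteration.
    while i + step <= n:
        out[i:i + step] = [x * factor for x in data[i:i + step]]
        i += step
    # Scalar tail: the 0 to step-1 leftover elements, one at a time.
    tail_count = 0
    while i < n:
        out[i] = data[i] * factor
        i += 1
        tail_count += 1
    return out, tail_count


out, tail_count = vscale_model([float(k) for k in range(11)], 2.0)
print(tail_count)  # 3
```

With 11 elements, the main body runs once (elements 0 to 7) and the tail handles the remaining 3.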
tail mask
Use masked load/store for the remainder. The tail body runs once (not in a loop) and must handle all remaining elements using masked operations:
export kernel vscale(data: *f32, out: *mut f32, factor: f32)
over i in len step 4
tail mask {
let rem: i32 = len - i
let vf: f32x4 = splat(factor)
let v: f32x4 = load_masked(data, i, rem)
store_masked(out, i, v .* vf, rem)
}
{
let vf: f32x4 = splat(factor)
store(out, i, load(data, i) .* vf)
}
The rem variable tells the masked operations how many elements are valid. This avoids the scalar loop entirely but requires masked intrinsics.
tail pad
The caller guarantees the input length is a multiple of the step. No tail body is generated:
export kernel fill(out: *mut i32, val: i32)
over i in n step 4
tail pad
{
out[i] = val
out[i + 1] = val
out[i + 2] = val
out[i + 3] = val
}
This produces the most efficient code but shifts the responsibility to the caller. If n is not a multiple of the step, the kernel will skip the remaining elements.
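On the caller side, meeting the tail pad contract usually means rounding the buffer length up to the next multiple of the step. A minimal sketch in plain Python follows; both helper names are hypothetical, and filling with zeros is an assumption that suits this example but may not suit every kernel.

```python
def padded_length(n: int, step: int) -> int:
    """Smallest multiple of step that is >= n."""
    return -(-n // step) * step  # ceiling division without floats


def pad_to_step(values, step, fill=0.0):
    """Return a copy of values extended with `fill` to a multiple of step."""
    padded = list(values)
    padded.extend([fill] * (padded_length(len(padded), step) - len(padded)))
    return padded


buf = pad_to_step([1.0, 2.0, 3.0, 4.0, 5.0], step=4)
print(len(buf), buf)  # 8 [1.0, 2.0, 3.0, 4.0, 5.0, 0.0, 0.0, 0.0]
```

The caller then passes the padded length as n and ignores the padding elements in the output.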
No tail clause
If you omit the tail clause entirely, the kernel only runs the main body. Remaining elements are not processed:
export kernel double_it(data: *i32, out: *mut i32)
over i in n step 1
{
out[i] = data[i] * 2
}
With step 1, there is no remainder, so omitting the tail is safe. With larger steps, you must ensure n is always a multiple of the step, or accept that trailing elements are skipped.
Complete example
A SIMD dot product kernel that handles any input length:
export kernel dot(a: *f32, b: *f32, out: *mut f32)
over i in n step 8
tail scalar {
out[0] = out[0] + a[i] * b[i]
}
{
let va: f32x8 = load(a, i)
let vb: f32x8 = load(b, i)
let products: f32x8 = va .* vb
let sum: f32 = reduce_add(products)
out[0] = out[0] + sum
}
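The main body loads 8-wide vectors, multiplies them, reduces to a scalar sum, and accumulates into out[0]. The scalar tail handles any remaining 0-7 elements. Because both bodies add to out[0] rather than overwrite it, the caller must set out[0] to zero before the call.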
Structs
Structs in Eä are plain data containers with C-compatible memory layout. They have no methods, no constructors, no impl blocks. They exist so you can pass structured data between Eä kernels and host languages.
Defining a struct
struct Particle {
x: f32,
y: f32,
mass: f32,
}
Fields can be any scalar type. The memory layout matches C struct layout, so a Particle in Eä is identical to:
typedef struct { float x; float y; float mass; } Particle;
Creating and accessing
Inside Eä functions, you create a struct with literal syntax and access fields with dot notation:
func main() {
let p: Particle = Particle { x: 1.0, y: 2.0, mass: 10.0 }
println(p.x) // 1
println(p.mass) // 10
}
Mutable structs support field assignment:
func main() {
let mut p: Particle = Particle { x: 0.0, y: 0.0, mass: 1.0 }
p.x = 3.5
p.y = 7.0
println(p.x) // 3.5
}
Struct pointers
In exported functions, structs are typically passed via pointer from the host language:
struct Point {
x: f32,
y: f32,
}
export func get_x(p: *Point) -> f32 {
return p.x
}
export func set_point(p: *mut Point, nx: f32, ny: f32) {
p.x = nx
p.y = ny
}
Read-only pointers (*Point) allow field reads. Mutable pointers (*mut Point) allow field writes.
Arrays of structs
Pointer indexing works with struct arrays. Each element is a full struct:
struct Vec2 {
x: f32,
y: f32,
}
export func sum_x(vecs: *Vec2, n: i32) -> f32 {
let mut total: f32 = 0.0
let mut i: i32 = 0
while i < n {
total = total + vecs[i].x
i = i + 1
}
return total
}
From C, this is called with a pointer to a contiguous array of structs:
typedef struct { float x; float y; } Vec2;
extern float sum_x(const Vec2*, int);
Vec2 vecs[] = { {1.0f, 10.0f}, {2.0f, 20.0f}, {3.0f, 30.0f} };
float result = sum_x(vecs, 3); // 6.0
Passing from Python
Since Eä structs match C layout, you can use NumPy structured arrays or ctypes:
import numpy as np
particle_dtype = np.dtype([
('x', np.float32),
('y', np.float32),
('mass', np.float32),
])
particles = np.zeros(1000, dtype=particle_dtype)
The generated Python bindings handle the pointer passing automatically.
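For the ctypes route, the same Particle layout can be declared as a ctypes.Structure. This sketch only checks the layout against the C typedef shown earlier; it does not call any Eä kernel.

```python
import ctypes


class Particle(ctypes.Structure):
    # Same field order and types as the Eä struct and the C typedef.
    _fields_ = [
        ("x", ctypes.c_float),
        ("y", ctypes.c_float),
        ("mass", ctypes.c_float),
    ]


print(ctypes.sizeof(Particle))  # 12
print(Particle.mass.offset)     # 8

# A contiguous array of 1000 structs, zero-initialized by ctypes.
particles = (Particle * 1000)()
particles[0].x = 1.5
print(ctypes.sizeof(particles))  # 12000
```

Three 4-byte floats need no padding, so the struct is 12 bytes and an array of them is tightly packed.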
Limitations
- No methods or impl blocks. Structs are data only.
- No nested structs.
- No generics. Write separate struct definitions for each concrete type.
- Struct fields must be scalar types.
Common Intrinsics
This page covers the most frequently used intrinsics. For the complete list, see the Intrinsics Reference.
splat
Broadcast a scalar to all lanes of a vector:
let factor: f32 = 2.5
let vf: f32x8 = splat(factor)
// vf = [2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5]
Works with all vector types. The return type is inferred from the variable's type annotation.
load / store
Load a vector from a pointer at an element offset. Store writes a vector back:
let v: f32x8 = load(data, i) // read 8 floats starting at data[i]
store(out, i, v) // write 8 floats starting at out[i]
Offsets are in elements, not bytes.
Typed loads
When you want to be explicit about the element type:
let v: f32x8 = load_f32(data, i)
let v: f32x4 = load_f32x4(data, i)
These are equivalent to plain load but make the element type visible in the source.
stream_store
Non-temporal store: bypasses the cache and is intended for write-only output. It accepts a vector or a scalar value (v1.15.0 added scalar i16/u16/i32/u32/i64/u64). See the reference for the full alignment contract, ordering contract, and anti-patterns.
stream_store(out, i, result) // vector
stream_store(out, i, scalar_value) // scalar (v1.15.0+)
fma
Fused multiply-add: computes a * b + c in a single instruction with a single rounding (more accurate than separate multiply and add):
// Scalar
let result: f32 = fma(a, b, c)
// Vector
let va: f32x8 = load(a, i)
let vb: f32x8 = load(b, i)
let vc: f32x8 = load(c, i)
let result: f32x8 = fma(va, vb, vc)
Works on f32, f64, and all float vector types. Maps to the hardware FMA instruction.
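The effect of a single rounding can be shown without FMA hardware by computing a * b + c exactly with rational arithmetic and rounding once at the end. The sketch below uses Python floats, which are 64-bit, so it illustrates the f64 case; fma_model is a stand-in written for this example, not an Eä or Python built-in.

```python
from fractions import Fraction


def fma_model(a: float, b: float, c: float) -> float:
    """a * b + c computed exactly, then rounded once (models a fused op)."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# Separate ops: the exact product 1 - 2**-60 rounds to 1.0, then 1.0 - 1.0.
separate = a * b + c
# Fused: the exact product is kept, so the tiny difference survives.
fused = fma_model(a, b, c)

print(separate)            # 0.0
print(fused == -2.0**-60)  # True
```

The separate multiply and add lose the result entirely, while the single-rounding version returns the exact answer.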
reduce_add
Sum all lanes of a vector down to a scalar:
let v: f32x8 = load(data, i)
let sum: f32 = reduce_add(v)
Works on all integer and float vector types. Useful for dot products, reductions, and histogram accumulation.
reduce_max / reduce_min
Find the maximum or minimum value across all lanes:
let v: f32x4 = load(data, i)
let biggest: f32 = reduce_max(v)
let smallest: f32 = reduce_min(v)
Works on integer and float vector types.
select
Per-lane conditional: where the mask is true, take from a; where false, take from b:
let mask: f32x8 = a .> b
let result: f32x8 = select(mask, a, b)
// element-wise max(a, b)
Compiles to a blend instruction. No branching. This is how you write branchless SIMD code.
sqrt / rsqrt
Square root and reciprocal square root (1/sqrt):
let x: f32 = 16.0
let root: f32 = sqrt(x) // 4.0
let v: f32x8 = load(data, i)
let roots: f32x8 = sqrt(v)
let inv_roots: f32x8 = rsqrt(v)
rsqrt uses the fast hardware approximation. Works on f32, f64, and float vector types.
min / max
Element-wise minimum and maximum. Works on both scalars and vectors:
// Scalar
let smaller: f32 = min(a, b)
let larger: f32 = max(a, b)
// Vector
let va: f32x8 = load(data, i)
let vb: f32x8 = load(data, j)
let mins: f32x8 = min(va, vb)
let maxs: f32x8 = max(va, vb)
movemask
Extract comparison results to an integer bitmask. x86 only:
let mask: f32x8 = a .> b
let bits: i32 = movemask(mask)
// bit k is 1 if lane k of a > lane k of b
Useful for branching on SIMD comparison results or counting matching elements. Each bit in the result corresponds to one vector lane.
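The bit layout described above (bit k corresponds to lane k) can be modeled in a few lines of Python. This is a behavioral sketch using a list of booleans as the mask; the function below is a stand-in, not the intrinsic itself.

```python
def movemask(mask):
    """Pack one boolean per lane into an integer, lane k -> bit k."""
    bits = 0
    for lane, is_set in enumerate(mask):
        if is_set:
            bits |= 1 << lane
    return bits


a = [1.0, 5.0, 3.0, 8.0, 0.0, 9.0, 2.0, 7.0]
b = [4.0, 2.0, 3.0, 9.0, 1.0, 6.0, 2.0, 1.0]
mask = [x > y for x, y in zip(a, b)]  # lanes 1, 5 and 7 are true

bits = movemask(mask)
print(bin(bits))             # 0b10100010
print(bin(bits).count("1"))  # 3
```

Testing bits against zero answers "did any lane match", and counting the set bits answers "how many lanes matched".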
sat_add / sat_sub
Saturating addition and subtraction. Values clamp to the type's min/max instead of wrapping on overflow. Cross-platform (ARM NEON + x86 SSE2):
let bright: u8x16 = sat_add(pixels, boost) // clamps at 255
let dark: u8x16 = sat_sub(pixels, reduce) // clamps at 0
Works with i8x16, u8x16, i16x8, u16x8. Signed vs unsigned saturation is determined by the element type. Both arguments must have the same type.
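The difference between saturating and wrapping arithmetic is easy to see on u8 values. The following plain-Python model covers only the unsigned 8-bit case; the function names are made up for the illustration.

```python
def sat_add_u8(a, b):
    """Lane-wise unsigned 8-bit add that clamps at 255."""
    return [min(255, x + y) for x, y in zip(a, b)]


def sat_sub_u8(a, b):
    """Lane-wise unsigned 8-bit subtract that clamps at 0."""
    return [max(0, x - y) for x, y in zip(a, b)]


pixels = [10, 200, 250, 0]
boost = [20, 100, 5, 0]

print(sat_add_u8(pixels, boost))  # [30, 255, 255, 0]
print(sat_sub_u8(pixels, boost))  # [0, 100, 245, 0]
print((200 + 100) & 0xFF)         # 44, what a wrapping add would give
```

For image brightness, clamping at 255 keeps a bright pixel bright, whereas wrapping would turn it dark.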
Masked memory operations
For tail handling, masked loads and stores read/write only the valid lanes:
let rem: i32 = n - i
let v: f32x4 = load_masked(data, i, rem) // load only 'rem' elements
store_masked(out, i, result, rem) // store only 'rem' elements
The rem parameter specifies how many elements (starting from lane 0) are valid. Lanes beyond rem are zero-filled on load and not written on store.
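The zero-fill and skip-write rules can be modeled with Python lists. This is a sketch of the semantics stated above for a 4-lane vector, with stand-in functions named after the intrinsics; it does not reflect how the hardware implements masking.

```python
def load_masked(data, i, rem, lanes=4):
    """Read `rem` valid lanes starting at data[i]; zero-fill the rest."""
    return [data[i + k] if k < rem else 0.0 for k in range(lanes)]


def store_masked(out, i, vec, rem):
    """Write only the first `rem` lanes of vec to out[i:]."""
    for k in range(min(rem, len(vec))):
        out[i + k] = vec[k]


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
out = [-1.0] * 6
i = 4                           # main body already covered elements 0..3
rem = len(data) - i             # 2 elements left
v = load_masked(data, i, rem)   # [5.0, 6.0, 0.0, 0.0]
store_masked(out, i, [x * 2.0 for x in v], rem)
print(out)  # [-1.0, -1.0, -1.0, -1.0, 10.0, 12.0]
```

Neither function touches memory past the end of the array, which is what makes a single masked iteration a safe replacement for the scalar tail loop.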
Summary table
| Intrinsic | Input | Output | Description |
|---|---|---|---|
| splat(s) | scalar | vector | Broadcast to all lanes |
| load(ptr, i) | pointer, offset | vector | Load vector from memory |
| store(ptr, i, v) | pointer, offset, vector | void | Write vector to memory |
| stream_store(ptr, i, v) | pointer, offset, vector or scalar | void | Non-temporal write |
| fence_nt() | none | void | Store-store barrier for stream_store ordering |
| fma(a, b, c) | 3 values | same type | a * b + c fused |
| reduce_add(v) | vector | scalar | Sum all lanes |
| reduce_max(v) | vector | scalar | Max across lanes |
| reduce_min(v) | vector | scalar | Min across lanes |
| select(m, a, b) | mask, 2 vectors | vector | Per-lane conditional |
| sqrt(x) | scalar/vector | same type | Square root |
| rsqrt(x) | scalar/vector | same type | Reciprocal square root |
| min(a, b) | 2 values | same type | Element-wise minimum |
| max(a, b) | 2 values | same type | Element-wise maximum |
| sat_add(a, b) | 2 int vectors | same type | Saturating addition |
| sat_sub(a, b) | 2 int vectors | same type | Saturating subtraction |
| movemask(m) | bool vector | i32 | Extract lane mask to bits |
Specification
This page is the Eä language and library specification for eacompute. It
is the normative reference for what the compiler accepts and the semantics
of every built-in operation.
The specification is structured as the following parts:
- Type System: scalar, vector, pointer, and struct types; type rules including integer/float literal defaults and forbidden implicit conversions.
- All Intrinsics: every built-in function: memory (load/store/gather), math, reduction, vector, conversion, debug.
- CLI Reference: the ea driver: compile, bind, inspect, print-target; target-feature flags (--avx512, --fp16, --i8mm, --dotprod); cross-compilation triple syntax.
- ARM / NEON: AArch64-specific intrinsic surface, vector-width constraints, cross-compilation recipe, and the portable-kernels pattern.
- Binding Annotations: out qualifier, length-collapse via [cap: n], and the .ea.json metadata schema consumed by language bindings.
- Python API: the module surface generated by ea bind --python.
- ea bench: manifest schema, harness JSONL contract, baseline diff semantics.
Normative vs informative
The seven documents above are normative: the compiler's behavior is what they describe, and divergence is a bug in either the compiler or the spec.
The Guide and the Cookbook are informative — teaching material and worked examples. Where they conflict with the spec, the spec wins.
Stability and deprecation
The intrinsic and library API surface follows a deprecation cycle:
- Minor release. The old name continues to work and emits a
deprecation warning at every call site. The replacement is registered
in src/typeck/deprecations.rs, and a migration entry lands in the
upcoming-major-release file under docs/migrations/.
- Major release. The old name is removed. Callers who ignored the
warning now get an unknown intrinsic error; the migration file is the
canonical recipe.
The cargo public-api CI gate prevents accidental Rust-side public-API
drift (snapshot at docs/public-api.txt).
Version
This specification reflects eacompute on main. For the spec at a
specific tagged version, browse the same files from that tag.
Type System
Eä is statically typed with no implicit conversions. Every variable, parameter, and expression has a concrete type known at compile time.
Scalar Types
| Type | Size | Description |
|---|---|---|
| i8 | 1 byte | Signed 8-bit integer |
| u8 | 1 byte | Unsigned 8-bit integer |
| i16 | 2 bytes | Signed 16-bit integer |
| u16 | 2 bytes | Unsigned 16-bit integer |
| i32 | 4 bytes | Signed 32-bit integer |
| u32 | 4 bytes | Unsigned 32-bit integer |
| i64 | 8 bytes | Signed 64-bit integer |
| u64 | 8 bytes | Unsigned 64-bit integer |
| f32 | 4 bytes | 32-bit float (IEEE 754) |
| f64 | 8 bytes | 64-bit float (IEEE 754) |
| bool | 1 byte | Boolean (true or false) |
Integer literals default to i32. Float literals default to f32. Use explicit casts (to_f64(x), to_i64(x)) to convert between types.
Vector Types
Vector types hold multiple lanes of the same scalar type. Element-wise operations use dot-operators (.+, .-, .*, ./, etc.).
128-bit Vectors -- SSE (x86) / NEON (ARM)
| Type | Lanes | Element | Size |
|---|---|---|---|
| f32x4 | 4 | f32 | 16 bytes |
| f64x2 | 2 | f64 | 16 bytes |
| i32x4 | 4 | i32 | 16 bytes |
| u32x4 | 4 | u32 | 16 bytes |
| i16x8 | 8 | i16 | 16 bytes |
| u16x8 | 8 | u16 | 16 bytes |
| i8x16 | 16 | i8 | 16 bytes |
| u8x16 | 16 | u8 | 16 bytes |
| u64x2 | 2 | u64 | 16 bytes |
256-bit Vectors -- AVX2 (x86 only)
| Type | Lanes | Element | Size |
|---|---|---|---|
| f32x8 | 8 | f32 | 32 bytes |
| f64x4 | 4 | f64 | 32 bytes |
| i32x8 | 8 | i32 | 32 bytes |
| i16x16 | 16 | i16 | 32 bytes |
| u16x16 | 16 | u16 | 32 bytes |
| i8x32 | 32 | i8 | 32 bytes |
| u8x32 | 32 | u8 | 32 bytes |
| u64x4 | 4 | u64 | 32 bytes |
These types produce a compile error on ARM targets.
512-bit Vectors -- AVX-512 (x86, --avx512 flag required)
f32x16, f64x8, i32x16, and u64x8 lower to AVX-512F instructions.
i8x64, u8x64, i16x32, and u16x32 additionally require AVX-512BW
(byte/word SIMD), which is present on Skylake-SP, Ice Lake, Zen 4, and
later. Eä emits both feature flags when --avx512 is set.
| Type | Lanes | Element | Size | Feature |
|---|---|---|---|---|
f32x16 | 16 | f32 | 64 bytes | AVX-512F |
f64x8 | 8 | f64 | 64 bytes | AVX-512F |
i32x16 | 16 | i32 | 64 bytes | AVX-512F |
u64x8 | 8 | u64 | 64 bytes | AVX-512F |
i16x32 | 32 | i16 | 64 bytes | AVX-512BW |
u16x32 | 32 | u16 | 64 bytes | AVX-512BW |
i8x64 | 64 | i8 | 64 bytes | AVX-512BW |
u8x64 | 64 | u8 | 64 bytes | AVX-512BW |
Using these types without --avx512 produces a compile error.
Half-precision and sub-128-bit narrow widths
The lexer also accepts f16x4 / f16x8 (half-precision float vectors) and
sub-128-bit narrow widths used by ARM NEON widening intrinsics: i8x4,
i8x8, u8x8, i16x4, u16x4, i32x2. These types have target-specific
constraints: f16 requires --fp16 on aarch64 and is unavailable on
plain x86, and the narrow widths exist primarily as inputs to widening
multiplies like wmul_i32(i16x4, i16x4) -> i32x4. The constraints are
documented alongside the intrinsics that consume these types rather than
as standalone table rows. See ARM / NEON reference for f16 and
All Intrinsics for the narrow-width consumers.
Pointer Types
Pointers represent caller-provided memory. Eä never allocates -- all memory comes from the host language.
| Syntax | Description |
|---|---|
*T | Immutable pointer to T |
*mut T | Mutable pointer to T |
*restrict T | Immutable pointer, no-alias guarantee |
*restrict mut T | Mutable pointer, no-alias guarantee |
The restrict qualifier tells the compiler that the pointer does not alias other pointers, enabling stronger optimizations. Use it when you can guarantee non-overlapping memory.
Struct Types
User-defined value types declared with struct:
struct Pixel {
r: f32,
g: f32,
b: f32,
a: f32,
}
Structs are passed by value. Access fields with dot syntax: pixel.r. Structs can contain scalar types, vector types, and other structs.
Type Rules
- No implicit conversions between types. Use to_f32(), to_i32(), etc.
- No generics or polymorphism. Write separate functions for each type.
- Vector dot-operators require both operands to have the same vector type.
- Comparisons on vectors (.==, .<, .>) produce boolean vectors, not scalar bools.
All Intrinsics
Complete reference for every built-in function in Eä.
Memory
load
Load a vector from a pointer at a byte offset. The return type is inferred from context.
let v: f32x8 = load(ptr, i);
Typed Scalar Loads
Load a single scalar value from a pointer at a byte offset.
| Intrinsic | Return Type |
|---|---|
load_f32(ptr, i) | f32 |
load_f64(ptr, i) | f64 |
load_i32(ptr, i) | i32 |
load_i16(ptr, i) | i16 |
load_i8(ptr, i) | i8 |
load_u8(ptr, i) | u8 |
load_u16(ptr, i) | u16 |
load_u32(ptr, i) | u32 |
load_u64(ptr, i) | u64 |
let x: f32 = load_f32(ptr, i);
Typed Vector Loads
Load a full vector from a pointer at a byte offset.
| Intrinsic | Return Type |
|---|---|
load_f32x4(ptr, i) | f32x4 |
load_f32x8(ptr, i) | f32x8 |
load_f32x16(ptr, i) | f32x16 |
load_i32x4(ptr, i) | i32x4 |
load_i32x8(ptr, i) | i32x8 |
load_i16x8(ptr, i) | i16x8 |
load_i8x16(ptr, i) | i8x16 |
load_u8x16(ptr, i) | u8x16 |
load_u8x32(ptr, i) | u8x32 |
let v: f32x8 = load_f32x8(data, i * 32);
store
Write a vector to a pointer at a byte offset.
store(out, i, result);
stream_store
Non-temporal store that bypasses the CPU cache. Use for write-only output
the kernel will not read back soon. Pairs with prefetch_nta (the read-side
non-temporal hint) and fence_nt (the ordering primitive).
stream_store(out, i, v) // vector form — v is f32xN, i32xN, etc.
stream_store(out, i, scalar_value) // scalar form (v1.15.0) — i16/u16/i32/u32/i64/u64
Target lowering (verified on LLVM 18.1.8, x86_64 Zen 4 and aarch64 Cortex-A76):
| Width / form | x86 | aarch64 |
|---|---|---|
| Vector 128-bit (f32x4, i32x4, ...) | movntps / movntdq | stnp d, d, [x] (LLVM splits the 128-bit q-register into a d-pair to use the only available aarch64 NT store) |
| Vector 256-bit (AVX2) | vmovntps / vmovntdq | n/a (256-bit not on NEON) |
| Vector 512-bit (AVX-512) | vmovntps / vmovntdq zmm | n/a |
| Scalar i64 / u64 | movnti (SSE2, 64-bit mode) | stnp w, w, [x] (LLVM splits the i64 into a w-pair to use stnp; emits an lsr for the high half) |
| Scalar i32 / u32 | movnti (SSE2) | plain str — NT hint silently dropped |
| Scalar i16 / u16 | regular mov — NT hint silently dropped | plain strh — NT hint silently dropped |
aarch64 has no scalar non-temporal store instruction. The only NT store on
aarch64 is stnp (Store Non-temporal Pair), which requires two operands. LLVM
18 honors !nontemporal only when it can synthesize an stnp:
- 64-bit scalars (i64/u64) self-pair to a w register pair: NT hint preserved.
- 128-bit vectors self-pair to a d register pair: NT hint preserved.
- 32-bit and 16-bit scalars have no pair-friendly form: LLVM emits plain str/strh and the NT hint is dropped silently.
For aarch64 NT semantics, prefer 64-bit or wider element widths. The i32/u32
and i16/u16 scalar overloads still type-check and run, but provide no cache-
bypass benefit on aarch64; the same is true of i16/u16 on x86 (no movnti16
exists). They ship for cross-platform shape symmetry — a single Eä kernel using
stream_store compiles and runs on both targets without per-width branching.
Note also that LLVM 18 does not fuse two adjacent stream_store(*mut i32, ...)
calls into a single stnp pair on aarch64 — they lower to a regular stp with
the NT hint dropped. If you need NT-paired stores on aarch64, write the data as
i64 (two 32-bit values packed) or as a vector type.
Alignment contract:
Vector stream_store requires the destination pointer plus byte offset to
be aligned to the vector's natural size (16 bytes for 128-bit, 32 bytes for
256-bit, 64 bytes for 512-bit). Misaligned NT vector stores raise a general
protection fault on x86. Scalar stream_store requires natural alignment to
the scalar size on x86 (4-byte for i32/u32, 8-byte for i64/u64). Callers
must provide aligned buffers; Eä does not insert runtime alignment checks.
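Because Eä inserts no runtime checks, the host can verify the alignment contract before calling a kernel. The sketch below is a hypothetical helper (is_nt_aligned is not part of the ea package); it only tests whether base address plus byte offset is a multiple of the store width.

```python
def is_nt_aligned(base_address: int, byte_offset: int, width_bytes: int) -> bool:
    """Return True when base_address + byte_offset is a multiple of width_bytes.

    width_bytes is the size of the stored value: 16, 32, or 64 for vector
    stores, 4 or 8 for the i32/u32 and i64/u64 scalar forms.
    """
    if width_bytes <= 0:
        raise ValueError("width_bytes must be positive")
    return (base_address + byte_offset) % width_bytes == 0


# A 256-bit (32-byte) store at byte offset 32 from a 64-byte-aligned base
# satisfies the contract; the same store at byte offset 16 does not.
print(is_nt_aligned(0x1000, 32, 32))  # True
print(is_nt_aligned(0x1000, 16, 32))  # False
```

With NumPy, the base address of an array's buffer is available as arr.ctypes.data, so the check can run on the actual output buffer.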
Ordering contract:
NT stores are weakly ordered on x86 (write-combining memory order). Other
cores or subsequent reads in the same thread may observe them out of
program order. For cross-thread visibility, the typical pathway is through
a host-side synchronization primitive after the kernel returns
(pthread_join, rayon::scope, WaitGroup.Wait) — these provide release
semantics that flush WC buffers. For intra-kernel ordering (writing then
reading the same memory in the same kernel call), use fence_nt()
explicitly. Eä does not insert an implicit fence at kernel return.
When NOT to use:
Do not use stream_store for working buffers the same kernel reads back.
The non-temporal hint asks the cache to not keep the line; if the kernel
reads the data soon afterward, the read goes to DRAM and is slower than a
regular store followed by a cache hit. Working-buffer examples that
should use plain store:
- Softmax accumulators (e.g. scores_buf in attention kernels)
- FWHT scratch arrays (e.g. scratch in JL-projection kernels)
- Per-iteration partial sums or running statistics
stream_store is appropriate when the destination is a final output passed
to the next kernel call, a memory region the current kernel never re-reads,
or a buffer that will not be touched again until a downstream consumer
pulls it from DRAM later.
fence_nt
Store-store memory barrier providing intra-kernel ordering of preceding
stream_store operations. Zero arguments, returns void.
fence_nt()
Target lowering:
| Target | Instruction |
|---|---|
| x86 | sfence (via @llvm.x86.sse.sfence) |
| aarch64 | dmb ishst (via @llvm.aarch64.dmb with operand 10) |
These are the narrowest available barriers for store-only ordering —
explicit target intrinsics rather than the IR-level fence release, which
would lower to mfence on x86 and dmb ish on aarch64 (both heavier than
needed for NT-store ordering).
Semantics:
fence_nt() orders stream_store writes relative to each other and
relative to subsequent regular stores. It does not order stores relative
to subsequent loads — for a write-then-read-back pattern in the same kernel,
a full barrier (mfence on x86, dmb sy on aarch64) is needed instead.
Eä does not currently expose a full-barrier intrinsic.
When to use:
Use fence_nt() when the same kernel writes via stream_store to multiple
non-overlapping regions in a defined order and a later kernel (or downstream
reader) relies on observing those writes in the same order. This is
uncommon — most callers don't need it, because cross-thread visibility
comes from the host's sync primitive (pthread_join, rayon::scope,
WaitGroup.Wait) which already provides release semantics that flush
write-combining buffers between threads.
When NOT to use:
- Between successive stream_store calls to different addresses if no store-ordering requirement exists; NT stores to the same address complete in program order regardless of fences.
- At the end of a kernel as "insurance"; the caller's sync primitive handles cross-thread fencing more efficiently after the kernel returns.
- For write-then-read-back patterns; fence_nt() does not provide store-to-load ordering. Use a regular store for the working data, or do not read NT-written data back in the same kernel.
load_masked
Masked vector load. Lanes where the mask is false are not loaded.
let v: f32x8 = load_masked(ptr, i, mask);
store_masked
Masked vector store. Only lanes where the mask is true are written.
store_masked(out, i, value, mask);
gather
Load elements from scattered memory addresses using an index vector. x86 only -- not available on ARM.
let v: f32x8 = gather(ptr, indices);
scatter
Store elements to scattered memory addresses using an index vector. AVX-512 only (--avx512 flag required).
scatter(ptr, indices, values);
prefetch
Issue a read-intent prefetch hint to bring data into all cache levels (T0).
prefetch(ptr, i)
Lowers to prefetcht0 on x86 / prfm pldl1keep on aarch64.
prefetch_write
Issue a write-intent prefetch hint. Signals the cache coherence protocol to acquire the target line in modified state ahead of the store, avoiding a read-for-ownership stall when the store retires. Use on the upcoming write target of memory-bound store-heavy kernels (e.g. chacha20 ciphertext output, dequantize destinations).
prefetch_write(ptr, i)
Lowers to prefetchw on x86 (requires PRFCHW CPUID; falls back to
prefetcht0 on older CPUs) / prfm pstl1keep on aarch64.
prefetch_nta
Issue a non-temporal prefetch hint: bring the line into L1 only and mark it for early eviction. Use for streaming reads that the kernel touches exactly once and that should not pollute L1/L2 (e.g. Q4 dequantize input, large one-pass scans).
prefetch_nta(ptr, i)
Lowers to prefetchnta on x86 / prfm pldl1strm on aarch64.
Prefetch hint summary
| Intrinsic | (rw, locality) | x86 | aarch64 |
|---|---|---|---|
prefetch | (0, 3) | prefetcht0 | prfm pldl1keep |
prefetch_write | (1, 3) | prefetchw | prfm pstl1keep |
prefetch_nta | (0, 0) | prefetchnta | prfm pldl1strm |
All three accept (ptr, integer-offset), return void, and are valid in
any expression-statement position inside a function body.
Math
sqrt
Square root. Works on scalar f32/f64 and all float vector types.
let y: f32 = sqrt(x);
let v: f32x8 = sqrt(vec);
rsqrt
Reciprocal square root (approximate). Scalar f32 and f32 vector types.
let y: f32 = rsqrt(x);
let v: f32x8 = rsqrt(vec);
exp
Exponential function. Float types.
let y: f32 = exp(x);
fma
Fused multiply-add: computes a * b + c in a single operation with one rounding step. Works on scalar f32 and all float vector types.
let y: f32 = fma(a, b, c);
let v: f32x8 = fma(va, vb, vc);
min
Element-wise minimum. Works on scalar (i32, f32, f64) and vector types.
let m: i32 = min(a, b);
let v: f32x8 = min(va, vb);
max
Element-wise maximum. Works on scalar (i32, f32, f64) and vector types.
let m: i32 = max(a, b);
let v: f32x8 = max(va, vb);
Reduction
Reduce a vector to a single scalar value.
reduce_add
Sum all lanes.
let sum: f32 = reduce_add(v); // f32x8 -> f32
let sum: i32 = reduce_add(iv); // i32x8 -> i32
reduce_max
Maximum across all lanes.
let m: f32 = reduce_max(v);
reduce_min
Minimum across all lanes.
let m: f32 = reduce_min(v);
reduce_add_fast
Unordered float reduction. Faster than reduce_add but does not guarantee summation order, so results may differ slightly due to floating-point rounding. Float vectors only.
let sum: f32 = reduce_add_fast(v);
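To see why summation order matters, the following Python sketch adds the same four values in two different orders. It is only an illustration of floating-point rounding: Python floats are f64 rather than f32, and the order that reduce_add_fast actually uses is not specified here.

```python
def sum_in_order(lanes):
    """Add lanes strictly left to right."""
    total = 0.0
    for x in lanes:
        total += x
    return total


def sum_pairwise(lanes):
    """Tree reduction: add adjacent pairs until one value remains.

    Expects a power-of-two lane count, as vector types have.
    """
    values = list(lanes)
    while len(values) > 1:
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
    return values[0]


lanes = [1e16, 1.0, -1e16, 1.0]  # exact sum is 2.0
print(sum_in_order(lanes))   # 1.0
print(sum_pairwise(lanes))   # 0.0
```

Both results are valid roundings of the same inputs; neither equals the exact sum. Kernels that must be bit-reproducible across targets should therefore avoid unordered reductions.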
Vector
splat
Broadcast a scalar value to all lanes of a vector. The vector type is inferred from context.
let v: f32x8 = splat(1.0);
shuffle
Compile-time index-driven vector shuffle. Two forms:
Single-source — permute lanes within one vector.
let reversed: f32x4 = shuffle(v, [3, 2, 1, 0])
Each index is in [0, width).
Two-source — pick lanes from two vectors of the same type.
let zipped: f32x8 = shuffle(a, b, [0, 8, 1, 9, 2, 10, 3, 11])
Indices in [0, width) select from a; indices in [width, 2 * width) select lane i - width from b. Common patterns: interleave (lower-half zip), blend (pick lanes by index), and concatenate-permute (lower-half from one source, upper from the other with permutation).
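The index rules for both forms can be written as a small Python reference model. This is a sketch of the semantics described above, not the compiler's implementation; shuffle1 and shuffle2 are illustrative names for the single-source and two-source forms.

```python
def shuffle1(v, indices):
    """Single-source form: each index must be in [0, width)."""
    width = len(v)
    for i in indices:
        if not 0 <= i < width:
            raise ValueError(f"index {i} out of range for width {width}")
    return [v[i] for i in indices]


def shuffle2(a, b, indices):
    """Two-source form: [0, width) picks from a, [width, 2*width) from b."""
    if len(a) != len(b):
        raise ValueError("both sources must have the same width")
    width = len(a)
    out = []
    for i in indices:
        if not 0 <= i < 2 * width:
            raise ValueError(f"index {i} out of range for width {width}")
        out.append(a[i] if i < width else b[i - width])
    return out


a = [0, 1, 2, 3, 4, 5, 6, 7]
b = [100, 101, 102, 103, 104, 105, 106, 107]
# Lower-half zip (interleave), matching the example indices above.
print(shuffle2(a, b, [0, 8, 1, 9, 2, 10, 3, 11]))
# [0, 100, 1, 101, 2, 102, 3, 103]
```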
permute_runtime
Runtime-indexed permute of an 8-lane vector. result[k] = table[indices[k] & 0x7]. Index lanes are 3-bit-masked by the hardware — the upper 29 bits of each i32 index are ignored. Out-of-range indices wrap; they do not trap.
- x86: single instruction (vpermps for f32, vpermd for i32). Requires AVX2.
- ARM (NEON): not supported. Compile-time error pointing at the NEON runtime-permute idiom.
let table: f32x8 = load(matrix_row, 0) // 6 active + 2 don't-care
let indices: i32x8 = load(types_v, 0) // values in [0..5]
let strengths: f32x8 = permute_runtime(table, indices)
| Signature | (f32x8, i32x8) -> f32x8, (i32x8, i32x8) -> i32x8 |
|---|---|
See also: shuffle (compile-time indices), gather (pointer-indexed), shuffle_bytes (byte-level, 16-byte table, cross-platform).
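The wrap-around behaviour of permute_runtime can be checked with a short Python reference model. It implements only the stated rule result[k] = table[indices[k] & 0x7]; it is a sketch for reasoning about index values, not generated code.

```python
def permute_runtime(table, indices):
    """Model of the 8-lane runtime permute: only the low 3 index bits count."""
    if len(table) != 8 or len(indices) != 8:
        raise ValueError("permute_runtime works on 8-lane vectors")
    return [table[i & 0x7] for i in indices]


table = [10, 11, 12, 13, 14, 15, 16, 17]
# 8 wraps to lane 0, 13 wraps to lane 5, and -1 (all bits set) wraps to lane 7.
print(permute_runtime(table, [0, 5, 8, 13, -1, 7, 2, 3]))
# [10, 15, 10, 15, 17, 17, 12, 13]
```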
select
Per-lane conditional select. Where the mask is true, take from a; where false, take from b.
let result: f32x8 = select(mask, a, b);
movemask
Extract a bitmask from a boolean vector (typically a comparison result) into a scalar i32. Each bit corresponds to the sign bit of one lane. x86 only -- not available on ARM.
let bits: i32 = movemask(cmp_result);
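A Python reference model makes the bit layout concrete. The sketch assumes lane k maps to bit k, which is the convention of the x86 movmsk instructions; the text above does not state the bit order, so treat that as an assumption.

```python
def movemask(lanes):
    """Pack one bit per lane into an integer (assumed: lane k -> bit k)."""
    bits = 0
    for k, lane in enumerate(lanes):
        if lane:
            bits |= 1 << k
    return bits


def first_set_lane(bits):
    """Index of the lowest set bit, or -1 when no lane matched."""
    if bits == 0:
        return -1
    return (bits & -bits).bit_length() - 1


cmp_result = [False, True, False, False, True, False, False, False]
bits = movemask(cmp_result)
print(bits)                  # 18 (0b10010)
print(first_set_lane(bits))  # 1
```

Finding the first matching lane this way is a common use of the mask in search-style kernels.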
Conversion
Scalar Casts
| Intrinsic | Description |
|---|---|
to_f32(x) | Convert to f32 |
to_f64(x) | Convert to f64 |
to_i32(x) | Convert to i32 |
to_i64(x) | Convert to i64 |
let f: f32 = to_f32(i);
let n: i32 = to_i32(x);
Widening Conversions
Widen narrow integer lanes to wider float or integer lanes. Only the first N lanes of the input are consumed.
| Intrinsic | Input | Output |
|---|---|---|
widen_i8_f32x4(v) | i8x16 | f32x4 |
widen_u8_f32x4(v) | u8x16 | f32x4 |
widen_i8_f32x8(v) | i8x16 | f32x8 |
widen_u8_f32x8(v) | u8x16 | f32x8 |
widen_i8_f32x16(v) | i8x16 | f32x16 |
widen_u8_f32x16(v) | u8x16 | f32x16 |
widen_u8_i32x4(v) | u8x16 | i32x4 |
widen_u8_i32x8(v) | u8x16 | i32x8 |
widen_u8_i32x16(v) | u8x16 | i32x16 |
let pixels: f32x8 = widen_u8_f32x8(raw_bytes);
Lane-offset variants
The _4, _8, _12 suffixes select which 4 bytes of the input to widen, eliminating the need for a shuffle before widening:
| Intrinsic | Input | Output | Bytes used |
|---|---|---|---|
widen_u8_f32x4_4(v) | u8x16 | f32x4 | 4-7 |
widen_u8_f32x4_8(v) | u8x16 | f32x4 | 8-11 |
widen_u8_f32x4_12(v) | u8x16 | f32x4 | 12-15 |
widen_i8_f32x4_4(v) | i8x16 | f32x4 | 4-7 |
widen_i8_f32x4_8(v) | i8x16 | f32x4 | 8-11 |
widen_i8_f32x4_12(v) | i8x16 | f32x4 | 12-15 |
widen_u8_i32x4_4(v) | u8x16 | i32x4 | 4-7 |
widen_u8_i32x4_8(v) | u8x16 | i32x4 | 8-11 |
widen_u8_i32x4_12(v) | u8x16 | i32x4 | 12-15 |
Process all 16 bytes of a u8x16 as 4 groups of f32x4 without any shuffles:
let f0: f32x4 = widen_u8_f32x4(v) // bytes 0-3
let f1: f32x4 = widen_u8_f32x4_4(v) // bytes 4-7
let f2: f32x4 = widen_u8_f32x4_8(v) // bytes 8-11
let f3: f32x4 = widen_u8_f32x4_12(v) // bytes 12-15
Narrowing Conversions
Convert wider lanes to narrower lanes, with clamping and rounding.
| Intrinsic | Input | Output |
|---|---|---|
narrow_f32x4_i8(v) | f32x4 | i8 (4 bytes) |
let packed = narrow_f32x4_i8(float_pixels);
Multiply-Add Byte Pairs
Multiply unsigned bytes by signed bytes and add adjacent pairs. x86 only.
| Intrinsic | Signature | Description |
|---|---|---|
maddubs_i16(a, b) | (u8x16, i8x16) -> i16x8 | Multiply and add adjacent pairs to 16-bit |
maddubs_i32(a, b) | (u8x16, i8x16) -> i32x4 | Multiply and add adjacent quads to 32-bit |
let products: i16x8 = maddubs_i16(unsigned_bytes, signed_weights);
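The pairing in maddubs_i16 is easy to misread, so here is a Python reference model of it. The saturation to the signed 16-bit range follows the x86 PMADDUBSW instruction; the text above does not spell it out, so the sketch assumes the intrinsic inherits it.

```python
def maddubs_i16(a, b):
    """a: 16 unsigned bytes, b: 16 signed bytes -> 8 signed 16-bit sums.

    Output lane k is a[2k]*b[2k] + a[2k+1]*b[2k+1], clamped to the i16 range.
    """
    if len(a) != 16 or len(b) != 16:
        raise ValueError("expected 16 lanes in each input")
    out = []
    for k in range(0, 16, 2):
        pair_sum = a[k] * b[k] + a[k + 1] * b[k + 1]
        out.append(max(-32768, min(32767, pair_sum)))
    return out


unsigned_bytes = [1, 2] + [0] * 14
signed_weights = [3, -4] + [0] * 14
print(maddubs_i16(unsigned_bytes, signed_weights)[0])  # -5

# Two maximal products overflow i16 and clamp.
print(maddubs_i16([255] * 16, [127] * 16)[0])  # 32767
```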
vdot_i32
Signed integer dot product: multiplies groups of 4 i8 pairs and sums each group into one i32 lane. ARM only -- requires --dotprod flag (ARMv8.2-A dot product extension). Maps to NEON sdot.
let dot: i32x4 = vdot_i32(activations, weights);
acc = acc .+ vdot_i32(a, b); // accumulate explicitly
| Signature | (i8x16, i8x16) -> i32x4 |
|---|---|
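As a reference for the grouping, this Python sketch models vdot_i32 as described: output lane k is the sum of the four products taken from input lanes 4k through 4k+3. It is a model for checking results on the host, not the NEON lowering.

```python
def vdot_i32(a, b):
    """(i8x16, i8x16) -> i32x4: one 4-element dot product per output lane."""
    if len(a) != 16 or len(b) != 16:
        raise ValueError("expected 16 lanes in each input")
    return [
        sum(a[4 * lane + k] * b[4 * lane + k] for k in range(4))
        for lane in range(4)
    ]


activations = list(range(1, 17))  # 1..16
weights = [1] * 16
print(vdot_i32(activations, weights))  # [10, 26, 42, 58]
```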
I8MM Matrix Multiply
Matrix multiply-accumulate on int8 data. ARM only -- requires --i8mm flag (ARMv8.6-A I8MM extension). Available on Cortex-A78+, Apple M1+.
| Intrinsic | Signature | Description |
|---|---|---|
smmla_i32(acc, a, b) | (i32x4, i8x16, i8x16) -> i32x4 | Signed x signed |
ummla_i32(acc, a, b) | (i32x4, u8x16, u8x16) -> i32x4 | Unsigned x unsigned |
usmmla_i32(acc, a, b) | (i32x4, u8x16, i8x16) -> i32x4 | Unsigned x signed |
The accumulator is the first argument. Each instruction performs a 2x8 x 8x2 matrix multiply and adds the result to the accumulator. Use splat(0) as accumulator for the first iteration.
let zero: i32x4 = splat(0);
let result: i32x4 = smmla_i32(zero, activations, weights);
// accumulate over multiple chunks:
acc = smmla_i32(acc, next_a, next_b);
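The operand layout of the matrix multiply is the part most often gotten wrong. The Python model below follows the layout of the ARM SMMLA instruction: a holds a 2x8 matrix row by row, b holds two 8-element rows that act as the columns of the 8x2 operand, and the accumulator is a 2x2 matrix stored row by row. It assumes the Eä intrinsic passes lanes through unchanged.

```python
def smmla_i32(acc, a, b):
    """acc[2*i + j] += dot(a[8*i : 8*i + 8], b[8*j : 8*j + 8]) for i, j in {0, 1}."""
    if len(acc) != 4 or len(a) != 16 or len(b) != 16:
        raise ValueError("expected i32x4 accumulator and two 16-lane inputs")
    out = list(acc)
    for i in range(2):
        for j in range(2):
            out[2 * i + j] += sum(a[8 * i + k] * b[8 * j + k] for k in range(8))
    return out


a = [1] * 8 + [2] * 8   # rows of the 2x8 operand
b = [1] * 8 + [3] * 8   # columns of the 8x2 operand, stored as two rows
print(smmla_i32([0, 0, 0, 0], a, b))  # [8, 24, 16, 48]
```

The ummla_i32 and usmmla_i32 variants differ only in how the input bytes are interpreted (unsigned or mixed), so the same index arithmetic applies.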
Widening Multiply
Multiply narrow integer lanes and produce wider output. ARM only (base NEON). Input types are 64-bit NEON vectors.
| Intrinsic | Signature | Description |
|---|---|---|
wmul_i16(a, b) | (i8x8, i8x8) -> i16x8 | Signed 8-bit to 16-bit |
wmul_u16(a, b) | (u8x8, u8x8) -> u16x8 | Unsigned 8-bit to 16-bit |
wmul_i32(a, b) | (i16x4, i16x4) -> i32x4 | Signed 16-bit to 32-bit |
wmul_u32(a, b) | (u16x4, u16x4) -> u32x4 | Unsigned 16-bit to 32-bit |
let wide: i16x8 = wmul_i16(bytes_a, bytes_b);
Absolute Difference
Element-wise |a - b|. ARM only (base NEON). Maps to a single instruction (sabd/uabd).
| Intrinsic | Supported Types |
|---|---|
abs_diff(a, b) | i8x16, u8x16, i16x8, u16x8, i32x4, u32x4 |
let diff: u8x16 = abs_diff(frame_a, frame_b);
Saturating Arithmetic
Addition and subtraction that clamp to the type's min/max instead of wrapping on overflow. Cross-platform (ARM NEON + x86 SSE2).
| Intrinsic | Supported Types |
|---|---|
sat_add(a, b) | i8x16, u8x16, i16x8, u16x8 |
sat_sub(a, b) | i8x16, u8x16, i16x8, u16x8 |
Signed vs unsigned saturation is determined by the element type. Both arguments must have the same type.
let bright: u8x16 = sat_add(pixels, boost); // clamps at 255, never wraps
let dark: u8x16 = sat_sub(pixels, reduce); // clamps at 0, never wraps
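The clamping behaviour can be modelled in a few lines of Python. The sketch takes the lane range as explicit bounds, since Python integers carry no element type; U8 and I8 are illustrative constants, not Eä names.

```python
U8 = (0, 255)
I8 = (-128, 127)


def _clamp(x, lo, hi):
    return max(lo, min(hi, x))


def sat_add(a, b, lo, hi):
    """Lane-wise add that clamps to [lo, hi] instead of wrapping."""
    return [_clamp(x + y, lo, hi) for x, y in zip(a, b)]


def sat_sub(a, b, lo, hi):
    """Lane-wise subtract that clamps to [lo, hi] instead of wrapping."""
    return [_clamp(x - y, lo, hi) for x, y in zip(a, b)]


print(sat_add([250, 10], [10, 10], *U8))  # [255, 20]
print(sat_sub([5, 200], [10, 100], *U8))  # [0, 100]
print(sat_add([120], [10], *I8))          # [127]
```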
shuffle_bytes
Byte-level table lookup: each byte in indices selects a byte from table. Cross-platform. x86: SSSE3 pshufb. ARM: NEON tbl. Out-of-range indices (>15) zero the lane on both platforms.
let result: u8x16 = shuffle_bytes(table, indices);
| Signature | (u8x16, u8x16) -> u8x16 |
|---|---|
Rounding & Packing
| Intrinsic | Signature | Description | Platform |
|---|---|---|---|
round_f32x4_i32x4 | (f32x4) -> i32x4 | Round-to-nearest-even. x86: cvtps2dq. ARM: fcvtns. | cross-platform |
pack_sat_i32x4 | (i32x4, i32x4) -> i16x8 | Saturating narrow. x86: packssdw. ARM: sqxtn. | cross-platform |
pack_sat_i16x8 | (i16x8, i16x8) -> i8x16 | Saturating narrow. x86: packsswb. ARM: sqxtn. | cross-platform |
round_f32x8_i32x8 | (f32x8) -> i32x8 | Round-to-nearest-even float to integer. x86: vcvtps2dq (AVX2). | x86-only |
pack_sat_i32x8 | (i32x8, i32x8) -> i16x16 | Saturating narrow two i32x8 into i16x16. x86: vpackssdw (AVX2). | x86-only |
pack_sat_i16x16 | (i16x16, i16x16) -> i8x32 | Saturating narrow two i16x16 into i8x32. x86: vpacksswb (AVX2). | x86-only |
pack_sat_i32x8 and pack_sat_i16x16 emit a vpermq fixup after the AVX2 pack to produce sequential output [a0..a7, b0..b7], matching what the 128-bit variants produce.
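The sequential output order can be stated as a Python reference model. The sketch covers the 128-bit pack_sat_i32x4 case: all lanes of a come first, then all lanes of b, each clamped to the i16 range. The 256-bit variants are described as producing the same ordering with twice as many lanes.

```python
I16_MIN, I16_MAX = -32768, 32767


def pack_sat_i32x4(a, b):
    """(i32x4, i32x4) -> i16x8 as [a0..a3, b0..b3] with signed saturation."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("expected two 4-lane inputs")
    return [max(I16_MIN, min(I16_MAX, x)) for x in list(a) + list(b)]


print(pack_sat_i32x4([1, 40000, -40000, 7], [8, 9, 10, 11]))
# [1, 32767, -32768, 7, 8, 9, 10, 11]
```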
Debug
println
Print a value to stdout. Accepts scalars (i32, i64, u8, u16, u32, u64, f32, f64, bool), string literals, and vector types. Lowers to C printf. No format strings.
println(42);
println(3.14);
println("hello");
println(my_vector);
CLI Reference
The ea binary provides three commands: compile (default), bind, and inspect.
Compile
ea <file.ea> [flags]
Compile an Eä source file to a native object file (.o) by default.
Flags
| Flag | Effect |
|---|---|
-o <name> | Link the object file into an executable via cc |
--lib | Produce a shared library (.so/.dll) and metadata (.ea.json) |
--opt-level=N | Optimization level 0--3 (default: 3) |
--avx512 | Enable AVX-512 vector types and intrinsics. Errors on ARM targets |
--dotprod | Enable ARMv8.2-A dot product extension (vdot_i32). ARM targets only |
--i8mm | Enable ARMv8.6-A I8MM extension (smmla_i32, ummla_i32, usmmla_i32). ARM targets only |
--target=CPU | LLVM CPU name, e.g. skylake, znver3, native (default: native) |
--target-triple=T | Cross-compile to a different architecture, e.g. aarch64-unknown-linux-gnu |
--emit-llvm | Write LLVM IR to a .ll file and print it to stdout |
--emit-asm | Write assembly to a .s file |
--header | Generate a C header (.h) for the exported functions |
--emit-ast | Print the parsed AST. Does not require LLVM |
--emit-tokens | Print the token stream. Does not require LLVM |
--help / -h | Print usage information |
--version / -V | Print compiler version |
Examples
# Compile to object file
ea kernel.ea
# Compile and link to executable
ea kernel.ea -o kernel
# Build shared library for Python/Rust/C++ consumption
ea kernel.ea --lib
# Cross-compile for ARM
ea kernel.ea --lib --target-triple=aarch64-unknown-linux-gnu
# Emit LLVM IR for debugging
ea kernel.ea --emit-llvm
# Compile with AVX-512 support
ea kernel.ea --lib --avx512
# Cross-compile for ARM with dot product extension
ea kernel.ea --lib --target-triple=aarch64-unknown-linux-gnu --dotprod
# Cross-compile for ARM with I8MM matrix multiply
ea kernel.ea --lib --target-triple=aarch64-unknown-linux-gnu --i8mm
# Generate C header
ea kernel.ea --header
Compiler status output goes to stderr, so --emit-llvm stdout is clean for piping.
Bind
ea bind <file.ea> --python [--rust] [--cpp] [--pytorch] [--cmake]
Generate language bindings from a compiled kernel. At least one language flag is required.
Requires the .ea.json metadata file produced by ea <file.ea> --lib. Run the --lib compile first.
Language Flags
| Flag | Output | Description |
|---|---|---|
--python | <name>.py | Python wrapper using ctypes |
--rust | <name>.rs | Rust FFI bindings |
--cpp | <name>.hpp | C++ header with inline wrappers |
--pytorch | <name>_torch.py | PyTorch custom op wrapper |
--cmake | CMakeLists.txt | CMake build file for C++ integration |
Example
# Full workflow: compile, then generate Python bindings
ea kernel.ea --lib
ea bind kernel.ea --python
Inspect
ea inspect <file.ea> [target flags]
Post-optimization analysis of the compiled kernel. Shows instruction mix, loop structure, vector width usage, and register pressure. Accepts the same target flags as compile (--target, --target-triple, --avx512, --dotprod, --i8mm).
Example
ea inspect kernel.ea
ea inspect kernel.ea --avx512
ea inspect kernel.ea --target-triple=aarch64-unknown-linux-gnu
Print Target
ea --print-target
Print the resolved native CPU name for the current machine. Useful for understanding which CPU features the compiler will target by default.
Python API
The ea Python package (pip install ea-compiler) provides a high-level interface for compiling and loading Eä kernels.
Functions
ea.load
ea.load(path, *, target="native", opt_level=3, avx512=False) -> KernelModule
Compile an .ea file to a shared library and load it. Returns a KernelModule object with each exported function available as a method.
| Parameter | Type | Default | Description |
|---|---|---|---|
path | str | required | Path to the .ea source file |
target | str | "native" | LLVM CPU name (e.g. "skylake", "native") |
opt_level | int | 3 | Optimization level 0--3 |
avx512 | bool | False | Enable AVX-512 types and intrinsics |
import ea
import numpy as np
k = ea.load("kernel.ea")
a = np.array([1.0, 2.0, 3.0], dtype=np.float32)
b = np.zeros(3, dtype=np.float32)
k.my_func(a, b, len(a))
ea.compile
ea.compile(path, *, emit_asm=False, emit_llvm=False, target="native", opt_level=3, avx512=False, lib=True) -> Path
Compile an .ea file without loading it. Returns the path to the output file. Useful when you need the .so or .ea.json for another tool.
| Parameter | Type | Default | Description |
|---|---|---|---|
path | str | required | Path to the .ea source file |
emit_asm | bool | False | Also write a .s assembly file |
emit_llvm | bool | False | Also write a .ll LLVM IR file |
target | str | "native" | LLVM CPU name |
opt_level | int | 3 | Optimization level 0--3 |
avx512 | bool | False | Enable AVX-512 |
lib | bool | True | Produce .so + .ea.json |
ea.clear_cache
ea.clear_cache(path=None)
Clear the compilation cache. If path is given, clear only the cache for that .ea file. If None, clear the entire cache directory.
ea.compiler_version
ea.compiler_version() -> str
Return the version string of the ea compiler binary.
ea.__version__
ea.__version__ -> str
The version of the ea Python package.
Exceptions
ea.CompileError
Raised when compilation fails. Inherits from RuntimeError.
| Attribute | Type | Description |
|---|---|---|
stderr | str | Full compiler error output |
exit_code | int | Compiler process exit code |
try:
k = ea.load("broken.ea")
except ea.CompileError as e:
print(e.stderr)
print(e.exit_code)
Caching
Compiled shared libraries are cached in a __eacache__/ directory next to the .ea source file. The cache key includes the CPU name and compiler version:
__eacache__/{cpu}-{version}/kernel.so
The cache is invalidated by file modification time (mtime). If the source file is newer than the cached library, ea.load() recompiles automatically.
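The cache layout and the mtime rule can be sketched in Python as follows. cache_path and needs_recompile are hypothetical helpers written from the description above; they are not functions of the ea package, and the real implementation may differ in detail. The CPU name and version used in the example are placeholders.

```python
from pathlib import Path


def cache_path(source, cpu, version, ext=".so"):
    """Build __eacache__/{cpu}-{version}/<stem><ext> next to the source file."""
    source = Path(source)
    return source.parent / "__eacache__" / f"{cpu}-{version}" / (source.stem + ext)


def needs_recompile(source, cached):
    """True when there is no cached library or the source is newer than it."""
    source, cached = Path(source), Path(cached)
    if not cached.exists():
        return True
    return source.stat().st_mtime > cached.stat().st_mtime


# Placeholder CPU name and version, for illustration only.
print(cache_path("proj/kernel.ea", "znver4", "1.7.0"))
```

Because the key includes the CPU name and compiler version, changing the target or upgrading the compiler lands in a different directory and leaves older entries untouched until ea.clear_cache() removes them.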
Length Collapsing
When a function parameter named n, len, length, count, size, or num appears immediately after a pointer parameter and has an integer type, the Python binding automatically fills it from the array's length. You do not need to pass it explicitly.
// Eä source
export func scale(data: *mut f32, n: i32, factor: f32) { ... }
# Python: n is auto-filled from len(data)
k.scale(data, factor=2.0)
Output Allocation
Parameters annotated with out and a [cap: ...] clause are automatically allocated by the Python binding and returned as the function's result.
// Eä source
export func transform(input: *f32, n: i32, out result: *mut f32 [cap: n]) { ... }
# Python: result is allocated and returned
output = k.transform(input_array)
If [count: path] is also specified, the returned array is trimmed to the actual output length.
Thread Safety
Loaded kernel modules and their functions are safe for concurrent use from multiple threads. The compiled code is stateless -- all memory is caller-provided via arguments.
Binding Annotations
Eä generates language bindings from compiled kernel metadata. This page documents the annotation syntax, the .ea.json metadata format, and the available binding generators.
Output Annotations
Mark a parameter as caller-allocated output with the out keyword and a capacity clause:
export func filter(
input: *f32,
n: i32,
out result: *mut f32 [cap: n, count: out_count],
out out_count: *mut i32 [cap: 1],
) { ... }
Syntax
out name: *mut T [cap: <expr>]
out name: *mut T [cap: <expr>, count: <path>]
| Clause | Required | Description |
|---|---|---|
cap | Yes | Number of elements to allocate. Can reference other parameters (e.g. n, width * height) |
count | No | Parameter or expression giving the actual output length. The binding trims the returned array to this length |
In generated bindings, out parameters are not part of the function signature. They are allocated internally and returned as the function's result.
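To make the allocate, call, and trim sequence concrete, here is a Python sketch of what a binding for the filter signature might do. It is an illustration under assumptions: call_filter is a hypothetical wrapper, and fake_filter_kernel is a pure-Python stand-in for the compiled function so the example runs without a shared library.

```python
import ctypes


def call_filter(kernel, values):
    """Wrap a kernel with C-style signature (input, n, result, out_count)."""
    n = len(values)
    inp = (ctypes.c_float * n)(*values)
    result = (ctypes.c_float * n)()       # allocated from [cap: n]
    out_count = (ctypes.c_int32 * 1)()    # allocated from [cap: 1]
    kernel(inp, n, result, out_count)
    return list(result[: out_count[0]])   # trimmed per [count: out_count]


def fake_filter_kernel(inp, n, result, out_count):
    """Stand-in for a compiled kernel: keeps the positive inputs."""
    kept = 0
    for i in range(n):
        if inp[i] > 0.0:
            result[kept] = inp[i]
            kept += 1
    out_count[0] = kept


print(call_filter(fake_filter_kernel, [1.0, -2.0, 3.0]))  # [1.0, 3.0]
```

The caller sees a function of one argument that returns an array, which is the shape the generated bindings are described as exposing.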
Length Collapsing
When a pointer parameter is followed immediately by an integer parameter whose name matches one of the recognized patterns, the binding generators automatically fill the integer from the array's length.
Recognized Names
n, len, length, count, size, num
Rules
- The integer parameter must appear immediately after a pointer parameter
- The integer parameter must have an integer type (i32, i64, u32, etc.)
- The pointer and integer are collapsed into a single array argument in the binding
// These two parameters collapse into one array argument:
export func sum(data: *f32, n: i32) -> f32 { ... }
# Python: just pass the array, n is filled automatically
result = k.sum(my_array)
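The collapsing rules can be expressed as a check over a parameter list. The Python sketch below is a hypothetical reimplementation of the rules above, using the argument shape from the .ea.json metadata; the set of integer types is assumed to be the scalar integer types from the type table.

```python
RECOGNIZED_NAMES = {"n", "len", "length", "count", "size", "num"}
INTEGER_TYPES = {"i8", "u8", "i16", "u16", "i32", "u32", "i64", "u64"}


def collapsed_lengths(args):
    """Return the names of integer parameters a binding would auto-fill.

    args is a list of {"name": ..., "type": ...} dicts in declaration order.
    """
    collapsed = []
    for prev, cur in zip(args, args[1:]):
        follows_pointer = prev["type"].startswith("*")
        is_integer = cur["type"] in INTEGER_TYPES
        if follows_pointer and is_integer and cur["name"] in RECOGNIZED_NAMES:
            collapsed.append(cur["name"])
    return collapsed


scale_args = [
    {"name": "data", "type": "*mut f32"},
    {"name": "n", "type": "i32"},
    {"name": "factor", "type": "f32"},
]
print(collapsed_lengths(scale_args))  # ['n']
```

A length parameter that is separated from its pointer by another argument is not collapsed and must be passed explicitly.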
Metadata Format (.ea.json)
Compiling with --lib produces a .ea.json file alongside the shared library. This file describes the exported API and is consumed by ea bind.
{
"library": "libkernel.so",
"exports": [
{
"name": "scale",
"args": [
{"name": "data", "type": "*mut f32"},
{"name": "n", "type": "i32"},
{"name": "factor", "type": "f32"}
],
"return_type": null
}
],
"structs": [
{
"name": "Point",
"fields": [
{"name": "x", "type": "f32"},
{"name": "y", "type": "f32"}
]
}
]
}
Output-annotated parameters include additional fields:
{
"name": "result",
"type": "*mut f32",
"output": true,
"cap": "n",
"count": "out_count"
}
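Since the metadata is plain JSON, other tools can read it with a standard parser. The Python sketch below loads a trimmed copy of the example above and prints one signature line per export; describe_exports and output_params are illustrative helpers, not part of the ea package.

```python
import json

METADATA = """
{
  "library": "libkernel.so",
  "exports": [
    {
      "name": "scale",
      "args": [
        {"name": "data", "type": "*mut f32"},
        {"name": "n", "type": "i32"},
        {"name": "factor", "type": "f32"}
      ],
      "return_type": null
    }
  ]
}
"""


def describe_exports(metadata):
    """Render each export as name(arg: type, ...) with an optional return type."""
    lines = []
    for export in metadata["exports"]:
        params = ", ".join(f'{a["name"]}: {a["type"]}' for a in export["args"])
        ret = export["return_type"]
        suffix = f" -> {ret}" if ret else ""
        lines.append(f'{export["name"]}({params}){suffix}')
    return lines


def output_params(export):
    """Names of arguments flagged with "output": true."""
    return [a["name"] for a in export["args"] if a.get("output", False)]


meta = json.loads(METADATA)
print(describe_exports(meta))  # ['scale(data: *mut f32, n: i32, factor: f32)']
```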
Binding Generators
Generate bindings with ea bind <file.ea> and one or more language flags. The .ea.json file must exist (run ea <file.ea> --lib first).
| Flag | Output File | Description |
|---|---|---|
--python | <name>.py | Python wrapper using ctypes. Arrays as NumPy ndarrays. |
--rust | <name>.rs | Rust extern "C" declarations with safe wrapper functions |
--cpp | <name>.hpp | C++ header-only bindings with std::span parameters |
--pytorch | <name>_torch.py | PyTorch custom op with torch.Tensor parameters |
--cmake | CMakeLists.txt | CMake project for linking the shared library in C++ |
All generated bindings use C ABI function pointers loaded from the shared library at runtime.
ARM / NEON
Eä supports AArch64 with NEON vector instructions. This page documents the differences from x86 and how to write portable kernels.
Vector Width
ARM NEON provides 128-bit vector registers, plus 64-bit D-registers for narrower operations. Available vector types on ARM:
| 128-bit (standard) | 64-bit (NEON D-registers) | Not Supported |
|---|---|---|
f32x4 | i8x8, u8x8 | f32x8, f32x16 |
f64x2 | i16x4, u16x4 | f64x4 |
i32x4, u32x4 | i32x2 | i32x8, i32x16 |
i16x8, u16x8 | | i16x16 |
i8x16, u8x16 | | i8x32, u8x32 |
64-bit vector types are ARM-only. Using them on x86 produces a compile error. 256-bit and 512-bit types are x86-only and produce a compile error on ARM. This is intentional -- Eä does not silently fall back to scalar code.
Unavailable Intrinsics
The following intrinsics are x86-only and produce a compile error on ARM:
| Intrinsic | Reason |
|---|---|
movemask(v) | No ARM equivalent for extracting lane sign bits to a bitmask |
gather(ptr, indices) | No hardware gather support in NEON |
scatter(ptr, indices, values) | AVX-512 only |
maddubs_i16(a, b) | x86 PMADDUBSW instruction, no NEON equivalent |
maddubs_i32(a, b) | x86 specific |
round_f32x8_i32x8(a) | AVX2 (256-bit); use round_f32x4_i32x4 on ARM |
pack_sat_i32x8(a, b) | AVX2 (256-bit); use pack_sat_i32x4 on ARM |
pack_sat_i16x16(a, b) | AVX2 (256-bit); use pack_sat_i16x8 on ARM |
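All other intrinsics (loads, stores, math, reductions, splat, select, shuffle, conversions) work on ARM. One further exception is permute_runtime, which requires AVX2 and produces a compile-time error on ARM.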
ARM-Specific Intrinsics
Dot Product (ARMv8.2-A, --dotprod)
| Intrinsic | Signature | Description |
|---|---|---|
vdot_i32(a, b) | (i8x16, i8x16) -> i32x4 | Signed dot product, groups of 4 per lane. Maps to NEON sdot. |
I8MM Matrix Multiply (ARMv8.6-A, --i8mm)
| Intrinsic | Signature | Description |
|---|---|---|
smmla_i32(acc, a, b) | (i32x4, i8x16, i8x16) -> i32x4 | Signed x signed 2x8 x 8x2 matrix multiply-accumulate |
ummla_i32(acc, a, b) | (i32x4, u8x16, u8x16) -> i32x4 | Unsigned x unsigned matrix multiply-accumulate |
usmmla_i32(acc, a, b) | (i32x4, u8x16, i8x16) -> i32x4 | Unsigned x signed matrix multiply-accumulate |
The accumulator is explicit. First call uses splat(0) for zero-init. Available on Cortex-A78+, Apple M1+.
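To check kernel output on a machine without I8MM, a plain Python reference for the signed variant can help. This is a sketch, not Eä's implementation: it assumes the lane layout of the underlying ARM instruction (both inputs read as two rows of 8, the 2x2 result stored row-major) and it does not model i32 wraparound. The name smmla_ref is hypothetical.

```python
def smmla_ref(acc, a, b):
    """2x8 by 8x2 int8 multiply-accumulate into a 2x2 int32 tile.

    acc: 4 ints, the 2x2 accumulator in row-major order.
    a, b: 16 ints each, read as 2 rows of 8 (assumed layout).
    """
    assert len(acc) == 4 and len(a) == 16 and len(b) == 16
    out = list(acc)
    for i in range(2):        # row of a
        for j in range(2):    # row of b, i.e. a column of the 8x2 operand
            out[2 * i + j] += sum(a[8 * i + k] * b[8 * j + k] for k in range(8))
    return out


zero = [0, 0, 0, 0]           # plays the role of the splat(0) first accumulator
a = list(range(16))           # rows: 0..7 and 8..15
b = [1] * 8 + [2] * 8         # rows: all ones, all twos
tile = smmla_ref(zero, a, b)
print(tile)                   # [28, 56, 92, 184]
```

Feeding tile back in as acc on the next call mirrors the explicit-accumulator pattern described above.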
Absolute Difference (base NEON)
| Intrinsic | Signature | Description |
|---|---|---|
abs_diff(a, b) | (T, T) -> T | Element-wise absolute difference |
Supported types: i8x16, u8x16, i16x8, u16x8, i32x4, u32x4. Maps to NEON sabd/uabd (one instruction). No x86 equivalent -- use max(a .- b, b .- a) explicitly on x86.
Widening Multiply (base NEON)
| Intrinsic | Signature | Description |
|---|---|---|
wmul_i16(a, b) | (i8x8, i8x8) -> i16x8 | Signed widening multiply |
wmul_u16(a, b) | (u8x8, u8x8) -> u16x8 | Unsigned widening multiply |
wmul_i32(a, b) | (i16x4, i16x4) -> i32x4 | Signed widening multiply |
wmul_u32(a, b) | (u16x4, u16x4) -> u32x4 | Unsigned widening multiply |
Input types are 64-bit NEON vectors (D-registers). Output is 128-bit. Maps to NEON smull/umull. No x86 equivalent.
The --dotprod and --i8mm flags enable their respective extensions. Using an intrinsic without its flag produces a compile error with a hint to add it.
Cross-Platform Intrinsics
These intrinsics work on both x86 and ARM with identical semantics:
| Intrinsic | Signature | x86 instruction | ARM instruction |
|---|---|---|---|
shuffle_bytes(table, idx) | (u8x16, u8x16) -> u8x16 | SSSE3 pshufb | NEON tbl |
sat_add(a, b) | (T, T) -> T | SSE2 padds/paddus | NEON sqadd/uqadd |
sat_sub(a, b) | (T, T) -> T | SSE2 psubs/psubus | NEON sqsub/uqsub |
sat_add/sat_sub support i8x16, u8x16, i16x8, u16x8. Signed vs unsigned saturation is determined by the element type. No feature flags required (base SSE2 and NEON).
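The difference between saturating and wrapping arithmetic is easy to see on single lanes. The sketch below models one u8 or i8 lane in plain Python; the helper names are hypothetical and only illustrate the clamping behaviour described above.

```python
def sat_add_u8(a, b):
    # Unsigned saturation: results above 255 clamp to 255.
    return min(a + b, 255)


def sat_sub_u8(a, b):
    # Unsigned saturation: results below 0 clamp to 0.
    return max(a - b, 0)


def sat_add_i8(a, b):
    # Signed saturation: results clamp to the i8 range [-128, 127].
    return max(-128, min(127, a + b))


def wrap_add_u8(a, b):
    # Ordinary (non-saturating) u8 addition wraps modulo 256.
    return (a + b) & 0xFF


print(sat_add_u8(250, 10), wrap_add_u8(250, 10))   # 255 4
print(sat_sub_u8(5, 10))                           # 0
print(sat_add_i8(100, 100), sat_add_i8(-100, -100))  # 127 -128
```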
Cross-Compilation
Compile for ARM from an x86 host:
ea kernel.ea --lib --target-triple=aarch64-unknown-linux-gnu
For intrinsics requiring hardware extensions, add the appropriate flag:
ea kernel.ea --lib --target-triple=aarch64-unknown-linux-gnu --dotprod
When compiling natively on an ARM machine, no target flag is needed. Only the extension flags remain (--dotprod for dot product intrinsics, --i8mm for matrix multiply intrinsics):
ea kernel.ea --lib
ea kernel.ea --lib --dotprod # for vdot_i32
ea kernel.ea --lib --i8mm # for smmla_i32, ummla_i32, usmmla_i32
The --avx512 flag is rejected on ARM targets with a compile error. The --i8mm and --dotprod flags are rejected on x86 targets.
Writing Portable Kernels
Strategy 1: Use 128-bit types everywhere
Use f32x4, i32x4, etc. These work on both x86 (SSE) and ARM (NEON).
kernel scale(data: *mut f32, factor: f32) range(n) step(4) {
let v: f32x4 = load(data, i);
let f: f32x4 = splat(factor);
store(data, i, v .* f);
}
This sacrifices throughput on x86 (which could use f32x8 with AVX2) but runs everywhere.
Strategy 2: Separate kernel files
Write platform-specific kernels in separate files and load the right one at runtime:
# kernel_x86.ea -- uses f32x8 for AVX2
# kernel_arm.ea -- uses f32x4 for NEON
import platform
import ea
if platform.machine() == "aarch64":
    k = ea.load("kernel_arm.ea")
else:
    k = ea.load("kernel_x86.ea")
Both files export the same function signatures, so the calling code does not change.
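The platform check can be wrapped in a small helper so the file choice is testable without loading anything. This is a sketch: kernel_file_for is a hypothetical name, the file names follow the example above, and the "arm64" spelling is included because some operating systems report 64-bit ARM that way instead of "aarch64".

```python
import platform

# Machine strings treated as 64-bit ARM (assumption: everything else is x86).
_ARM_NAMES = {"aarch64", "arm64"}


def kernel_file_for(machine=None):
    """Pick the kernel source file for a machine string.

    With no argument, the current host's platform.machine() is used.
    """
    machine = (machine or platform.machine()).lower()
    return "kernel_arm.ea" if machine in _ARM_NAMES else "kernel_x86.ea"


print(kernel_file_for("aarch64"))  # kernel_arm.ea
print(kernel_file_for("x86_64"))   # kernel_x86.ea
```

The result can be passed straight to ea.load(), which keeps the rest of the calling code identical on both architectures.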
Strategy 3: Use the kernel construct
The kernel construct with step(N) lets you pick the vector width per file while keeping the loop logic identical. Write two .ea files with different step sizes and vector types, but the same exported function name and parameters.
128-bit Pack and Round on NEON
For 128-bit NEON equivalents of the x86 AVX2 pack/round intrinsics, use the cross-platform 128-bit variants:
- round_f32x4_i32x4: round-to-nearest-even f32x4 to i32x4. x86: cvtps2dq. ARM: fcvtns.
- pack_sat_i32x4: saturating narrow (i32x4, i32x4) -> i16x8. x86: packssdw. ARM: sqxtn.
- pack_sat_i16x8: saturating narrow (i16x8, i16x8) -> i8x16. x86: packsswb. ARM: sqxtn.
The 256-bit variants (round_f32x8_i32x8, pack_sat_i32x8, pack_sat_i16x16) are x86-only and produce a compile error on ARM.
ea bench
Builds an .ea kernel and a C harness, runs the harness pinned to one core (Linux), captures JSONL measurements, wraps them with environment metadata, and (when a baseline exists) diffs against it.
Usage
ea bench <manifest.toml> [--target=CPU] [--avx512|--fp16|--i8mm|--dotprod]
[--opt-level=N] [--update-baseline] [--no-diff]
[--out PATH]
A benchmark is a triple: an .ea kernel (--lib-compatible), a C harness that links against the kernel's shared library, and a manifest TOML that names them.
Manifest schema
*.bench.toml files use this flat schema:
| key | type | required | meaning |
|---|---|---|---|
name | string | yes | Benchmark name (also the kernel's lib name) |
kernel | string (path) | yes | Path to .ea, relative to the manifest |
harness | string (path) | yes | Path to the C harness |
baseline | string (path) | yes | Path to the committed baseline JSON |
arch | array of string | yes | "x86_64" and/or "aarch64" — platforms where the kernel applies |
ea_flags | array of string | no | Extra flags for the kernel build (e.g. ["--fp16"]) |
cc_flags | array of string | no | Extra flags for the harness build (default ["-O2"]) |
Unknown keys are an error. Paths resolve relative to the manifest's parent directory; absolute paths pass through unchanged.
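The schema rules are simple enough to check before invoking the tool. The sketch below validates an already-parsed manifest (a plain dict) against the table above. It is not how ea bench itself validates manifests; check_manifest and the baseline file name are hypothetical.

```python
# Key names and types copied from the manifest table; the checks are a sketch.
REQUIRED = {"name": str, "kernel": str, "harness": str, "baseline": str, "arch": list}
OPTIONAL = {"ea_flags": list, "cc_flags": list}
KNOWN_ARCHES = {"x86_64", "aarch64"}


def check_manifest(manifest):
    """Return a list of problems; an empty list means the dict fits the table."""
    problems = []
    allowed = {**REQUIRED, **OPTIONAL}
    for key in manifest:
        if key not in allowed:
            problems.append(f"unknown key: {key}")
    for key in REQUIRED:
        if key not in manifest:
            problems.append(f"missing required key: {key}")
    for key, kind in allowed.items():
        if key in manifest and not isinstance(manifest[key], kind):
            problems.append(f"{key} must be a {kind.__name__}")
    arches = manifest.get("arch")
    if isinstance(arches, list):
        for arch in arches:
            if arch not in KNOWN_ARCHES:
                problems.append(f"unsupported arch: {arch}")
    return problems


example = {
    "name": "foo",
    "kernel": "foo_bench.ea",
    "harness": "foo_harness.c",
    "baseline": "foo.baseline.json",  # hypothetical file name
    "arch": ["x86_64", "aarch64"],
    "ea_flags": ["--fp16"],
}
print(check_manifest(example))                    # []
print(check_manifest({**example, "threads": 4}))  # ['unknown key: threads']
```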
Harness contract
A harness is a normal C program that links against the kernel's shared library. It must:
- Write JSONL measurements to stdout, one per line. Required keys: kernel (string), median_ns (integer). Optional: p10_ns, p90_ns, n_inner, n_runs. Any additional keys are passed through into the result JSON.
- Send everything else to stderr: banners, verify-OK / verify-FAIL messages, debug output, sink values. ea bench relays harness stderr with a [harness] prefix to its own stderr.
- Exit 0 on success, non-zero on fatal error. ea bench propagates a non-zero exit.
Example measurement line:
{"kernel":"softmax_poly","median_ns":12345,"p10_ns":12200,"p90_ns":12600,"n_inner":200,"n_runs":10}
See benchmarks/v1.11.0/exp_poly_f32_harness.c for a complete template: deterministic LCG-filled input, warmup, median of N runs of M inner calls, volatile sink to defeat dead-code elimination, verify against a reference implementation, JSONL summary at the end.
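While developing a harness it can be handy to check its stdout against the contract before wiring it into a manifest. The sketch below checks one JSONL line for the two required keys. The function name parse_measurement is hypothetical, and the checks reflect only the contract as written above, not ea bench's own parser.

```python
import json


def parse_measurement(line):
    """Parse one harness stdout line and enforce the required keys."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("measurement must be a JSON object")
    if not isinstance(record.get("kernel"), str):
        raise ValueError("kernel must be a string")
    median = record.get("median_ns")
    # bool is a subclass of int in Python, so rule it out explicitly.
    if not isinstance(median, int) or isinstance(median, bool):
        raise ValueError("median_ns must be an integer")
    return record  # optional and extra keys pass through untouched


line = ('{"kernel":"softmax_poly","median_ns":12345,"p10_ns":12200,'
        '"p90_ns":12600,"n_inner":200,"n_runs":10}')
record = parse_measurement(line)
print(record["kernel"], record["median_ns"])  # softmax_poly 12345
```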
Output schema
ea bench emits a single JSON object (to stdout, or --out PATH):
{
"schema_version": 1,
"name": "exp_poly_f32",
"eacompute_version": "1.13.0",
"git_sha": "f2ca320",
"timestamp": "2026-05-14T10:23:00Z",
"env": {
"os": "linux", "arch": "x86_64", "host_cpu": "znver4",
"target_cpu": "native", "target_features": "+avx512f,+avx512vl,+avx512bw",
"opt_level": 3, "pinned": true
},
"measurements": [ /* harness JSONL lines, re-emitted */ ]
}
| field | source |
|---|---|
schema_version | always 1 for this release |
name | manifest name |
eacompute_version | CARGO_PKG_VERSION of the running compiler |
git_sha | git rev-parse --short HEAD at run time; null if not in a git tree |
timestamp | ISO 8601 UTC |
env.host_cpu | LLVM TargetMachine::get_host_cpu_name() |
env.target_cpu, target_features, opt_level | resolved from CLI flags |
env.pinned | true if taskset was used (Linux only) |
Diff & baselines
If manifest.baseline exists, ea bench reads it and prints a per-kernel delta to stderr. The current regression threshold is 10%, warn-only in v1.13.0 — regressions print WARNING: but the process still exits 0.
exp_poly_f32 (x86_64, native, opt=3):
exp_only_libm 12345 ns (baseline 12200 ns, +1.2%)
exp_only_poly 4321 ns (baseline 4350 ns, -0.7%)
softmax_libm 98765 ns (baseline 96000 ns, +2.9%)
softmax_poly 32100 ns (baseline 31500 ns, +1.9%)
WARNING: 0 regressions exceed 10% threshold (warn-only in v1.13.0).
If the baseline's env.arch, env.target_features, or env.opt_level don't match the current run, the diff is skipped and stderr says baseline mismatch: .... Use --update-baseline to refresh.
If the baseline file doesn't exist (first run, or a freshly-added benchmark), stderr says no baseline yet — run with --update-baseline to create one.
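The percentages in the sample output follow from a plain relative delta. The sketch below reproduces them from the numbers shown above. It assumes the delta is (current - baseline) / baseline and that a regression means strictly more than the threshold; the exact rounding and comparison ea bench uses are not documented here, so treat both as assumptions.

```python
THRESHOLD_PCT = 10.0


def delta_pct(current_ns, baseline_ns):
    """Relative change of the current median against the baseline, in percent."""
    return (current_ns - baseline_ns) / baseline_ns * 100.0


def regressions(current, baseline, threshold_pct=THRESHOLD_PCT):
    """Names of kernels whose median grew by more than the threshold."""
    return [name for name, ns in current.items()
            if name in baseline and delta_pct(ns, baseline[name]) > threshold_pct]


current = {"exp_only_libm": 12345, "exp_only_poly": 4321,
           "softmax_libm": 98765, "softmax_poly": 32100}
baseline = {"exp_only_libm": 12200, "exp_only_poly": 4350,
            "softmax_libm": 96000, "softmax_poly": 31500}

for name, ns in current.items():
    print(f"{name:14s} {ns} ns (baseline {baseline[name]} ns, "
          f"{delta_pct(ns, baseline[name]):+.1f}%)")
print(len(regressions(current, baseline)), "regressions")
```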
Adding a new benchmark
- Write a kernel foo_bench.ea exporting the measurement targets as export func ... functions.
- Write a C harness foo_harness.c following the harness contract above.
- Write foo.bench.toml next to the kernel and harness.
- Capture the baseline: ea bench foo.bench.toml --update-baseline
- Commit kernel, harness, manifest, and baseline together.
Platform notes
- Pinning is Linux-only. ea bench uses taskset -c 0 when available. Mac / Windows runs report pinned: false in the result JSON, and measurements will be noisier.
- Cross-platform manifests must use arch = ["x86_64", "aarch64"] if the kernel + harness compile and run identically. Single-platform benchmarks use the matching single-element list.
- The cc binary can be overridden via the CC environment variable.
- Committed baselines reflect the maintainer's development host. CI runner measurements will diverge from these baselines (sometimes by 20%+ on libm-backed kernels) because of CPU model, microarchitecture, and shared-tenancy variance. In v1.13.0 the regression gate is warn-only specifically to collect this signal across one release before setting an enforced threshold. WARNING: N regressions exceed 10% threshold lines in CI artifacts are not (yet) actionable failures.
Eä vs NumPy
When does writing a kernel in Eä actually beat NumPy? The answer comes down to one thing: arithmetic intensity -- how many operations you perform per byte loaded from memory.
The memory bandwidth wall
Modern CPUs can process arithmetic far faster than they can load data from DRAM. A single DDR4-3200 channel peaks at 25.6 GB/s, so a typical dual-channel desktop sustains roughly 30-40 GB/s. NumPy's ufuncs are already compiled C with SIMD -- for simple operations, they saturate this bandwidth just like hand-written SIMD would.
Rule of thumb: if your operation does fewer than 2 arithmetic ops per element loaded, it is bandwidth-bound. Eä will match NumPy but not beat it.
Bandwidth-bound: Eä matches, no win
Array scaling
export func scale(src: *f32, dst: *mut f32, factor: f32, n: i32) {
let f: f32x8 = splat(factor)
let mut i: i32 = 0
while i < n {
let v: f32x8 = load(src, i)
store(dst, i, v .* f)
i = i + 8
}
}
dst = src * factor # NumPy: one SIMD multiply per element, same speed
One multiply per element loaded. Both Eä and NumPy hit ~35 GB/s on typical hardware. No winner.
Simple element-wise ops
Any operation that loads an element, does one thing, and stores the result is bandwidth-bound:
- dst[i] = src[i] + offset
- dst[i] = abs(src[i])
- dst[i] = src_a[i] + src_b[i]
NumPy already handles these at memory bandwidth speed. Writing an Eä kernel gains you nothing.
Compute-bound: Eä wins
Fused scale + bias + clamp
NumPy must make three separate passes over memory:
dst = np.clip(src * scale + bias, 0.0, 1.0) # 3 temporaries, 3 passes
Each pass loads the full array, computes one operation, and writes a temporary. For a 100 MB array, that is 600 MB of memory traffic.
Eä fuses everything into a single pass:
export func fused_scale_bias_clamp(src: *f32, dst: *mut f32, scale: f32, bias: f32, n: i32) {
let s: f32x8 = splat(scale)
let b: f32x8 = splat(bias)
let zero: f32x8 = splat(0.0)
let one: f32x8 = splat(1.0)
let mut i: i32 = 0
while i < n {
let v: f32x8 = load(src, i)
let result: f32x8 = fma(v, s, b)
let clamped: f32x8 = min(max(result, zero), one)
store(dst, i, clamped)
i = i + 8
}
}
One load, one FMA, one max, one min, one store. The data stays in registers the entire time. This is 3-5x faster than the NumPy version on large arrays because it reads memory once instead of three times.
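The memory-traffic arithmetic behind this comparison can be written down directly. The sketch below uses a deliberately simple model: every pass reads the array once and writes a same-sized result. It ignores caches and the cost of allocating temporaries, so it gives a lower bound on the gap rather than a prediction. The function name is hypothetical.

```python
def traffic_mb(array_mb, passes):
    """Bytes moved, in MB, when each pass reads the array and writes a result."""
    return passes * 2 * array_mb


numpy_mb = traffic_mb(100, passes=3)  # multiply, add, clip as separate passes
fused_mb = traffic_mb(100, passes=1)  # one fused Eä loop
print(numpy_mb, fused_mb, numpy_mb / fused_mb)  # 600 200 3.0
```

Under this model, traffic alone accounts for a 3x difference; anything beyond that comes from effects the model leaves out.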
Stencil operations
Convolutions, Sobel filters, and any operation that reads multiple neighboring elements per output pixel have high arithmetic intensity. A general 3x3 convolution reads 9 values and performs 9 multiplications plus 8 additions per output. Sobel's zero coefficients reduce that to 6 loads, 6 multiplications, and 5 additions per gradient -- still well above the compute-bound threshold.
See Image Processing for stencil patterns.
Custom reductions with branching
NumPy cannot express per-element branching in vectorized form. Operations like "accumulate values, but skip negatives and double values above a threshold" require Python-level loops or awkward np.where chains.
Eä gives you SIMD comparisons and masked operations in a single loop body, keeping the pipeline full.
Dot products and FMA chains
Any reduction that multiplies and accumulates benefits from FMA (fused multiply-add) and register-level accumulation:
let acc: f32x8 = fma(a, b, acc) // a * b + acc in one instruction
NumPy's np.dot is fast for large matrices (it calls BLAS), but for custom reductions, per-row operations, or non-standard accumulation patterns, Eä's explicit FMA beats NumPy's element-wise approach.
See ML Preprocessing for dot product and similarity patterns.
Decision checklist
Before writing an Eä kernel, ask:
- How many ops per element? If just one (scale, offset, abs), stay with NumPy.
- Does NumPy need multiple passes? If your expression chains 3+ operations, Eä's fusion wins.
- Is there a stencil or neighbor access? High arithmetic intensity -- Eä wins.
- Is there branching logic per element? NumPy cannot vectorize this -- Eä wins.
- Is it a standard BLAS operation? Use NumPy/SciPy -- they call optimized libraries.
Real-world packages
These packages demonstrate Eä beating NumPy on compute-bound workloads:
- easobel -- Sobel edge detection (stencil, ~9 ops/pixel)
- eastat -- CSV parsing (branching SIMD scan)
- eavec -- Vector similarity search (FMA-heavy dot products)
Image Processing
Image processing is one of Eä's strongest use cases. Stencil operations (convolution, edge detection, blur) read multiple neighbors per output pixel, giving high arithmetic intensity that keeps the CPU compute-bound rather than memory-bound.
Sobel edge detection
The Sobel filter computes horizontal and vertical gradients using 3x3 stencils. Each stencil has six nonzero coefficients, so each gradient reads 6 input values and performs 6 multiplications plus 5 additions per output pixel -- well above the threshold where Eä beats NumPy.
The pattern: for each pixel, load the 3x3 neighborhood, multiply by Sobel coefficients, and sum:
export func sobel_x(
src: *f32, dst: *mut f32,
width: i32, height: i32
) {
let neg1: f32x8 = splat(-1.0)
let pos1: f32x8 = splat(1.0)
let neg2: f32x8 = splat(-2.0)
let pos2: f32x8 = splat(2.0)
let mut y: i32 = 1
while y < height - 1 {
let mut x: i32 = 0
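// The right-hand loads read 8 floats starting at x + 2, so stop while x + 9 is still inside the row.
while x < width - 9 {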
let row_above: i32 = (y - 1) * width + x
let row_center: i32 = y * width + x
let row_below: i32 = (y + 1) * width + x
let tl: f32x8 = load(src, row_above)
let tr: f32x8 = load(src, row_above + 2)
let ml: f32x8 = load(src, row_center)
let mr: f32x8 = load(src, row_center + 2)
let bl: f32x8 = load(src, row_below)
let br: f32x8 = load(src, row_below + 2)
let gx: f32x8 = (tr .* pos1) .+ (mr .* pos2) .+ (br .* pos1)
.+ (tl .* neg1) .+ (ml .* neg2) .+ (bl .* neg1)
store(dst, row_center + 1, gx)
x = x + 8
}
y = y + 1
}
}
For a production-ready Sobel implementation with both Gx/Gy gradients and magnitude computation, see easobel.
Pixel pipeline: u8 to f32 and back
Images from disk arrive as u8 (0-255). SIMD math works best on f32. The typical pattern:
- Load u8 pixels
- Widen to f32 (0.0 to 255.0)
- Process in f32 (normalize, filter, blend)
- Narrow back to u8
Widening: u8 to f32
export func normalize_u8_to_f32(src: *u8, dst: *mut f32, n: i32) {
let scale: f32x4 = splat(0.00392156862)
let mut i: i32 = 0
while i < n {
let pixels: f32x4 = widen_u8_f32x4(src, i)
let normalized: f32x4 = pixels .* scale
store(dst, i, normalized)
i = i + 4
}
}
widen_u8_f32x4(ptr, offset) loads 4 bytes from src + offset, zero-extends each to 32 bits, and converts to float. The result is a f32x4 with values in 0.0 to 255.0. Multiply by 1/255 to get the 0.0-1.0 range.
Narrowing: f32 to u8
export func f32_to_u8(src: *f32, dst: *mut u8, n: i32) {
let s255: f32x4 = splat(255.0)
let zero: f32x4 = splat(0.0)
let mut i: i32 = 0
while i < n {
let v: f32x4 = load(src, i)
let clamped: f32x4 = min(max(v, zero), s255)
narrow_f32x4_i8(dst, i, clamped)
i = i + 4
}
}
narrow_f32x4_i8(ptr, offset, vec) converts 4 floats to integers, saturates to 0-255, and stores 4 bytes. Always clamp before narrowing to avoid overflow.
Saturating arithmetic on u8 pixels
For operations that stay in u8 (brightness adjustment, blending), use sat_add and sat_sub instead of widening to f32. These clamp to 0-255 in a single instruction on both ARM (NEON) and x86 (SSE2):
export func brighten(src: *u8, dst: *mut u8, boost: u8, n: i32) {
let b: u8x16 = splat(boost)
let mut i: i32 = 0
while i < n {
let pixels: u8x16 = load_u8x16(src, i)
let bright: u8x16 = sat_add(pixels, b)
store(dst, i, bright)
i = i + 16
}
}
No widening, no clamping, no f32 intermediates. One load, one saturating add, one store. This also works with i8x16 (signed), i16x8, and u16x8.
Putting it together
A full pixel pipeline (load u8, process in f32, store u8) processes 4 pixels per iteration on both x86 and ARM. Use f32x4 for the pipeline to keep it portable -- f32x8 works only on x86 with AVX2.
Why image stencils are compute-bound
A 3x3 convolution on a single-channel image performs 9 multiplications and 8 additions per output pixel, but only produces 1 output value. That is 17 arithmetic operations per output float -- far above the ~2 ops/element threshold where Eä's operation fusion matters.
For multi-channel images (RGB, RGBA), the arithmetic intensity is even higher because you process 3-4 channels per pixel position.
Compare this to simple brightness adjustment (pixel * 1.1), which is 1 op per element -- bandwidth-bound, and NumPy handles it just as fast. See Eä vs NumPy for more on this distinction.
Frame differencing with abs_diff (ARM)
On ARM, abs_diff computes per-pixel absolute difference in a single NEON instruction. This is useful for motion detection and video anomaly detection:
export func frame_diff(a: *u8, b: *u8, dst: *mut u8, n: i32) {
let mut i: i32 = 0
while i < n {
let fa: u8x16 = load_u8x16(a, i)
let fb: u8x16 = load_u8x16(b, i)
let diff: u8x16 = abs_diff(fa, fb)
store(dst, i, diff)
i = i + 16
}
}
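abs_diff supports i8x16, u8x16, i16x8, u16x8, i32x4, u32x4. ARM-only. On x86, use max(a .- b, b .- a) explicitly for signed types. For u8x16 and u16x8, where a wrapping subtraction would give the wrong result, sat_sub(a, b) .| sat_sub(b, a) computes the same value.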
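The x86 fallback deserves a quick check on unsigned pixels. The sketch below models a single u8 lane in plain Python and assumes that .- wraps modulo 256 on u8 lanes, as the underlying x86 byte subtract does; that wrapping behaviour is an assumption about Eä, not something stated above. It compares the max-of-differences form with a form built from sat_sub and a bitwise OR.

```python
def wrap_sub_u8(a, b):
    # Wrapping u8 subtraction (assumed behaviour of .- on u8 lanes).
    return (a - b) & 0xFF


def sat_sub_u8(a, b):
    # Saturating u8 subtraction: negative results clamp to 0.
    return max(a - b, 0)


def abs_diff_max(a, b):
    # max(a .- b, b .- a) with wrapping subtraction.
    return max(wrap_sub_u8(a, b), wrap_sub_u8(b, a))


def abs_diff_sat(a, b):
    # One of the two saturating differences is always 0, so OR picks the other.
    return sat_sub_u8(a, b) | sat_sub_u8(b, a)


print(abs_diff_max(5, 10), abs_diff_sat(5, 10))  # 251 5
```

Under the wrapping assumption, the max form is only safe when the subtraction cannot wrap, which is why the sat_sub form is the better fit for u8 frame differencing on x86.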
Tips
- Border handling: the examples above skip border pixels (starting at y=1, ending at height-1). For production kernels, handle borders separately with scalar code or clamped indexing.
- Separable filters: Gaussian blur and similar filters can be split into horizontal and vertical passes, reducing a 3x3 stencil from 9 to 6 operations. Each pass is still compute-bound.
- ARM portability: use f32x4 and i32x4 for kernels that need to run on both x86 and ARM. The 128-bit types work on both architectures.
Text Processing
Text processing benefits from SIMD when you can skip large regions of uninteresting bytes. The key pattern is chunk-skip: load 32 bytes at a time, check if any byte matches a target character, and skip the entire chunk if none match.
The chunk-skip pattern
Most bytes in a text file are not the character you are looking for. A newline scanner over a 1 MB file might find 10,000 newlines among 1,000,000 bytes -- 99% of bytes are not newlines, and with 32-byte chunks roughly 70% of chunks contain no match and can be skipped with a single comparison and branch. Rarer targets push that share higher.
export func count_newlines(data: *u8, n: i32) -> i32 {
let newline: u8x32 = splat_u8x32(10)
let mut count: i32 = 0
let mut i: i32 = 0
while i < n - 31 {
let chunk: u8x32 = load_u8x32(data, i)
let matches: u8x32 = chunk .== newline
let mask: i32 = movemask(matches)
if mask != 0 {
let mut j: i32 = 0
while j < 32 {
if load_u8(data, i + j) == 10 {
count = count + 1
}
j = j + 1
}
}
i = i + 32
}
let mut k: i32 = i
while k < n {
if load_u8(data, k) == 10 {
count = count + 1
}
k = k + 1
}
count
}
The structure:
- Load 32 bytes as u8x32
- Compare with splat_u8x32(target) using .== to get a match vector
- movemask() collapses the vector comparison to a single i32 bitmask
- If mask is 0: no matches in this chunk, skip ahead 32 bytes (fast path)
- If mask is nonzero: scan the 32 bytes individually (slow path, but rare)
- Scalar tail: handle remaining bytes that do not fill a full 32-byte chunk
The fast path processes 32 bytes in about 3 instructions. For rare target bytes, 90-99% of chunks take the fast path; denser targets, such as newlines in short-line text, take it less often.
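A byte-level reference in Python is useful for testing a chunk-skip kernel, and a one-line estimate shows how often the fast path fires. The sketch below mirrors the structure of count_newlines: the membership test stands in for the movemask check, and the final count handles the scalar tail. The estimate assumes matches are spread independently across bytes, which real text only approximates. Both function names are hypothetical.

```python
CHUNK = 32


def count_byte(data, target=10):
    """Count occurrences of target in data using the chunk-skip structure."""
    count = 0
    i = 0
    while i + CHUNK <= len(data):
        chunk = data[i:i + CHUNK]
        if target in chunk:               # stands in for movemask(...) != 0
            count += chunk.count(target)  # slow path: inspect this chunk
        i += CHUNK                        # fast path: skip the whole chunk
    count += data[i:].count(target)       # scalar tail
    return count


def cold_chunk_fraction(match_rate, chunk=CHUNK):
    """Share of chunks with no match if each byte matches independently."""
    return (1.0 - match_rate) ** chunk


text = b"ab\n" * 50                       # 150 bytes: 4 full chunks plus a tail
print(count_byte(text))                   # 50
print(round(cold_chunk_fraction(0.01), 3))  # 0.725
```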
Why not extract individual bit positions?
Eä has element-wise vector shift operators (.<<, .>>), but movemask returns a scalar i32 bitmask. Extracting individual bit positions from a scalar requires scalar bitwise operations (&, >>) which work on i32 values. When a chunk contains matches, fall back to byte-by-byte scanning within that 32-byte window. This is still fast because:
- Hot chunks are rare (most text is not your target character)
- The 32-byte scan is a tight loop that fits in L1 cache
- The overall speedup comes from skipping cold chunks, not from optimizing hot ones
ARM portability
movemask is an x86-only intrinsic (it maps to vpmovmskb). On ARM/NEON, there is no equivalent single instruction. For portable kernels, write separate x86 and ARM versions:
- x86: use u8x32 + movemask as shown above
- ARM: use u8x16 with scalar fallback, or structure the algorithm to avoid needing a bitmask
See ARM / NEON for details on architecture-specific intrinsics.
CSV parsing
CSV parsing combines the chunk-skip pattern with state tracking. You need to find commas and newlines while respecting quoted fields -- a comma inside quotes is not a delimiter.
The approach:
- Scan for quote characters (") using chunk-skip to track quote state
- Scan for delimiters (, and \n) using chunk-skip, filtering by quote state
- Parse numeric fields from the delimited regions
This is compute-bound because each byte potentially involves comparison against multiple target characters plus state logic -- exactly the kind of branching work that NumPy cannot vectorize.
For a complete CSV statistics package built on this pattern, see eastat.
Tips
- Buffer alignment: SIMD loads from unaligned addresses work but may be slower on older hardware. If you control the buffer, align to 32 bytes.
- Tail handling: always include a scalar loop for the last n % 32 bytes. Forgetting the tail is the most common bug in SIMD text processing.
- Multi-character search: to find any of several characters (e.g., <, >, & for HTML), do multiple .== comparisons and combine with .| before the movemask.
ML Preprocessing
ML preprocessing pipelines often apply the same sequence of operations to every element in large arrays: normalize, scale, compute similarities. These are natural fits for Eä kernels because they fuse multiple operations into single-pass loops.
Normalize: zero-mean, unit-variance
The standard normalization (x - mean) / std requires two operations per element. NumPy computes them as separate passes:
normalized = (data - mean) / std # 2 passes, 2 temporaries
Eä fuses this into one pass using FMA. Rewrite (x - mean) / std as x * (1/std) + (-mean/std):
export func normalize(
src: *f32, dst: *mut f32,
inv_std: f32, neg_mean_div_std: f32,
n: i32
) {
let s: f32x8 = splat(inv_std)
let b: f32x8 = splat(neg_mean_div_std)
let mut i: i32 = 0
while i < n {
let v: f32x8 = load(src, i)
let result: f32x8 = fma(v, s, b)
store(dst, i, result)
i = i + 8
}
}
The caller precomputes inv_std = 1.0 / std and neg_mean_div_std = -mean / std in Python. The kernel then does a single FMA per 8 elements -- one instruction that multiplies and adds simultaneously.
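The rewrite is ordinary algebra, and it is worth checking once on the Python side before trusting the precomputed constants. The sketch below compares the direct form with the multiply-add form on a few values chosen so that both are exact in floating point. On arbitrary data the two can differ by a rounding error, so compare with a tolerance there. The function names are hypothetical.

```python
def normalize_direct(x, mean, std):
    return (x - mean) / std


def normalize_fma(x, inv_std, neg_mean_div_std):
    # Same shape as the kernel's fma(v, s, b): one multiply, one add.
    return x * inv_std + neg_mean_div_std


mean, std = 3.0, 2.0
inv_std = 1.0 / std              # precomputed by the caller
neg_mean_div_std = -mean / std   # precomputed by the caller

values = [1.0, 3.0, 7.0]
direct = [normalize_direct(x, mean, std) for x in values]
fused = [normalize_fma(x, inv_std, neg_mean_div_std) for x in values]
print(direct, fused)  # [-1.0, 0.0, 2.0] [-1.0, 0.0, 2.0]
```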
Dot product: dual-accumulator pattern
A naive dot product uses one accumulator:
let mut acc: f32x8 = splat(0.0)
while i < n {
let a: f32x8 = load(x, i)
let b: f32x8 = load(y, i)
acc = fma(a, b, acc)
i = i + 8
}
This leaves performance on the table. FMA has a latency of 4-5 cycles but a throughput of 1 per cycle. With one accumulator, each FMA waits for the previous one to finish.
Dual accumulators hide this latency by interleaving independent FMA chains:
export func dot_product(x: *f32, y: *f32, out: *mut f32, n: i32) {
let mut acc0: f32x8 = splat(0.0)
let mut acc1: f32x8 = splat(0.0)
let mut i: i32 = 0
while i < n - 15 {
let a0: f32x8 = load(x, i)
let b0: f32x8 = load(y, i)
acc0 = fma(a0, b0, acc0)
let a1: f32x8 = load(x, i + 8)
let b1: f32x8 = load(y, i + 8)
acc1 = fma(a1, b1, acc1)
i = i + 16
}
while i < n {
let a: f32x8 = load(x, i)
let b: f32x8 = load(y, i)
acc0 = fma(a, b, acc0)
i = i + 8
}
let combined: f32x8 = acc0 .+ acc1
let result: f32 = horizontal_sum(combined)
store(out, 0, splat(result))
}
The CPU can execute fma(a0, b0, acc0) and fma(a1, b1, acc1) in parallel because they use different accumulators. This typically doubles throughput on modern CPUs.
Cosine similarity
Cosine similarity needs three reductions: dot(a, b), norm(a), and norm(b). Computing these separately means three passes over the data. Eä fuses all three into one pass:
export func cosine_similarity(a: *f32, b: *f32, out: *mut f32, n: i32) {
let mut dot_acc: f32x8 = splat(0.0)
let mut norm_a_acc: f32x8 = splat(0.0)
let mut norm_b_acc: f32x8 = splat(0.0)
let mut i: i32 = 0
while i < n {
let va: f32x8 = load(a, i)
let vb: f32x8 = load(b, i)
dot_acc = fma(va, vb, dot_acc)
norm_a_acc = fma(va, va, norm_a_acc)
norm_b_acc = fma(vb, vb, norm_b_acc)
i = i + 8
}
let dot_val: f32 = horizontal_sum(dot_acc)
let norm_a_val: f32 = horizontal_sum(norm_a_acc)
let norm_b_val: f32 = horizontal_sum(norm_b_acc)
let result: f32 = dot_val / sqrt(norm_a_val * norm_b_val)
store(out, 0, splat(result))
}
Three FMAs per loop iteration, all operating on the same loaded vectors. The data passes through cache once. NumPy would need np.dot(a, b), np.linalg.norm(a), and np.linalg.norm(b) -- three separate passes.
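A scalar reference with the same three accumulators makes it easy to validate the kernel's output on small inputs. This is a sketch in plain Python: it mirrors the single-pass structure but accumulates in double precision, so expect small differences from the f32 kernel. Like the kernel, it divides by zero if either vector is all zeros. The function name is hypothetical.

```python
import math


def cosine_similarity_ref(a, b):
    """Single pass over both vectors, three running sums."""
    dot = norm_a = norm_b = 0.0
    for x, y in zip(a, b):
        dot += x * y       # dot(a, b)
        norm_a += x * x    # squared norm of a
        norm_b += y * y    # squared norm of b
    return dot / math.sqrt(norm_a * norm_b)


print(cosine_similarity_ref([3.0, 4.0], [3.0, 4.0]))   # 1.0
print(cosine_similarity_ref([1.0, 0.0], [0.0, 1.0]))   # 0.0
print(cosine_similarity_ref([1.0, 0.0], [-1.0, 0.0]))  # -1.0
```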
Int8 quantized inference (ARM)
For quantized ML models using int8 weights, ARM provides dedicated matrix multiply instructions. On ARM with --i8mm (ARMv8.6-A, Cortex-A78+, Apple M1+):
export func matmul_i8_block(
acc: *mut i32, activations: *i8, weights: *i8, n: i32
) {
let mut a: i32x4 = splat(0)
let mut i: i32 = 0
while i < n {
let act: i8x16 = load(activations, i)
let wgt: i8x16 = load(weights, i)
a = smmla_i32(a, act, wgt)
i = i + 16
}
store(acc, 0, a)
}
smmla_i32 performs a 2x8 x 8x2 signed matrix multiply-accumulate in one instruction. Also available: ummla_i32 (unsigned x unsigned) and usmmla_i32 (unsigned activations x signed weights, the most common ML pattern).
For older ARM chips without I8MM, use vdot_i32 (requires --dotprod, ARMv8.2-A) for 4-way dot products, or wmul_i16/wmul_i32 for widening multiplies.
Fused Quantization Pipeline
Convert 32 float activations to int8 in 7 instructions, then feed directly to maddubs_i32:
kernel quantize_dot(activations: *const f32, weights: *const u8, out: *mut i32, inv_scale: f32, n: i32) {
let f0: f32x8 = load(activations, i * 32);
let f1: f32x8 = load(activations, i * 32 + 8);
let f2: f32x8 = load(activations, i * 32 + 16);
let f3: f32x8 = load(activations, i * 32 + 24);
let scale: f32x8 = splat(inv_scale);
let i0: i32x8 = round_f32x8_i32x8(f0 .* scale);
let i1: i32x8 = round_f32x8_i32x8(f1 .* scale);
let i2: i32x8 = round_f32x8_i32x8(f2 .* scale);
let i3: i32x8 = round_f32x8_i32x8(f3 .* scale);
let s01: i16x16 = pack_sat_i32x8(i0, i1);
let s23: i16x16 = pack_sat_i32x8(i2, i3);
let quant: i8x32 = pack_sat_i16x16(s01, s23);
let w: u8x32 = load(weights, i * 32);
let dot: i32x8 = maddubs_i32(w, quant);
store(out, i * 8, dot);
}
Pipeline: 4x round + 2x pack_i32 + 1x pack_i16 = 7 instructions for 32 floats to 32 int8. Then 1x maddubs_i32 = 8 total.
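The numeric effect of the round and pack steps can be modelled per element: scale, round half to even, then saturate to the int8 range. The sketch below does that in plain Python. It assumes round-to-nearest-even (as stated for the 128-bit variants elsewhere in this documentation) and treats the two saturating packs as one clamp, which gives the same result because clamping to i16 and then to i8 equals clamping to i8. It works in double precision and does not model lane ordering or the maddubs_i32 step. The function name is hypothetical.

```python
def quantize_i8(values, inv_scale):
    """Scale, round half to even, and saturate each value to [-128, 127]."""
    out = []
    for v in values:
        q = round(v * inv_scale)             # Python's round() rounds halves to even
        out.append(max(-128, min(127, q)))   # combined effect of both saturating packs
    return out


print(quantize_i8([0.5, 1.5, 2.5, 300.0, -300.0], 1.0))  # [0, 2, 2, 127, -128]
```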
Fused Quantization Pipeline (ARM)
The 256-bit intrinsics above are x86-only. On ARM, use the 128-bit cross-platform variants to process 16 floats at a time:
kernel quantize_dot_arm(activations: *const f32, weights: *const i8, out: *mut i32, inv_scale: f32, n: i32) {
let f0: f32x4 = load(activations, i * 16);
let f1: f32x4 = load(activations, i * 16 + 4);
let f2: f32x4 = load(activations, i * 16 + 8);
let f3: f32x4 = load(activations, i * 16 + 12);
let scale: f32x4 = splat(inv_scale);
let i0: i32x4 = round_f32x4_i32x4(f0 .* scale);
let i1: i32x4 = round_f32x4_i32x4(f1 .* scale);
let i2: i32x4 = round_f32x4_i32x4(f2 .* scale);
let i3: i32x4 = round_f32x4_i32x4(f3 .* scale);
let s01: i16x8 = pack_sat_i32x4(i0, i1);
let s23: i16x8 = pack_sat_i32x4(i2, i3);
let quant: i8x16 = pack_sat_i16x8(s01, s23);
let w: i8x16 = load(weights, i * 16);
let dot: i32x4 = vdot_i32(quant, w);
store(out, i * 4, dot);
}
Compile with --dotprod for vdot_i32. The round_f32x4_i32x4, pack_sat_i32x4, and pack_sat_i16x8 intrinsics are cross-platform and require no extra flags.
Batch operations
For ML workloads, you often apply the same operation to many rows. The Python side handles the loop over rows, calling the Eä kernel for each:
import ea
lib = ea.load("similarity.ea")
for i in range(num_queries):
    # Score query[i] against the matching database row.
    lib.cosine_similarity(query[i], database[i], result, dim)
The kernel handles the inner loop (over vector dimensions) with SIMD. The outer loop (over queries or rows) stays in Python. This is the right split -- the inner loop is where SIMD matters.
Real-world package
eavec implements vector similarity search using the dual-accumulator FMA pattern. It computes cosine similarity across a database of vectors, returning the top-k most similar results.
Native f16 Inference
LLM weights ship as f16. KV cache rows are f16. Attention reads f16 every
step. Until v1.11.0, every one of those reads paid for a cvt_f16_f32
into an f32 register before any compute could happen — the storage was
narrow, but the SIMD math was wide. On Pi 5 (Cortex-A76 with FEAT_FP16
silicon) that round-trip is now optional. Compile with --fp16 and
f16x8 arithmetic lowers straight to NEON's fadd v.8h / fmul v.8h /
fmla v.8h — half the register pressure, no cvt, no widening.
This is opt-in. Existing code that uses cvt_f16_f32 / cvt_f32_f16 keeps
working unchanged — they're stable on any AArch64. --fp16 is the
"compute stays in half-precision the whole way" mode for hot paths that
load f16, multiply f16, and write f16 back.
Where the round-trip hurts
In an LLM attention kernel, every token's Q ⋅ Kᵀ inner product loads a row of cached keys. Storage is f16 — that's the whole point of a half KV cache. Compute is f32. The original portable shape looks like this:
// Before: f32 round-trip on every f16 load.
// Storage is f16, compute is f32, every load pays the cvt cost.
export func rmsnorm_via_f32(x: *i16, scale: f32, out: *mut i16) {
let v_h: i16x4 = load(x, 0)
let v: f32x4 = cvt_f16_f32(v_h)
let sq: f32x4 = v .* v
let sum: f32 = reduce_add(sq)
let n: f32 = 4.0
let eps: f32 = 0.000001
let denom: f32 = sqrt(sum / n + eps)
let inv: f32 = 1.0 / denom
let inv_v: f32x4 = splat(inv * scale)
let r: f32x4 = v .* inv_v
let r_h: i16x4 = cvt_f32_f16(r)
store(out, 0, r_h)
}
(Half-precision values are addressed as i16 here because the
cvt-based path predates the f16 type — cvt_f16_f32(i16x4) is
the documented bit-level entry point.)
That kernel works on any AArch64. But every cvt_f16_f32 is real work
— a fcvtl v.4s, v.4h instruction, plus the data spends the whole
compute pass in 128-bit f32 registers, halving how many lanes fit and
doubling register pressure on FMA chains.
What --fp16 changes
With the new f16 scalar / f16x4 / f16x8 types and --fp16 enabled,
the same shape stays in half-precision end to end:
// Native f16 — no f32 round-trip on the per-element path.
// Storage is f16; compute is f16 except for the single scalar sqrt
// (no useful f16 scalar sqrt on Cortex-A76; the round-trip is one
// operation, not N).
export func rmsnorm_f16(x: *f16, scale: f16, out: *mut f16) {
let v: f16x8 = load(x, 0)
let sq: f16x8 = v .* v
let sum: f16 = reduce_add(sq)
// Scalar branch: sum/N + eps, sqrt, reciprocal. Done in f32 because
// there is no single-lane f16 sqrt instruction on Cortex-A76.
let n: f16 = to_f16(8.0)
let mean_h: f16 = sum / n
let mean_f: f32 = to_f32(mean_h)
let eps: f32 = 0.000001
let inv_f: f32 = 1.0 / sqrt(mean_f + eps)
let inv_h: f16 = to_f16(inv_f)
// Back to f16x8 for the per-element multiply.
let inv_v: f16x8 = splat(inv_h)
let scale_v: f16x8 = splat(scale)
let r: f16x8 = v .* inv_v .* scale_v
store(out, 0, r)
}
Build with:
ea rmsnorm.ea --lib --fp16 --target-triple=aarch64-unknown-linux-gnu
The vector multiplies and the reduction now lower to fmul v.8h and
faddv h, v.8h directly — the LLVM IR has <8 x half> everywhere on
the per-lane path, and fpext only appears around the scalar sqrt. Eight
lanes fit in a single Q register instead of eight lanes spread across two
f32 registers, so chained FMAs hit twice the per-cycle throughput on the
A76's pipelines.
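To check a kernel like rmsnorm_f16 against a known-good answer, a host-side reference can help. The sketch below is an illustration under stated assumptions, not the project's test fixture: `round_f16` and `rmsnorm_ref` are hypothetical helper names, half-precision rounding is emulated with Python's `struct` format `"e"`, and the reduction runs in sequential order because the hardware's reduction order is not specified here.

```python
import math
import struct

def round_f16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct rejects finite values beyond the f16 range; hardware would give inf.
        return math.copysign(math.inf, x)

def rmsnorm_ref(x, scale, eps=1e-6):
    """Reference RMSNorm over one 8-lane group, rounding to f16 where the kernel does."""
    v = [round_f16(e) for e in x]
    sq = [round_f16(e * e) for e in v]
    total = 0.0
    for s in sq:                                   # sequential order (assumption)
        total = round_f16(total + s)
    mean = round_f16(total / len(v))
    inv = round_f16(1.0 / math.sqrt(mean + eps))   # scalar branch, then back to f16
    return [round_f16(round_f16(e * inv) * round_f16(scale)) for e in v]

out = rmsnorm_ref([1.0] * 8, 2.0)
print(out)                          # each lane is close to 2.0
print(round_f16(300.0 * 300.0))     # inf: 90000 exceeds the f16 maximum of 65504
```

The second print is a reminder that squares and their sum live in a much narrower range in f16 than in f32, which is one reason to check accuracy as well as speed before switching a kernel over.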
Attention dot-product
The same idea applies to the inner product over the KV row. The version that
predates --fp16 converted on every load; the native form fuses without the cvt:
// Attention dot product over an f16 KV cache row.
// Before --fp16: each f16x8 load became cvt_f16_f32 to an f32x8 register
// before any compute. With --fp16, the cvt is gone — fma runs directly
// on <8 x half> registers.
export func attn_dot_f16(q: *f16, k: *f16, n: i32) -> f16 {
let mut acc: f16x8 = splat(to_f16(0.0))
let mut i: i32 = 0
while i < n {
let qv: f16x8 = load(q, i)
let kv: f16x8 = load(k, i)
acc = fma(qv, kv, acc)
i = i + 8
}
return reduce_add(acc)
}
Each loop iteration is one ld1 v.8h × 2, one fmla v.8h, no cvt. The
load-to-FMA chain stays in 128-bit Q registers the whole way.
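A small host-side model can make the behaviour of an f16 accumulator concrete. This is a sketch, not the kernel's implementation: `attn_dot_ref` is a hypothetical name, inputs are assumed to be exactly representable in f16, and each fma is modelled as a double-precision multiply-add followed by one rounding to half precision.

```python
import struct

def round_f16(x: float) -> float:
    """Round to the nearest IEEE half-precision value (inputs here stay in range)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def attn_dot_ref(q, k, lanes=8):
    """Model of attn_dot_f16: eight per-lane accumulators, then a horizontal add."""
    assert len(q) == len(k) and len(q) % lanes == 0
    acc = [0.0] * lanes
    for i in range(0, len(q), lanes):
        for j in range(lanes):
            acc[j] = round_f16(q[i + j] * k[i + j] + acc[j])   # one rounding per fma
    total = 0.0
    for a in acc:
        total = round_f16(total + a)
    return total

print(attn_dot_ref([1.0] * 16, [0.5] * 16))                 # 8.0
print(attn_dot_ref([1.0] * (8 * 4096), [1.0] * (8 * 4096)))  # 16384.0, not 32768.0
```

In this emulation the second call stalls because each lane's accumulator reaches 2048, where adding 1.0 no longer changes an f16 value. Long rows with large partial sums may therefore need periodic widening or a different accumulation strategy; whether that matters for a given model is something to measure.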
When to reach for it
Use native f16 when:
- Storage is f16 and compute is dominated by element-wise multiply-add (KV cache reads in attention, weight-times-activation in MLP rows, RMSNorm scaling, RoPE rotation).
- The target hardware actually has FEAT_FP16 (Cortex-A76 / A78 / X1+, Apple M-series, Neoverse V1+). On hardware without it, --fp16 errors before codegen; there's no silent fallback.
- You measure first. The win comes from register pressure and SIMD width, not from the cvt instruction itself; bandwidth-bound shapes that already saturate memory won't speed up.
Keep the f32 round-trip when:
- A single scalar sqrt / divide / reciprocal lives on the critical path. The cvt is one instruction; f32's better-conditioned scalar math is worth more than the cvt costs. The rmsnorm_f16 example above does exactly this for the per-batch sqrt.
- The kernel needs to ship to non-FEAT_FP16 hardware. The cvt_f16_f32 / cvt_f32_f16 path is the portable form and does not go away in v1.11.0; it works with or without --fp16.
Performance
Quantitative numbers ship with Phase 6 of the v1.11.0 audit
(post-merge). The architectural story is straightforward: the cvt is
gone from the load-to-compute path and the lane count per register
doubles from 4 to 8, so attention / RMSNorm / RoPE hot paths fall to
roughly the same per-instruction shape as a hand-written
vfmlal_low_f16 chain. Olorin's gemma4 inference uses this path under
--fp16. See the spec for the design rationale.
See also
- End-to-end test fixture: tests/data/rmsnorm_f16.ea
- ARM FP16 test suite: tests/phase14_arm_fp16.rs
Fast Transcendentals: exp_poly_f32
Eä's exp() intrinsic calls libm via llvm.exp.v*f32. No current ISA has a
hardware vector exp instruction for LLVM to target: not AVX2, not AVX-512,
not NEON, not SVE. So exp(f32x8) always lowers to a loop of
eight sequential expf calls. The SIMD pattern in the source is
cosmetic: throughput stays at one element per scalar libm call.
In a softmax loop or a tanh-GELU activation, that scalarization is the
entire kernel. exp_poly_f32(f32xN) -> f32xN (new in v1.11.0) replaces
it with a polynomial that stays in SIMD registers — seven to eight FMAs
per lane, no libm call, no scalarization.
The throughput win depends on the baseline:
- Modern x86 with a fast expf in glibc (AMD Zen 4 / glibc 2.42 on our reference benchmark): 2.93× isolated, 2.60× inside softmax.
- Pi 5 (ARM Cortex-A76, glibc expf, no libmvec) measured inside a real GELU kernel in Olorin: 2.23× end-to-end on gemma4_gelu, consistent across 64-12288-lane shapes.
- Older libm or scalar-only environments without a vectorized expf: the gap widens. The spec's original "~10×" headline holds against the slowest baselines but does not match every modern glibc.
The win is real on every baseline measured; the magnitude is
environment-sensitive (glibc 2.42's expf is faster than the spec
assumed, which compresses the headline).
The contract
exp_poly_f32 is not a drop-in replacement for exp(). It trades
two things for the throughput:
- Bounded input range. It is defined on [-50, 50] per lane. Outside that range the polynomial diverges: no NaN or Inf guarantees, no clamping. The caller clamps if their inputs may exceed.
- Bounded accuracy. Relative error is ≤ 2⁻¹⁸ (~3.8e-6) inside the safe range. That's enough for softmax (normalize-by-sum absorbs the error) and for tanh-GELU activations. It is not enough for anything that needs full f32 precision; keep exp() for those.
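For readers who have not seen a polynomial exp before, the general technique can be sketched in a few lines. This is not Eä's implementation: the coefficients below are plain Taylor terms rather than whatever the compiler actually emits, the arithmetic is double precision rather than f32, and `exp_poly_sketch` is a hypothetical name. It only shows the shape: range reduction, a short Horner chain of multiply-adds, and a power-of-two rescale.

```python
import math

LN2 = math.log(2.0)

def exp_poly_sketch(x: float) -> float:
    """Write x = k*ln2 + r with |r| <= ln2/2, evaluate a degree-7 polynomial
    for e^r with Horner's rule, then scale the result by 2^k."""
    k = round(x / LN2)
    r = x - k * LN2
    p = 1.0 / 5040.0                               # 1/7!
    for c in (1 / 720, 1 / 120, 1 / 24, 1 / 6, 1 / 2, 1.0, 1.0):
        p = p * r + c                              # one multiply-add per coefficient
    return math.ldexp(p, k)                        # p * 2**k

# Sample the documented input range and compare against libm.
worst = max(abs(exp_poly_sketch(i / 100.0) / math.exp(i / 100.0) - 1.0)
            for i in range(-5000, 5001))
print(worst < 2.0 ** -18)                          # True in double precision
```

The seven-step Horner loop lines up with the "seven to eight FMAs per lane" figure above. Outside the sampled range nothing in the sketch clamps or checks, which mirrors the contract: the caller owns the bounds.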
The name encodes the trade. Reading exp_poly_f32 at a call site, you
know it's a polynomial on f32 lanes — same explicitness style as
pack_sat_i32x8, widen_u8_i32x4, cvt_f16_f32.
Softmax
The canonical use case is numerically-stable softmax over a small fixed window. The maximum subtract keeps inputs in a tight range, exp is the dominant cost, and the final normalize absorbs the polynomial's relative error.
// Stable softmax over 8 lanes using exp_poly_f32.
// reduce_max keeps inputs in a tight range so we never approach the
// [-50, 50] contract edge — the test fixture uses values 1..8 and the
// shifted range is [-7, 0].
export func softmax(x: *f32, out: *mut f32) {
let v: f32x8 = load(x, 0)
let mx: f32 = reduce_max(v)
let mxv: f32x8 = splat(mx)
let shifted: f32x8 = v .- mxv
let ev: f32x8 = exp_poly_f32(shifted)
let s: f32 = reduce_add(ev)
let inv: f32 = 1.0 / s
let invv: f32x8 = splat(inv)
let r: f32x8 = ev .* invv
store(out, 0, r)
}
The full integration test for this shape (test_exp_poly_f32_softmax_integration
in tests/phase14_exp_poly.rs) compares against expf-based reference
softmax with relative tolerance 1e-3 (~2⁻¹⁰) — the polynomial's 2⁻¹⁸
error compounds through reduce_add, but the normalize-by-sum still
collapses it to a tolerance that's well below softmax's typical
numerical envelope.
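The claim that normalize-by-sum absorbs the polynomial's error can be checked with a little arithmetic. The sketch below does not call exp_poly_f32; it assumes the stated 2⁻¹⁸ bound and injects that relative error by hand, with alternating signs per lane as a rough worst case, using the fixture's inputs 1..8.

```python
import math

REL_ERR = 2.0 ** -18                     # stated relative error bound

def softmax_from_exps(exps):
    s = sum(exps)
    return [e / s for e in exps]

x = [float(i) for i in range(1, 9)]      # the fixture's values 1..8
shifted = [v - max(x) for v in x]        # range [-7, 0]
exact = [math.exp(v) for v in shifted]
# Perturb every lane by the full error bound, alternating the sign.
noisy = [e * (1.0 + (REL_ERR if i % 2 else -REL_ERR)) for i, e in enumerate(exact)]

ref = softmax_from_exps(exact)
approx = softmax_from_exps(noisy)
worst = max(abs(a / r - 1.0) for a, r in zip(approx, ref))
print(worst < 1e-3)                      # True: far inside the test tolerance
```

Each output's relative error is bounded by roughly twice the per-lane error, because the numerator and the sum can each be off by at most 2⁻¹⁸ in opposite directions.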
tanh-GELU via tanh_approx_f32
The GELU activation gelu(x) ≈ 0.5 · x · (1 + tanh(c · (x + 0.044715 · x³)))
with c = sqrt(2/π) is the canonical use of a fast vector tanh.
tanh_approx_f32(f32xN) -> f32xN (new in v1.14.0) lowers to a rational
P(x²) · x / Q(x²) minimax-tuned for f32, branchless and bounded to
~3e-7 absolute error across the body, with internal clamping at ±9
handling saturation:
// GELU(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
export func gelu_tanh(input: *f32, output: *mut f32) {
let x: f32x8 = load(input, 0)
// inner = sqrt(2/pi) * (x + 0.044715 * x^3)
let c0: f32x8 = splat(0.7978845608)
let c1: f32x8 = splat(0.044715)
let xsq: f32x8 = x .* x
let xcube: f32x8 = xsq .* x
let inner: f32x8 = c0 .* fma(c1, xcube, x)
// tanh_approx_f32 clamps internally; no defensive bound needed.
let t: f32x8 = tanh_approx_f32(inner)
// gelu = 0.5 * x * (1 + t)
let half: f32x8 = splat(0.5)
let one: f32x8 = splat(1.0)
let result: f32x8 = half .* x .* (one .+ t)
store(output, 0, result)
}
This replaces the earlier (exp_poly_f32(2x) - 1) / (exp_poly_f32(2x) + 1)
algebraic-identity workaround. The workaround suffers catastrophic
cancellation in the numerator for small |x| (e^{2·1e-3} - 1 loses
~3 digits of precision in f32), degrading to ~10⁻³ relative error near
zero where GELU is most sensitive. The dedicated intrinsic skips both
that pitfall and the second arithmetic chain, while costing one fdiv
instead of one fdiv plus the exp_poly_f32 polynomial.
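The cancellation argument can be reproduced numerically. The sketch below assumes only the 2⁻¹⁸ relative error bound quoted for exp_poly_f32 and applies it to an exact exponential in double precision; `tanh_via_exp` is a hypothetical helper, not a function from Eä.

```python
import math

REL_ERR = 2.0 ** -18                     # exp_poly_f32's stated relative error bound

def tanh_via_exp(x: float, rel_err: float) -> float:
    """tanh(x) = (e^{2x} - 1) / (e^{2x} + 1), with e^{2x} off by rel_err."""
    e = math.exp(2.0 * x) * (1.0 + rel_err)
    return (e - 1.0) / (e + 1.0)

for x in (1e-3, 1e-1, 1.0):
    err = abs(tanh_via_exp(x, REL_ERR) / math.tanh(x) - 1.0)
    print(f"x={x:g}  relative error ~ {err:.1e}")
```

At x = 1e-3 the relative error of the result comes out around 2e-3, several hundred times larger than the error that went in, because subtracting 1 from a value near 1 leaves mostly error. At x = 1 the same input error is barely amplified at all.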
Safely clamping in-range
The contract is bounded on [-50, 50] per lane, so the safest defensive
pattern is a max-then-min clamp:
let hi: f32x8 = splat(50.0)
let lo: f32x8 = splat(-50.0)
let safe: f32x8 = min(max(unbounded, lo), hi)
let e: f32x8 = exp_poly_f32(safe)
Both min and max lower to single NEON fminnm / fmaxnm (or x86
vminps / vmaxps) instructions, so the clamp adds two instructions of
overhead to a polynomial that's already 7-8 FMAs per lane.
If you can prove statically that your input range is bounded — softmax
after a reduce_max subtract, for instance — skip the clamp. The
guarantee is mathematical, not enforced; the polynomial does not check.
When to keep exp()
exp_poly_f32 is the right tool for SIMD hot paths where 2⁻¹⁸ relative
error is acceptable. Examples: softmax, GELU, attention scoring,
exponential decay in beam search.
Keep exp() (which calls libm) for:
- Scalar code, or vector code where the SIMD scalarization isn't visible (rare loops, one-shot computations).
- Code that needs full f32 precision — physics, statistics, log-sum-exp inside an accuracy-critical reduction.
- Inputs that may exceed [-50, 50] and a libm-correct NaN/Inf is required.
See also
- Test fixtures (accuracy, range, softmax integration): tests/phase14_exp_poly.rs
NEON Gather Workaround
gather(p: *f32, idx: i32xN) -> f32xN lowers to AVX2's vgatherdps
on x86 — one instruction, four (or eight) indexed loads, one result
vector. On AArch64 NEON there is no such instruction. Pi 5's Cortex-A76
has no SVE / SVE2 either, so there's no ld1w {z0.s}, p0/z, [x1, z1.s, sxtw] fallback. A real gather is hardware-impossible on this target.
Pre-v1.11.0, Eä errored on gather() for ARM with a message that said
"use a scalar loop on ARM" — true but unhelpful, and the de-facto
workaround everyone arrived at (a stack-buffer round-trip through
memory) was strictly worse than the canonical scalar-compose pattern
already used by llama.cpp's IQ kernels.
v1.11.0 ships two new vector-construction intrinsics —
f32x4_from_scalars and f32x8_from_scalars — and rewrites the ARM
gather() error to point at them by name plus
docs/idioms/neon-gather.md. The pattern
that used to be "figure it out yourself" is now a one-line compose with
a documented idiom.
The IQ3 motivation
Olorin's IQ3_S / IQ3_XXS dequant kernels need exactly this. Each block holds packed 3-bit indices into a 256-entry lookup table of dequantized f32 values. The hot loop reads four indices, fetches four LUT entries, and writes them out as an f32x4 vector.
On x86 the kernel is one line:
// AVX2 gather: works on x86 but errors on ARM with a pointer to f32x{4,8}_from_scalars.
export func lut_dequant_x86(lut: *f32, idx: i32x4, out: *mut f32) {
let v: f32x4 = gather(lut, idx)
store(out, 0, v)
}
That kernel hard-errors on AArch64 at codegen time with:
gather has no NEON equivalent on ARM. Use scalar load_u32 + f32x4_from_scalars (or f32x8_from_scalars) to compose the result explicitly. See docs/idioms/neon-gather.md for the canonical pattern.
The compose pattern
The workaround is to read the four (or eight) values one at a time and build the vector explicitly. The new intrinsic does the building:
// IQ3 LUT dequant pattern: gather four entries from a lookup table.
export func iq3_lut_dequant_lane(lut: *f32, indices: *i32, out: *mut f32) {
let i0: i32 = indices[0]
let i1: i32 = indices[1]
let i2: i32 = indices[2]
let i3: i32 = indices[3]
let v0: f32 = lut[i0]
let v1: f32 = lut[i1]
let v2: f32 = lut[i2]
let v3: f32 = lut[i3]
let v: f32x4 = f32x4_from_scalars(v0, v1, v2, v3)
store(out, 0, v)
}
The same shape works for an 8-wide gather over an i32x8 index vector
via f32x8_from_scalars(...) — same idea, eight scalar loads instead
of four.
For the real per-row loop, the four scalar reads sit visibly inside an ordinary while loop:
// IQ3 LUT dequant row: walk an index array, fetch 4 LUT entries at a time,
// store into the output stream. The four scalar loads sit visibly in the
// source — the programmer sees the cost.
export func iq3_dequant_row(lut: *f32, indices: *i32, out: *mut f32, n: i32) {
let mut i: i32 = 0
while i < n {
let v0: f32 = lut[indices[i]]
let v1: f32 = lut[indices[i + 1]]
let v2: f32 = lut[indices[i + 2]]
let v3: f32 = lut[indices[i + 3]]
let v: f32x4 = f32x4_from_scalars(v0, v1, v2, v3)
store(out, i, v)
i = i + 4
}
}
Build with --target-triple=aarch64-unknown-linux-gnu. LLVM's NEON
lowering folds the insertelement chain inside f32x4_from_scalars
into a sequence of ins v.s[i] instructions — the same code GCC and
Clang produce for vsetq_lane_f32 chains. No stack buffer, no
load-and-shuffle dance.
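When porting a gather-based kernel to the compose pattern, it helps to have a host-side model to compare outputs against. The sketch below is plain Python with hypothetical names (`gather_ref`, `dequant_row_ref`); it mirrors the structure of iq3_dequant_row above and assumes n is a multiple of 4, since the kernel as shown has no tail handling.

```python
def gather_ref(lut, idx):
    """What gather(lut, idx) computes: one table read per lane."""
    return [lut[i] for i in idx]

def dequant_row_ref(lut, indices, n):
    """Host-side model of iq3_dequant_row: four scalar reads composed per step."""
    assert n % 4 == 0, "the kernel as written has no tail handling"
    out = [0.0] * n
    i = 0
    while i < n:
        v0 = lut[indices[i]]
        v1 = lut[indices[i + 1]]
        v2 = lut[indices[i + 2]]
        v3 = lut[indices[i + 3]]
        out[i:i + 4] = [v0, v1, v2, v3]    # stands in for f32x4_from_scalars + store
        i += 4
    return out

lut = [x * 0.5 for x in range(256)]        # made-up 256-entry table
indices = [7, 0, 255, 3, 3, 128, 1, 64]
print(dequant_row_ref(lut, indices, 8) == gather_ref(lut, indices))   # True
```

The two functions agree by construction, which is the point: the compose pattern changes how the vector is built, not what it contains.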
Why no silent fallback?
Eä's design rule is "the programmer sees the cost." A silent scalar
fallback for gather() on ARM would hide a 4-8× slowdown behind syntax
that looks like SIMD. The compose pattern keeps the cost explicit:
four scalar reads appear as four scalar reads in the source. A
profiler shows four ldr s0, [x1, x2, lsl #2] instructions, not a
deceptively single-line gather that hides them. A reviewer reading the
diff sees the loads. Future you reading the source in six months sees
the loads. There is no hidden performance cliff.
This is the same philosophy that gates --fp16 behind a flag and
errors on f16x8 arithmetic without it: explicit cost beats convenient
syntax that masks slowdown.
When SVE2 lands
If your ARM target implements SVE or SVE2 (Graviton 3 and later, for
example), the hardware does have a real gather (ld1w with a
scalar-plus-vector addressing mode). Eä's SVE2 codegen is deferred to a
future release. Until then, the compose pattern is the universal AArch64
workaround: it works on Pi 5 today, and it stays correct on SVE-capable
targets even after a hardware-gather lowering lands (it just gets
superseded by the better path on that target).
See also
- Terse reference idiom: docs/idioms/neon-gather.md
- Test coverage: tests/phase14_arm_neon.rs (test_f32x4_from_scalars, test_f32x8_from_scalars, test_gather_on_arm_points_to_compose)
Beating NumPy's BLAS at Constant-Q Transform with 80 Lines of Eä
A SIMD kernel compiler, a DFT nobody asked for, and an honest benchmark.
The Problem
Audio spectrum visualizers typically use FFT. FFT gives you linearly-spaced frequency bins — great for math, terrible for music. Human pitch perception is logarithmic: the distance from C2 to C3 (one octave) matters as much as C5 to C6. But FFT allocates equal resolution across the entire spectrum, wasting detail in the bass and over-resolving the treble.
The Constant-Q Transform (CQT) solves this. Each frequency bin gets its own window length — long windows for low frequencies (more cycles needed to resolve pitch), short windows for high frequencies. The result: 84 bins spanning 7 octaves, one per semitone, C2 through B8.
The catch: FFT can't do CQT. You need a DFT-per-bin, which is O(n*k) instead of O(n log n). The standard NumPy approach is to build a complex kernel matrix and let BLAS handle it with a single matrix-vector multiply.
We wanted to see if a hand-written SIMD kernel in Eä could beat that.
What is Eä
Eä is a compute kernel compiler. You write .ea files in a C-like language with explicit SIMD vector types, compile them to native shared libraries, and call them from Python with NumPy arrays. No C toolchain, no Cython, no JIT warmup.
export func scale(src: *f32, dst: *mut f32, factor: f32, n: i32) {
let s: f32x8 = splat(factor)
let mut i: i32 = 0
while i < n {
let v: f32x8 = load(src, i)
store(dst, i, v .* s)
i = i + 8
}
}
Eä doesn't have sin or cos intrinsics. This is a deliberate design choice — trig is a policy decision (how many polynomial terms? what accuracy? what range?) that the language refuses to hide behind a simple-looking function call. If you need trig, you precompute tables in Python or write an explicit FMA polynomial chain.
This turns out to be the key insight for CQT.
The Design
A CQT with 84 bins (7 octaves, 12 semitones each) starting at C2 (65.4 Hz) at 44.1 kHz sample rate has these properties:
| Bin | Note | Frequency | Window Length |
|---|---|---|---|
| 0 | C2 | 65.4 Hz | 11,339 samples |
| 33 | A4 | 440 Hz | 1,684 samples |
| 83 | B8 | 7,903 Hz | 94 samples |
The quality factor Q = 16.8 (constant across all bins — that's what "constant-Q" means). Total work per frame: 200,476 FMA operations across all bins.
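The bin layout can be reproduced in a few lines. The window-length formula, ceil(Q · sample_rate / f), is the one used later in this post; the exact values of F_MIN and Q are assumptions here (C2 derived from A4 = 440 Hz, and Q = 1 / (2^(1/12) − 1), which evaluates to about 16.8). Exact lengths depend on those constants and on rounding, so treat the printed values as a consistency check rather than a specification.

```python
import math

SAMPLE_RATE = 44100
N_BINS = 84
F_MIN = 440.0 * 2.0 ** (-33 / 12)        # C2, assuming A4 = 440 Hz sits at bin 33
Q = 1.0 / (2.0 ** (1 / 12) - 1.0)        # assumed Q formula for semitone spacing

freqs = [F_MIN * 2.0 ** (k / 12) for k in range(N_BINS)]
lengths = [math.ceil(Q * SAMPLE_RATE / f) for f in freqs]

print(round(Q, 1))                       # 16.8
print(round(freqs[0], 1))                # 65.4
print(lengths[0], lengths[-1])           # 11339 94
```

The longest window (bin 0) sets max_window_len, which is why the kernel takes it as a parameter and right-aligns every shorter window against it.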
The approach: precompute cos/sin twiddle factor tables in Python (one-time cost, 1.6 MB), then let Eä handle the per-frame FMA loop. We bake the Hann window directly into the twiddle factors during precomputation:
# Python: precompute once
for k in range(n_bins):
n_k = int(lengths[k])
i = np.arange(n_k, dtype=np.float64)
window = 0.5 * (1 - np.cos(2 * np.pi * i / n_k))
angle = 2 * np.pi * freqs[k] * i / SAMPLE_RATE
cos_parts.append((window * np.cos(angle)).astype(np.float32))
sin_parts.append((window * -np.sin(angle)).astype(np.float32))
This means the Eä kernel needs zero trig and zero windowing logic. The inner loop is pure FMA.
The Kernel
Version 1: Direct DFT with Dual Accumulators
The first kernel processes one frequency bin at a time. For each bin, it reads the audio segment, multiplies against precomputed cos and sin twiddle factors, accumulates, and computes the magnitude. Smoothing (exponential decay for the "falling bars" effect) is fused into the output — no separate kernel call, no intermediate array.
Two independent FMA accumulator chains (r0/r1, i0/i1) hide pipeline latency on superscalar CPUs. This is the same technique used in high-performance reduction kernels.
export func cqt_fused(
audio: *f32,
cos_table: *f32,
sin_table: *f32,
offsets: *i32, // per-bin start offset into twiddle tables
lengths: *i32, // per-bin window length
prev_mags: *f32,
out_mags: *mut f32,
alpha: f32, // decay factor
n_bins: i32,
max_window_len: i32
) {
let mut k: i32 = 0
while k < n_bins {
let off: i32 = offsets[k]
let win_len: i32 = lengths[k]
let audio_start: i32 = max_window_len - win_len
let mut r0: f32x8 = splat(0.0)
let mut r1: f32x8 = splat(0.0)
let mut i0: f32x8 = splat(0.0)
let mut i1: f32x8 = splat(0.0)
let mut i: i32 = 0
while i + 16 <= win_len {
prefetch(cos_table, off + i + 64)
prefetch(sin_table, off + i + 64)
let s0: f32x8 = load(audio, audio_start + i)
let s1: f32x8 = load(audio, audio_start + i + 8)
let c0: f32x8 = load(cos_table, off + i)
let c1: f32x8 = load(cos_table, off + i + 8)
let sn0: f32x8 = load(sin_table, off + i)
let sn1: f32x8 = load(sin_table, off + i + 8)
r0 = fma(s0, c0, r0)
r1 = fma(s1, c1, r1)
i0 = fma(s0, sn0, i0)
i1 = fma(s1, sn1, i1)
i = i + 16
}
// 8-element and scalar tails omitted for brevity
let real: f32 = reduce_add(r0 .+ r1)
let imag: f32 = reduce_add(i0 .+ i1)
let mag: f32 = sqrt(real * real + imag * imag)
// Fused smooth decay
let decayed: f32 = prev_mags[k] * alpha
if mag > decayed {
out_mags[k] = mag
} else {
out_mags[k] = decayed
}
k = k + 1
}
}
The generated assembly for the hot loop is clean — the Eä compiler folds twiddle loads directly into vfmadd231ps memory operands, so the audio loads (vmovups) are the only explicit loads:
.LBB0_4:
prefetcht0 (%rsi,%r12,4) ; prefetch cos_table
prefetcht0 (%rdx,%r12,4) ; prefetch sin_table
vmovups (%rdi,%r12,4), %ymm5 ; load audio[i..i+7]
vmovups (%rdi,%r12,4), %ymm6 ; load audio[i+8..i+15]
vfmadd231ps (%rsi,%r12,4), %ymm5, %ymm3 ; r0 += audio * cos
vfmadd231ps (%rsi,%rbp,4), %ymm6, %ymm4 ; r1 += audio * cos
vfmadd231ps (%rdx,%r12,4), %ymm5, %ymm2 ; i0 += audio * sin
vfmadd231ps (%rdx,%rbp,4), %ymm6, %ymm1 ; i1 += audio * sin
Four FMAs per iteration, each operating on 8 floats: 32 multiply-adds (64 floating-point operations) per loop iteration.
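A scalar model of the kernel is useful for validating the SIMD version, including the tails that the listing omits. The sketch below is a pure-Python reading of cqt_fused, not the shipped reference; `cqt_fused_ref` is a hypothetical name and the tiny tables are made up for the example.

```python
import math

def cqt_fused_ref(audio, cos_table, sin_table, offsets, lengths,
                  prev_mags, alpha, max_window_len):
    """Scalar model of cqt_fused: per-bin dot products, magnitude, fused decay."""
    out = []
    for k in range(len(lengths)):
        off, win_len = offsets[k], lengths[k]
        start = max_window_len - win_len          # windows are right-aligned
        real = imag = 0.0
        for i in range(win_len):
            s = audio[start + i]
            real += s * cos_table[off + i]
            imag += s * sin_table[off + i]
        mag = math.sqrt(real * real + imag * imag)
        out.append(max(mag, prev_mags[k] * alpha))  # same as the if/else in the kernel
    return out

# Two made-up bins with window lengths 4 and 2, packed back to back in the tables.
result = cqt_fused_ref(
    audio=[1.0, 1.0, 1.0, 1.0],
    cos_table=[1.0] * 6, sin_table=[0.0] * 6,
    offsets=[0, 4], lengths=[4, 2],
    prev_mags=[10.0, 0.0], alpha=0.5, max_window_len=4,
)
print(result)   # [5.0, 2.0]
```

Bin 0 keeps the decayed previous value (10 × 0.5 beats a magnitude of 4); bin 1 takes the new magnitude of 2. Comparing this model against the compiled kernel on random inputs, within an f32 tolerance, is one way to catch tail-handling bugs.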
First Benchmark: The Lie
Our first benchmark compared this against a Python for-loop CQT:
def numpy_cqt(audio, freqs, max_window_len):
    magnitudes = np.empty(N_BINS, dtype=np.float32)
    for k in range(N_BINS):
        n_k = int(np.ceil(Q * SAMPLE_RATE / freqs[k]))
        i = np.arange(n_k)            # sample index within this bin's window
        start = max_window_len - n_k  # right-aligned, assumed to match the Eä kernel's audio_start
        segment = audio[start:start + n_k]
        window = 0.5 * (1 - np.cos(2 * np.pi * i / n_k))
        kernel = window * np.exp(-2j * np.pi * freqs[k] * i / SAMPLE_RATE)
        magnitudes[k] = np.abs(np.sum(segment * kernel))
    return magnitudes
Result: 153x faster. We almost shipped this number.
The problem: this "NumPy CQT" is a Python for-loop with per-bin allocations. Nobody would write production CQT this way. The honest NumPy approach is to precompute a (84, 11339) complex kernel matrix and do a single np.dot:
# Precompute once: (n_bins, max_window_len) complex64 matrix
kernel_matrix = np.zeros((N_BINS, max_window_len), dtype=np.complex64)
for k in range(N_BINS):
# ... fill in windowed complex exponentials, zero-pad shorter bins
# Per frame: single BLAS call
magnitudes = np.abs(kernel_matrix @ audio)
This calls directly into OpenBLAS/MKL — decades of hand-tuned assembly for matrix-vector multiply.
Honest Benchmark v1
Methodology:
- All competitors precompute everything they can (tables, windows, indices)
- Only per-frame work is timed: transform + magnitude + smoothing
- Same smoothing (max(current, previous * alpha)) applied to all
With N=500 iterations and inadequate warmup:
| Implementation | Time |
|---|---|
| Eä CQT | 0.098 ms |
| NumPy matmul | 0.085 ms |
NumPy was winning. The 153x headline collapsed to Eä being 1.1x slower.
Fixing the Benchmark
The N=500 result was noisy. WSL2 scheduling and insufficient warmup made the numbers unreliable. We moved to a min-of-trials methodology: 10 independent trials of 2,000 iterations each, reporting both min (best-case, no OS interference) and median (real-world scheduling).
With proper warmup (200+ iterations before timing):
| Implementation | Min | Median |
|---|---|---|
| Eä CQT | 50 us | 55 us |
| NumPy matmul | 95 us | 190 us |
Eä was already 1.9x faster at best, 3.5x at median. The initial "1.1x slower" was just noise.
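The min-of-trials methodology is simple enough to sketch with the standard library. The defaults below follow the numbers in this post (10 trials, 2,000 iterations, 200 warmup iterations per trial); the function name `bench` and the toy workload are placeholders, not the project's benchmark script.

```python
import statistics
import time

def bench(fn, trials=10, iters=2000, warmup=200):
    """Return (min, median) per-call time in microseconds across trials."""
    per_call = []
    for _ in range(trials):
        for _ in range(warmup):                   # warm caches and clocks first
            fn()
        t0 = time.perf_counter()
        for _ in range(iters):
            fn()
        per_call.append((time.perf_counter() - t0) / iters * 1e6)
    return min(per_call), statistics.median(per_call)

best, typical = bench(lambda: sum(range(100)), trials=5, iters=200, warmup=50)
print(f"min {best:.2f} us, median {typical:.2f} us")
```

Reporting both numbers matters: the minimum approximates the cost with no OS interference, while the median shows what a caller sees under real scheduling.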
Optimization Attempts
Quad Accumulators
Theory: 4 independent FMA chains (8 accumulator registers) should hide more pipeline latency than 2 chains.
Result: slower (100 us vs 83 us). With 8 YMM accumulator registers + 12 data registers for loads, we exceed x86-64's 16 YMM register limit. The compiler spills to stack memory, killing the very latency hiding we wanted.
Interleaved Loads
Theory: load cos, immediately FMA, then load sin and FMA. Give the memory subsystem more time to fetch by interleaving loads with compute.
Result: slower (191 us vs 53 us). The original pattern — batch all loads, then batch all FMAs — is better for out-of-order execution. The CPU's reorder buffer can schedule loads far ahead of their consumers when they're grouped together.
Merged Twiddle Table
Theory: interleave cos and sin data for each bin into a single contiguous memory region [cos_bin0 | sin_bin0 | cos_bin1 | ...] to halve cache line traffic.
Result: no change (53.8 us vs 53.1 us). The hardware prefetcher already handles two sequential streams efficiently. The second stream (sin_table) gets prefetched in parallel with the first (cos_table) because the access pattern is identical — sequential scan with the same stride.
In-Place Smoothing
Instead of writing to out_mags and copying back to prev_mags in Python, the kernel reads and writes the same buffer:
export func cqt_inplace(
audio: *f32,
cos_table: *f32,
sin_table: *f32,
offsets: *i32,
lengths: *i32,
mags: *mut f32, // read previous, write smoothed result
alpha: f32,
n_bins: i32,
max_window_len: i32
)
Result: ~5% faster (47.8 us vs 50.2 us). Modest win from eliminating a 336-byte numpy copy per frame plus one Python function call.
Final Numbers
Methodology: 10 trials of 2,000 iterations each, 200-iteration warmup per trial, min-of-trials reported. All precomputation excluded. All competitors include smoothing. NumPy uses pre-allocated output buffers (out= parameter) to avoid allocation overhead.
| Implementation | Min | Median |
|---|---|---|
| Eä CQT (fused SIMD) | ~50 us | ~55 us |
| NumPy CQT (BLAS matmul) | ~95 us | ~190 us |
| NumPy FFT (wrong result) | ~106 us | ~118 us |
- Eä vs BLAS: 1.9x faster (min), 3.5x (median)
- Eä vs FFT: 2.1x faster, and FFT gives the wrong answer
- Throughput: 7.8 GFLOP/s
Why Eä Wins
1. Eä reads 4.8x less data. The BLAS matmul uses a dense (84, 11339) complex64 matrix (84 × 11,339 × 8 bytes ≈ 7.6 MB per frame), including all the zero-padded regions where shorter bins have no data. Eä's variable-length inner loops skip zeros entirely, touching only 200,476 real elements in each of the cos and sin tables (1.6 MB in total).
2. Real vs complex arithmetic. BLAS operates on complex64 (pairs of float32). Every BLAS FMA processes a complex multiply-add — 4 real multiplies and 2 real adds per element. Eä works directly in float32, doing 2 real FMAs per element (one for the cos component, one for sin).
3. Fused pipeline. Eä does window + transform + magnitude + smoothing in one function call. NumPy requires 4 separate calls: matmul, abs, maximum, copyto. Each call crosses the Python/C boundary, allocates or writes to buffers, and makes a separate pass over the output data.
4. Lower scheduling jitter. One ctypes call has less interrupt surface than four NumPy calls. This explains the median gap (3.5x) being larger than the min gap (1.9x) — Eä's single kernel call is less likely to be interrupted mid-computation.
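The data-volume comparison in point 1 is plain arithmetic and can be checked directly. The sketch assumes 200,476 is the sum of the per-bin window lengths and that the twiddle data is two float32 tables (cos and sin); audio reads are left out on both sides.

```python
N_BINS = 84
MAX_WINDOW_LEN = 11_339
TOTAL_WINDOW_SAMPLES = 200_476                     # sum of per-bin window lengths (from the text)

dense_bytes = N_BINS * MAX_WINDOW_LEN * 8          # complex64 = 8 bytes per entry
tables_bytes = TOTAL_WINDOW_SAMPLES * 4 * 2        # float32 cos table + float32 sin table

print(f"dense matrix:   {dense_bytes / 1e6:.2f} MB")      # 7.62 MB
print(f"twiddle tables: {tables_bytes / 1e6:.2f} MB")     # 1.60 MB
print(f"ratio:          {dense_bytes / tables_bytes:.2f}x")  # 4.75x
```

The ratio rounds to the 4.8x quoted above; most of the dense matrix is zero padding for the short high-frequency windows.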
Why Eä Doesn't Win More
The dual-accumulator inner loop is already close to optimal for AVX2. The generated assembly has 4 vfmadd231ps instructions per iteration with memory operands — the compiler is folding loads into FMAs. There's no instruction waste.
The bottleneck is memory bandwidth. At 50 us for 1.6 MB of twiddle table reads, we're pulling ~32 GB/s — respectable for L3 cache, but below the theoretical memory bandwidth. The per-bin overhead (loading offsets/lengths, setting up accumulators, horizontal reduction) fragments what could be one continuous stream into 84 separate sweeps.
BLAS doesn't have this problem. Its matmul is one uninterrupted sweep through a dense matrix, which modern CPUs and prefetchers are specifically optimized for. BLAS trades compute (touching zeros) for access pattern regularity — and on some runs, when the OS cooperates perfectly, that trade nearly pays off.
The Code
Three files. The full kernel (cqt.ea, 80 lines), the Python visualizer, and the benchmark. The kernel uses the language as intended: explicit types, explicit vector widths, explicit memory access. No hidden costs.
import ea
import numpy as np
kernel = ea.load("cqt.ea")
# Precompute tables once
freqs = F_MIN * 2 ** (np.arange(84) / 12)
lengths = np.ceil(Q * SAMPLE_RATE / freqs).astype(np.int32)
# ... build cos/sin twiddle tables with baked-in Hann window ...
# Per frame: one call, ~50 us
kernel.cqt_fused(
audio_buffer, cos_table, sin_table,
offsets, lengths, prev_mags, out_mags,
decay_alpha, n_bins, max_window_len,
)
The full source is on GitHub.
What We Learned
The first benchmark said 153x. That was a lie — comparing optimized SIMD against a Python for-loop. The second benchmark said 1.1x slower. That was noise — insufficient warmup on a noisy WSL2 scheduler. The real number is 1.9x faster, earned by reading less data, using simpler arithmetic, and fusing the pipeline.
Three optimization attempts failed (quad accumulators, interleaved loads, merged tables) and one gave a modest 5% win (in-place smoothing). The lesson: when the compiler is already generating clean assembly, micro-optimizations rarely help. The wins came from the algorithm design — variable-length windows, baked-in windowing, fused smoothing — not from squeezing the inner loop.
Eä didn't win by being a faster compiler. It won by making it easy to write the right algorithm. A kernel that skips zeros, fuses four operations, and makes one function call will beat a kernel that touches every element of a zero-padded matrix and makes four function calls — even when the latter is backed by BLAS.
Searching Encrypted Logs Without Decrypting Them (Much)
Or: what happens when you let a SIMD compiler loose on a crypto problem.
The Problem That Shouldn't Exist
You have 100 GB of encrypted log files. Something is on fire. You need to find every line containing "ERROR".
Here are your options:
Option A: The "correct" way. Decrypt the entire file to /tmp. Grep it. Delete the decrypted copy. Hope nobody read it while it was sitting there. Also hope your 50 GB /tmp partition can handle it. Also hope you remembered to shred instead of rm.
Option B: The galaxy brain way. Use Fully Homomorphic Encryption to search directly on ciphertext. Wait approximately until the heat death of the universe. Receive your answer. Realize the fire already consumed the building.
Option C: What we actually built. Decrypt 4 KB at a time into a tiny buffer. Search it. Zero the buffer. Move on. The plaintext exists for about as long as a TikTok attention span.
We went with Option C.
The "Fusion" Trick
Here's the thing about ChaCha20 — it's a stream cipher. You XOR a keystream with your plaintext, byte by byte. Decryption is the same operation: XOR the keystream with ciphertext, get plaintext back.
So why not XOR and search at the same time?
Standard pipeline:
decrypt(ciphertext) → plaintext_file → grep(plaintext_file) → results
Memory: 100 GB plaintext on disk. Two passes over data.
Fused pipeline:
for each 4 KB chunk:
decrypt → search → extract lines → zero buffer
Memory: 4 KB. One pass. Plaintext never on disk.
This is called loop fusion — cramming multiple operations into a single loop so you only read the data once. It's the same trick that makes a * b + c with FMA faster than doing a * b then + c separately. Except here we're fusing cryptographic decryption with string search.
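The fused loop can be sketched in Python to show its structure. Everything cryptographic here is a stand-in: `toy_keystream` is SHA-256 in counter mode, NOT ChaCha20, and must not be used for real encryption. The function names are hypothetical, and carrying the last len(needle) − 1 bytes between chunks (so matches that straddle a chunk boundary are still found) is a design choice of this sketch, not something the real kernel is documented to do.

```python
import hashlib

def toy_keystream(key: bytes, offset: int, length: int) -> bytes:
    """Toy keystream: SHA-256 in counter mode. NOT ChaCha20, illustration only."""
    first = offset // 32
    last = (offset + length + 31) // 32
    stream = b"".join(hashlib.sha256(key + b.to_bytes(8, "little")).digest()
                      for b in range(first, last))
    skip = offset - first * 32
    return stream[skip:skip + length]

def toy_encrypt(data: bytes, key: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, toy_keystream(key, 0, len(data))))

def fused_search(ct: bytes, needle: bytes, key: bytes, chunk: int = 4096):
    """Decrypt one chunk at a time into a small buffer, search it, then zero it."""
    carry = len(needle) - 1                 # bytes kept so boundary matches survive
    buf = bytearray(carry + chunk)          # the only place plaintext lives
    tail = 0                                # carried bytes currently at the front
    hits = []
    for pos in range(0, len(ct), chunk):
        block = ct[pos:pos + chunk]
        ks = toy_keystream(key, pos, len(block))
        for j, (c, s) in enumerate(zip(block, ks)):
            buf[tail + j] = c ^ s           # decrypt in place
        end = tail + len(block)
        i = buf.find(needle, 0, end)
        while i != -1:
            hits.append(pos - tail + i)     # buffer index to file offset
            i = buf.find(needle, i + 1, end)
        keep = min(carry, end)
        saved = bytes(buf[end - keep:end])
        buf[:] = bytes(len(buf))            # zero the whole buffer
        buf[:keep] = saved                  # re-seed only the short carry region
        tail = keep
    return hits

key = b"k" * 32
# "ERROR" starts at offset 4093 and straddles the first 4 KB boundary.
pt = b"INFO ok\n" * 511 + b"INFO " + b"ERROR disk full\n" + b"INFO ok\n" * 20
print(fused_search(toy_encrypt(pt, key), b"ERROR", key))   # [4093]
```

Python cannot promise that zeroed memory never leaked into a temporary, so this only illustrates the control flow; the guarantees discussed in this post come from the compiled kernel.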
And because we're writing in Eä — a language where you control the SIMD instructions directly — the "search" part isn't some scalar byte-by-byte comparison. It's the same algorithm glibc uses in memmem:
// For each 16-byte chunk of decrypted plaintext:
let bits: i32 = movemask(chunk .== splat(first_byte))
if bits == 0 {
skip // no match possible in these 16 bytes
}
// else: verify the candidate positions
vpcmpeqb + vpmovmskb. Compare 16 bytes at once, get a bitmask of hits. If the bitmask is zero, skip. That's it. Most chunks contain zero instances of the letter 'E', so most chunks are skipped entirely.
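The first-byte filter can be modelled in Python to show the control flow. This is a scalar emulation with hypothetical names: `movemask_eq` stands in for the vpcmpeqb + vpmovmskb pair, and the 16-byte width matches the description above.

```python
def movemask_eq(chunk: bytes, byte: int) -> int:
    """Bit j is set when chunk[j] == byte (models vpcmpeqb + vpmovmskb)."""
    bits = 0
    for j, b in enumerate(chunk):
        if b == byte:
            bits |= 1 << j
    return bits

def find_all(data: bytes, needle: bytes, width: int = 16):
    hits = []
    for base in range(0, len(data), width):
        bits = movemask_eq(data[base:base + width], needle[0])
        while bits:                                  # zero mask: whole chunk skipped
            j = (bits & -bits).bit_length() - 1      # index of the lowest set bit
            pos = base + j
            if data[pos:pos + len(needle)] == needle:   # verify the candidate
                hits.append(pos)
            bits &= bits - 1                         # clear the lowest set bit
    return hits

log = b"INFO ok\nERROR disk full\nINFO ok\nERROR again\n"
print(find_all(log, b"ERROR"))   # [8, 32]
```

The cheap mask test runs on every chunk; the full comparison runs only at positions where the first byte already matched.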
Multi-Needle: The Actually Useful Part
Searching for one string is cute. Searching for ["ERROR", "FATAL", "PANIC"] without decrypting three times — that's useful.
The v2 kernel takes multiple needles and ORs their bitmasks:
bits = 0
for each unique first-byte:
bits = bits | movemask(chunk .== splat(first_byte))
if bits == 0:
skip // none of the needles start with any byte in this chunk
One decryption pass. Three patterns. All the matched lines extracted with \n-boundary detection using the same SIMD trick.
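The multi-needle filter extends the same idea: one mask per unique first byte, combined with OR, then a verify step that also reports which needle matched. The sketch below is a scalar Python model with hypothetical names, not the v2 kernel itself, and it leaves out line extraction.

```python
def movemask_eq(chunk: bytes, byte: int) -> int:
    """Bit j is set when chunk[j] == byte."""
    bits = 0
    for j, b in enumerate(chunk):
        if b == byte:
            bits |= 1 << j
    return bits

def multi_search(data: bytes, needles, width: int = 16):
    first_bytes = sorted({n[0] for n in needles})    # unique first bytes
    hits = []
    for base in range(0, len(data), width):
        chunk = data[base:base + width]
        bits = 0
        for fb in first_bytes:
            bits |= movemask_eq(chunk, fb)           # OR the per-byte masks
        while bits:                                  # zero mask: skip the chunk
            j = (bits & -bits).bit_length() - 1
            pos = base + j
            for needle_id, n in enumerate(needles):
                if data[pos:pos + len(n)] == n:
                    hits.append((pos, needle_id))
            bits &= bits - 1
    return hits

log = b"INFO ok\nERROR disk full\nFATAL crash\n"
print(multi_search(log, [b"ERROR", b"FATAL"]))   # [(8, 0), (24, 1)]
```

The 'F' in "INFO" produces a candidate that the verify step rejects, which is the expected cost of filtering on a single byte.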
The result:
from eachacha import encrypt, search
key = bytes(range(32))
nonce = bytes(12)
ct = encrypt(b"INFO ok\nERROR disk full\nFATAL crash\n", key, nonce)
result = search(ct, [b"ERROR", b"FATAL"], key, nonce)
for i in range(result.match_count):
print(f"[{result.needle_ids[i]}] {result.lines[i]}")
# [0] b'ERROR disk full'
# [1] b'FATAL crash'
Three lines of Python. The search() function does the rest — picks the right kernel, allocates buffers, packs needles, calls the SIMD kernel, returns structured results.
The Numbers
AMD EPYC 9354P, 2 vCPUs, 64 MB test data:
Single needle:
| Implementation | GB/s |
|---|---|
| Eä fused decrypt+search | 1.28 |
| Decrypt then C memmem (two passes) | 0.96 |
| C memmem on plaintext (no decrypt) | 2.22 |
Three needles:
| Implementation | GB/s |
|---|---|
| Eä v2 multi-needle (one pass) | 0.52 |
| Eä v1 called 3 times (three passes) | 0.41 |
| C memmem x3 on plaintext | 0.78 |
The fused approach is 1.34× faster than decrypt-then-search for single needle, and 1.28× faster than three separate searches for multi-needle. Not because our search is faster — it's because we only decrypt once.
And yes, 0.52 GB/s on encrypted data is 67% of what C memmem achieves on plaintext. We're paying the ChaCha20 tax, but the fusion rebate is generous.
What We're NOT Claiming
Let's be honest about the security model, because someone on Hacker News will definitely yell at me if I'm not:
| What we guarantee | What we don't claim |
|---|---|
| Plaintext never on disk | "Zero RAM exposure" |
| 4 KB buffer, zeroed per window | Equivalent to FHE |
| Only match offsets + lines leave the kernel | Side-channel resistance |
| 400-million-fold reduction in exposure surface | That this is a good idea for nuclear launch codes |
This is a practical middle ground between "decrypt everything to a temp file" (what everyone does today) and "compute on ciphertext" (what nobody can afford to do today). It's not cryptographically novel. It's an engineering choice: minimize exposure, maximize throughput, accept the 4 KB window.
Also, no special hardware required. No SGX enclaves. No trust assumptions about chip manufacturers. Just software and a CPU that can XOR things fast.
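To make the "4 KB window" idea concrete, here is a minimal Python sketch of the control flow: decrypt one window, search it, record offsets, zero the buffer, move on. Everything in it is an assumption for illustration. `toy_keystream` is a SHA-256 counter-mode stand-in, not ChaCha20, and the function names are hypothetical, not the eachacha API. Python also makes temporary copies it cannot reliably erase, so this shows the shape of the loop, not its security properties.

```python
import hashlib

WINDOW = 4096  # plaintext window size in bytes (the "4 KB buffer")

def toy_keystream(key: bytes, nonce: bytes, offset: int, length: int) -> bytes:
    """Stand-in keystream (SHA-256 in counter mode). NOT ChaCha20, illustration only."""
    out = bytearray()
    block = offset // 32
    skip = offset % 32
    while len(out) < skip + length:
        out += hashlib.sha256(key + nonce + block.to_bytes(8, "little")).digest()
        block += 1
    return bytes(out[skip:skip + length])

def xor_stream(data: bytes, key: bytes, nonce: bytes, offset: int = 0) -> bytes:
    """Encrypt or decrypt (same operation) starting at a byte offset in the stream."""
    ks = toy_keystream(key, nonce, offset, len(data))
    return bytes(d ^ k for d, k in zip(data, ks))

def windowed_search(ct: bytes, needle: bytes, key: bytes, nonce: bytes) -> list[int]:
    if not needle:
        raise ValueError("needle must be non-empty")
    overlap = len(needle) - 1          # lets a match straddle a window boundary
    buf = bytearray(WINDOW + overlap)  # the only place plaintext is meant to live
    offsets = []
    for pos in range(0, len(ct), WINDOW):
        chunk = ct[pos:pos + WINDOW + overlap]
        n = len(chunk)
        buf[:n] = xor_stream(chunk, key, nonce, pos)
        idx = buf.find(needle, 0, n)
        while idx != -1:
            offsets.append(pos + idx)  # only offsets leave the loop
            idx = buf.find(needle, idx + 1, n)
        buf[:] = bytes(len(buf))       # zero the window before moving on
    return offsets

key, nonce = bytes(range(32)), bytes(12)
pt = b"A" * 4094 + b"ERROR" + b"B" * 100   # needle straddles the 4096-byte boundary
ct = xor_stream(pt, key, nonce)
print(windowed_search(ct, b"ERROR", key, nonce))  # [4094]
```

Each window is extended by `len(needle) - 1` bytes so a match that crosses a boundary is still seen exactly once.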
The Autoresearch Detour
We tried to make it faster. The Eä project has an automated optimization loop (autoresearch) that iteratively modifies kernels and benchmarks them:
Iteration 1: Remove buffer zeroing → 0% improvement (zeroing isn't the bottleneck)
Iteration 2: unroll(10) on round loops → parse error (oops)
Iteration 3: 2-block ILP instead of 4-block → 24% SLOWER
Iteration 4: Inline rotation functions → 5% slower (LLVM already inlines them)
Iteration 5: Timeout (the kernel is 866 lines, agent couldn't finish)
Total improvement from 5 iterations: 0%.
The kernel is compute-bound. ChaCha20's 20 rounds of quarter-round operations dominate. The search and line extraction are fast — the bottleneck is the math. Which is honestly a good place to be: it means we're not leaving performance on the table.
2,098 Lines
That's the total across all four kernels:
| Kernel | Lines | What it does |
|---|---|---|
| chacha20.ea | 272 | Encrypt (1.78 GB/s) |
| chacha20_fused.ea | 384 | Encrypt + stats in one pass |
| chacha20_search.ea | 576 | Single-needle encrypted search |
| chacha20_search_v2.ea | 866 | Multi-needle + context lines |
For comparison, OpenSSL's ChaCha20 implementation is somewhere north of 100,000 lines of C and assembly. Ours passes the same RFC 7539 test vectors and cross-verifies byte-for-byte with OpenSSL.
I'm not saying we're better than OpenSSL. OpenSSL in native C with AVX-512 would destroy us. But we're 272 lines that a human can read in 10 minutes, and we still hit 1.78 GB/s. That's the trade-off.
Try It
pip install eachacha
from eachacha import encrypt, search
key = bytes(range(32))
nonce = bytes(12)
ct = encrypt(b"your secret logs here", key, nonce)
result = search(ct, b"secret", key, nonce)
print(result.offsets) # [5]
Or build from source for native CPU optimization:
pip install ea-compiler
git clone https://github.com/petlukk/eachacha
cd eachacha && ./build.sh && pip install -e .
Benchmarks measured on 2 vCPUs. Your numbers will be different, probably better. The ratios hold. Code, benchmarks, and 109 tests on GitHub.
Eä is open-source. The compiler is ~12,000 lines of Rust. Docs. GitHub.
How 30 Lines of Eä Beat NumPy by 6×
And why your framework is probably slower than a for-loop.
The Pitch
Here's a fused multiply-add: out[i] = a[i] * b[i] + c[i]. Sixteen million times.
NumPy does it in 46 milliseconds. Eä does it in 7 milliseconds. That's 6.6× faster.
Here's the kicker: the Eä kernel is 30 lines. The NumPy version is one line. And the one-liner loses because it's actually two lines pretending to be one.
The NumPy Version
out = a * b + c
Simple. Elegant. Two full scans of 64 MB of data.
NumPy computes a * b first, writes 64 MB to a temporary array, then reads it back to add c. Counting reads and writes, that's 384 MB of memory traffic (six 64 MB array transfers) for 256 MB of actual data (a, b, c, and out). Every element gets loaded, stored, loaded again, and stored again.
On a modern CPU at ~35 GB/s memory bandwidth, that's a floor of about 11 milliseconds just for memory. NumPy hits 46 ms because it also has Python dispatch overhead, array allocation, and, crucially, it can't fuse the multiply and add into a single instruction.
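The traffic count is easy to redo by hand. This small sketch tallies array reads and writes for both approaches, assuming 16M float32 elements, no cache reuse (the arrays are far larger than cache), and the ~35 GB/s bandwidth figure used above. `floor_ms` is a hypothetical helper name.

```python
N = 16_000_000          # elements per array
ARRAY_MB = N * 4 / 1e6  # float32 -> 64.0 MB per array

# NumPy `a * b + c`: pass 1 reads a, b and writes tmp; pass 2 reads tmp, c and writes out.
numpy_mb = 6 * ARRAY_MB
# A fused kernel reads a, b, c and writes out exactly once.
fused_mb = 4 * ARRAY_MB

def floor_ms(traffic_mb: float, bandwidth_gbps: float) -> float:
    """Lower bound on runtime if memory bandwidth were the only cost."""
    return traffic_mb / bandwidth_gbps  # MB / (GB/s) == milliseconds

print(numpy_mb, fused_mb)                  # 384.0 256.0
print(round(floor_ms(numpy_mb, 35.0), 1))  # 11.0
print(round(floor_ms(fused_mb, 35.0), 1))  # 7.3
```

The gap between the two floors is the cost of the temporary array alone, before any dispatch or allocation overhead.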
The Eä Version
export func fma_f32x8(
    a: *restrict f32,
    b: *restrict f32,
    c: *restrict f32,
    out: *mut f32,
    n: i32
) {
    let mut i: i32 = 0
    while i + 16 <= n {
        let va0: f32x8 = load(a, i)
        let vb0: f32x8 = load(b, i)
        let vc0: f32x8 = load(c, i)
        store(out, i, fma(va0, vb0, vc0))
        let va1: f32x8 = load(a, i + 8)
        let vb1: f32x8 = load(b, i + 8)
        let vc1: f32x8 = load(c, i + 8)
        store(out, i + 8, fma(va1, vb1, vc1))
        i = i + 16
    }
    while i < n {
        out[i] = a[i] * b[i] + c[i]
        i = i + 1
    }
}
One pass. Load a, b, c, fuse multiply-add in a single vfmadd instruction, store result. Each element is touched once. Total memory traffic: 256 MB (same 4 arrays), but only one scan instead of two.
The *restrict tells LLVM the arrays don't alias, enabling aggressive optimization. The f32x8 processes 8 floats per instruction on AVX2. The 2× unroll (processing 16 elements per iteration) hides memory latency.
That's it. No magic. Just not doing twice the work.
"But I'll Just Use Ray"
We benchmarked that too. For science.
| Method | Time | Throughput | vs NumPy |
|---|---|---|---|
| NumPy (multiply + add) | 46,000 µs | 5.6 GB/s | baseline |
| Eä (1 thread) | 6,900 µs | 37.0 GB/s | 6.6× |
| Eä (2 threads) | 6,500 µs | 39.1 GB/s | 7.0× |
| Dask (2 chunks) | 56,000 µs | 4.6 GB/s | 0.8× |
| Ray (2 workers) | 89,000 µs | 2.9 GB/s | 0.5× |
Ray is twice as slow as NumPy. For FMA. On two cores. On the same machine.
Why? Serialization. Ray pickles your 64 MB arrays, sends them to worker processes, unpickles them, runs NumPy (which does two passes), pickles the results, and sends them back. The actual compute takes 46 ms. The overhead takes another 43 ms.
Dask is better — it chunks lazily and uses NumPy under the hood — but it still can't fuse operations or control SIMD width. It's paying for an abstraction layer over code that's already fast enough.
The uncomfortable truth: for single-machine numerical work, a for-loop that touches memory once will beat a distributed framework that touches memory twice.
The Dot Product Story
We weren't always winning. Our first benchmark had Eä's dot product at 0.27× of BLAS. Embarrassing.
// The naive version — DO NOT ship this
export func dot_naive(a: *f32, b: *f32, n: i32) -> f32 {
    let mut acc: f32 = 0.0
    let mut i: i32 = 0
    while i < n {
        acc = acc + a[i] * b[i]
        i = i + 1
    }
    return acc
}
Single scalar accumulator. Each iteration depends on the previous one. The CPU stalls waiting for the addition to complete before it can start the next multiply. Classic dependency chain bottleneck.
BLAS uses the trick every HPC programmer knows: multiple independent accumulators.
// The optimized version — ships in autoresearch/kernels/dot_product/
export func dot_f32x8(a: *restrict f32, b: *restrict f32, len: i32) -> f32 {
    let mut acc0: f32x8 = splat(0.0)
    let mut acc1: f32x8 = splat(0.0)
    let mut i: i32 = 0
    while i + 32 <= len {
        acc0 = fma(load(a, i), load(b, i), acc0)
        acc1 = fma(load(a, i + 8), load(b, i + 8), acc1)
        acc0 = fma(load(a, i + 16), load(b, i + 16), acc0)
        acc1 = fma(load(a, i + 24), load(b, i + 24), acc1)
        i = i + 32
    }
    // ... tail handling ...
    return reduce_add_fast(acc0 .+ acc1)
}
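Two f32x8 accumulators (acc0, acc1) with 4× unroll, so each iteration covers 32 elements. The two accumulator chains don't depend on each other, which keeps 16 FMA lanes (2 vectors × 8 floats) in flight instead of a single scalar add. The result: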
| Method | Time | GB/s | vs BLAS |
|---|---|---|---|
| NumPy BLAS sdot | 3,535 µs | 35.9 | baseline |
| Eä naive (scalar) | 13,222 µs | 9.7 | 0.27× |
| Eä f32x4 (1 acc) | 4,474 µs | 28.6 | 0.79× |
| Eä f32x8 (dual acc, 4× unroll) | 3,500 µs | 36.6 | 1.01× |
From 0.27× to 1.01× — by changing the loop structure, not the algorithm. Both versions compute the same dot product. The fast one just asks the CPU to think about 32 elements instead of 1.
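The loop-structure change can be shown in plain Python. It won't be any faster there, but the shape is the same: several independent partial sums, reduced once at the end. This is a simplified sketch under my own assumptions (2 accumulators of 8 lanes with no extra unrolling, hypothetical function names), not the shipped kernel.

```python
def dot_naive(a, b):
    """One accumulator: every add waits on the previous one."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc

def dot_multi_acc(a, b, lanes=8, n_acc=2):
    """Independent partial sums, like acc0/acc1 holding 8 lanes each."""
    accs = [[0.0] * lanes for _ in range(n_acc)]
    step = lanes * n_acc
    n = len(a)
    i = 0
    while i + step <= n:                     # main loop: full chunks only
        for k in range(n_acc):
            base = i + k * lanes
            for lane in range(lanes):
                accs[k][lane] += a[base + lane] * b[base + lane]
        i += step
    total = sum(sum(acc) for acc in accs)    # horizontal reduction at the end
    while i < n:                             # scalar tail for leftovers
        total += a[i] * b[i]
        i += 1
    return total

a = [float(i) for i in range(37)]
print(dot_naive(a, a), dot_multi_acc(a, a))  # 16206.0 16206.0
```

With real float data the two versions can differ in the last bits because the additions happen in a different order, which is presumably why the kernel spells the reduction reduce_add_fast.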
What You're Actually Seeing
This isn't "Eä is fast." Eä is a thin wrapper around LLVM. The FMA kernel compiles to about 12 assembly instructions in the inner loop. Eä's job is to make writing those 12 instructions feel like writing 30 lines of readable code instead of 200 lines of intrinsics.
The real insight is about NumPy's cost model:
| Operation | NumPy | Eä |
|---|---|---|
| a * b + c | 2 passes over data | 1 pass (fused FMA) |
| Temporary arrays | 1 allocation (64 MB) | 0 allocations |
| SIMD width | "whatever the compiler picks" | explicit f32x8 |
| Memory round-trips | 2 (multiply result → RAM → add) | 0 |
NumPy is fast per operation. But real workloads aren't one operation — they're pipelines. And every np. call is a full scan of your data.
When Eä Doesn't Win
Bandwidth-bound operations with a single arithmetic op per element. Scaling an array:
out = src * 2.5
NumPy: 5,631 µs. Eä: 6,138 µs. NumPy wins (by 8%).
Both saturate memory bandwidth at ~22 GB/s. There's nothing to fuse. One multiply per element loaded. No amount of SIMD cleverness makes DRAM faster.
This is the honest dividing line: if your inner loop does fewer than ~4 operations per element, NumPy is fast enough. If it does more, you're leaving performance on the table.
Getting Started
pip install ea-compiler
import ea
import numpy as np
kernel = ea.load("fma.ea")
a = np.random.rand(16_000_000).astype(np.float32)
b = np.random.rand(16_000_000).astype(np.float32)
c = np.random.rand(16_000_000).astype(np.float32)
out = np.zeros(16_000_000, dtype=np.float32)
kernel.fma_f32x8(a, b, c, out) # 7 ms instead of 46 ms
No Cython. No Numba. No C compiler. No JIT warmup. Write a .ea file, call ea.load(), pass NumPy arrays.
The compiler handles SIMD, the binding handles types, and the hardware handles the rest.
All benchmarks measured on a 2-core machine. Your numbers will be different, probably better (more cores = more bandwidth). The ratios hold. Source code and methodology on GitHub.
Eä is open-source (Apache 2.0). The compiler is ~12,000 lines of Rust with 475+ tests. Documentation. GitHub.
I Wrote a SIMD Compiler in 12K Lines of Rust
And then an LLM optimized the kernels better than I could.
The Elevator Pitch
Eä is a compiler for SIMD kernels. You write a small .ea file, run one command, and call it from Python like a normal function. But it runs at native vectorized speed.
import ea
kernel = ea.load("fma.ea")
result = kernel.fma_f32x8(a, b, c, out) # 6.6× faster than NumPy
No ctypes. No header files. No build system. The compiler generates the shared library and the Python wrapper. Also Rust, C++, PyTorch, and CMake bindings. It targets x86-64 (AVX2/AVX-512) and AArch64 (NEON).
The whole compiler is 12,000 lines of Rust. 475 tests. One person. I'm not a compiler engineer. But I had a very specific problem, and it turns out that's enough.
Why
I had a problem that kept repeating. I'd write something in Python, profile it, find a hot loop, and think: "this needs to be fast." And I knew that fast meant C. Fast meant SIMD. I don't have deep experience with either, but I knew that's where the performance lives.
So I'd fumble through some C code (with an LLM helping me write it, which honestly made it worse because now the code worked but I didn't fully understand why), fight with ctypes, spend an afternoon debugging pointer arithmetic that the AI got right the first time but I broke while "cleaning up," and eventually get a 5× speedup. Then next week, different project, same dance.
I didn't mind the hard part. Figuring out what the kernel should do, thinking about memory access patterns, deciding on vector widths. That's the interesting problem. What I minded was the plumbing. The header files. The build system. The ctypes declarations. The dtype validation. All of it boilerplate, all of it error-prone, none of it the actual work.
So I thought: what if a compiler could handle the plumbing? You write the kernel in a simple language, something that looks like the pseudocode you'd sketch on a whiteboard, and the compiler handles everything else. Compile to a shared library. Auto-generate the Python wrapper. One command. No Makefile.
The "no glue code" part turned out to be the product. Not the SIMD. Not the compiler. The fact that you go from .ea file to working Python function in one command, with types checked, lengths inferred, and output buffers allocated. That's what makes people actually use it instead of just reading about it.
I didn't know how to build a compiler. But I had the idea, and I wanted to see if it would work.
The First Attempt (10K Lines of Pain)
I should tell you about the compiler I wrote before this one.
It also targeted LLVM. It also generated SIMD code. The codegen was 10,000 lines in a single file. The parser was hand-written but handled way too many features. There were no hard limits on file size, no style rules, no test discipline. I kept adding things (generics, a module system, type inference) because why not?
The codebase became unmaintainable in about three months. I couldn't change anything without breaking something else. The codegen file had functions that called functions that called functions eight levels deep, and half of them were handling edge cases for features nobody used.
I threw it away.
The lesson: a SIMD kernel compiler doesn't need generics. It doesn't need modules. It doesn't need type inference. It needs to compile load, store, fma, and splat correctly, generate clean bindings, and stay small enough that one person can hold it in their head.
The Hard Rules
For the second attempt, I wrote the rules before I wrote the code:
- No file exceeds 500 lines. Split before you hit the limit.
- Every feature proven by end-to-end test. If it's not tested, it doesn't exist.
- No fake functions. If hardware doesn't support an operation, the compiler errors. No silent fallbacks.
- No premature features. Don't build what isn't needed yet.
- Delete, don't comment. Dead code gets removed.
The 500-line rule was the hardest to follow and the most valuable. It forced me to split the type checker into 7 files, the codegen into 10 files, and the parser into 4 files. Each file does one thing. When I need to change how store is type-checked, I open intrinsics_memory.rs (309 lines) and the answer is right there. No grepping through 10K lines of spaghetti.
Was it frustrating? Constantly. I'd be in the middle of adding a feature, hit 480 lines, and have to stop and refactor before I could finish. But every time I did, the code got better. The refactor always revealed something. A responsibility that should have been split earlier. A function that was doing two things.
The "no premature features" rule was the other hard one. I kept wanting to add generics. Or a module system. Or traits. And every time, I'd ask myself: does this serve the goal of compiling SIMD kernels to callable shared libraries? The answer was always no. Eä is monomorphic by design. You write f32x8, you get f32x8. No hidden specialization, no surprise codegen, no combinatorial explosion of type instances.
It's not a limitation. It's the point.
The Architecture
.ea → Lexer (logos) → Parser → Desugar → Type Check → Codegen (LLVM 18) → .o / .so
→ .ea.json → ea bind
The key insight: the desugarer is the most important pass.
Eä has a kernel construct that looks like this:
export kernel vscale(data: *f32, out: *mut f32, factor: f32)
    over i in n step 8
    tail scalar { out[i] = data[i] * factor }
{
    store(out, i, load(data, i) .* splat(factor))
}
The desugarer turns this into a plain function with a while-loop and a tail loop. After desugaring, there are no kernels in the AST. Just functions with loops. The type checker and codegen never see kernel. They only see func.
This means every downstream pass is simpler. The type checker doesn't need special kernel logic. The codegen doesn't need to handle iteration. The desugarer handles all of it: injecting the n parameter, generating the loop variable, building the main loop with step, building the tail loop with the chosen strategy.
The desugar pass is 340 lines. It eliminates an entire class of complexity from the remaining 11,660 lines.
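Here is roughly what that lowering amounts to, emulated in Python. This is my reading of the description above, not the desugarer's actual output: `run_kernel` and the list-based `vscale` are hypothetical stand-ins, with a slice of 8 playing the role of an f32x8.

```python
def run_kernel(n, step, vector_body, scalar_tail):
    """Shape of the loops a kernel lowers to: full vector steps, then a scalar tail."""
    i = 0
    while i + step <= n:  # main loop: only full `step`-wide chunks
        vector_body(i)
        i += step
    while i < n:          # tail loop: leftover elements one at a time
        scalar_tail(i)
        i += 1

def vscale(data, factor):
    n = len(data)         # plays the role of the injected `n` parameter
    out = [0.0] * n
    trace = []            # records which loop handled which index

    def body(i):
        out[i:i + 8] = [x * factor for x in data[i:i + 8]]
        trace.append(("vector", i))

    def tail(i):
        out[i] = data[i] * factor
        trace.append(("scalar", i))

    run_kernel(n, 8, body, tail)
    return out, trace

out, trace = vscale([float(i) for i in range(19)], 2.0)
print(trace)
# [('vector', 0), ('vector', 8), ('scalar', 16), ('scalar', 17), ('scalar', 18)]
```

Nineteen elements split into two full vector steps and three scalar leftovers, which is the case the tail strategy exists for.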
The Binding Generators
This is the part people don't expect. The compiler generates .ea.json metadata describing each exported function's signature. Then ea bind reads the JSON and generates idiomatic wrappers:
ea bind kernel.ea --python --rust --cpp --pytorch --cmake
The Python generator does something clever: length collapsing. If your kernel takes (data: *f32, n: i32), the generated Python function takes just data and fills n from data.size automatically. Output parameters marked with out get auto-allocated. The generated code checks dtypes, casts pointers, and handles all the ctypes plumbing that used to take me an afternoon.
Five binding generators, each 200-460 lines. No serde. The JSON parser is hand-written (65 lines). The generated code is clean enough that you can read it, modify it, and learn from it.
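To show what length collapsing and output auto-allocation mean in practice, here is a stdlib-only sketch of the kind of wrapper a generator could emit. It is not the code ea bind produces: `raw_scale` is a pure-Python stand-in for the native function, `make_wrapper` is hypothetical, and I'm assuming dst is the parameter marked out.

```python
from array import array

def raw_scale(src, dst, factor, n):
    """Stand-in for the native kernel: C-style signature with an explicit length."""
    for i in range(n):
        dst[i] = src[i] * factor

def make_wrapper(raw):
    def scale(src, factor):
        # dtype check: only float32 buffers are accepted
        if not isinstance(src, array) or src.typecode != "f":
            raise TypeError("src must be an array of float32 (typecode 'f')")
        n = len(src)                 # length collapsing: caller never passes n
        dst = array("f", [0.0]) * n  # output buffer allocated on the caller's behalf
        raw(src, dst, float(factor), n)
        return dst
    return scale

scale = make_wrapper(raw_scale)
print(list(scale(array("f", [1.0, 2.0, 3.0]), 2.0)))  # [2.0, 4.0, 6.0]
```

The caller-facing signature shrinks from four arguments to two, and a wrong dtype fails with a Python exception instead of corrupting memory.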
Error Messages (The Quiet Win)
I spent more time on error messages than on codegen.
The compiler has "did you mean?" suggestions with Levenshtein distance, dot-operator hints ("cannot use '+' on vectors, use '.+'"), let mut suggestions, type conversion hints, multi-character underlines showing the full expression span, and clear messages that never leak internal compiler state.
kernel.ea:5:12 error[type]: cannot use '+' on vectors. Use '.+' for element-wise vector operations
    return a + b
           ^^^^^
Nobody notices good error messages. But everyone notices bad ones. The difference between a user who gives up and a user who fixes their code is often one helpful error message.
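The "did you mean?" part is less work than it sounds. A minimal Python sketch of the technique, assuming a cutoff of 2 edits (the compiler is written in Rust and its actual threshold isn't stated here; `did_you_mean` is a hypothetical name):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if chars match)
            ))
        prev = cur
    return prev[-1]

def did_you_mean(name, known, max_distance=2):
    """Return the closest known name, or None if nothing is close enough."""
    best = min(known, key=lambda k: levenshtein(name, k), default=None)
    if best is not None and levenshtein(name, best) <= max_distance:
        return best
    return None

intrinsics = ["load", "store", "fma", "splat"]
print(did_you_mean("splt", intrinsics))   # splat
print(did_you_mean("xyzzy", intrinsics))  # None
```

The cutoff matters as much as the distance: suggesting something unrelated is worse than suggesting nothing.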
The Autoresearch System (The Fun Part)
This is where it gets interesting.
Inspired by Andrej Karpathy's autoresearch concept, I built an automated optimization loop: an LLM reads the kernel source, the benchmark results, and the history of what's been tried, then proposes a modified kernel. The system compiles it, benchmarks it across multiple data sizes (to catch cache-fitting illusions), verifies correctness against a C reference, and accepts or rejects the change. Then it iterates.
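Stripped of the LLM, the loop is a small accept/reject search. The sketch below is a toy under stated assumptions: the candidate is just an unroll factor, the proposer walks a fixed list instead of calling a model, and the GB/s figures in `FAKE_GBPS` are made up for the example, not measurements. All names are hypothetical.

```python
def autoresearch(baseline, propose, is_correct, benchmark, iterations):
    best, best_score = baseline, benchmark(baseline)
    history = []  # fed back to the proposer each round
    for _ in range(iterations):
        candidate = propose(best, history)
        if not is_correct(candidate):      # correctness gate comes first
            history.append((candidate, None, "rejected: wrong output"))
            continue
        score = benchmark(candidate)
        if score > best_score:             # keep only strict improvements
            best, best_score = candidate, score
            history.append((candidate, score, "accepted"))
        else:
            history.append((candidate, score, "rejected: not faster"))
    return best, best_score, history

# Toy stand-ins (invented numbers, for illustration only)
FAKE_GBPS = {4: 33.0, 8: 35.0, 12: 37.0, 16: 36.0}
UNROLLS = [8, 12, 16, 2]

def propose(best, history):
    return UNROLLS[len(history) % len(UNROLLS)]

def is_correct(unroll):
    return unroll in FAKE_GBPS  # pretend unroll=2 produces wrong output

def benchmark(unroll):
    return FAKE_GBPS[unroll]

best, score, history = autoresearch(4, propose, is_correct, benchmark, 4)
print(best, score)  # 12 37.0
```

The history list is the part that matters for an LLM proposer: rejected attempts are recorded along with the reason, so the next proposal can avoid repeating them.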
This is where I had the most fun building Eä.
The first time I ran it on the FMA kernel, it found a 10% improvement in 30 iterations. I thought the kernel was already as good as it could get. The LLM found that 12× unrolling with stream stores beat 4× unrolling with regular stores at DRAM scale. I wouldn't have tried that. It sounds like overkill.
Then I let it run on the matrix multiplication kernel. 56% improvement. It switched from ijk to ikj loop order with 8× k-unrolling. I've heard of loop interchange and tiling. I couldn't have told you when to apply them. The LLM didn't need to "know." It just tried it and the benchmark said yes.
The thing that surprised me most: you think you have an optimal kernel. You let the LLM iterate 5 times and it finds 20% improvement. Okay, fine, maybe it wasn't optimal. So you let it iterate 50 times on the already-improved kernel. And it still finds improvements. The search space for kernel optimization is bigger than your intuition.
27 benchmark kernels, all scored on largest-size (real-world) data with GB/s bandwidth metrics. The system includes bottleneck classification that tells the LLM whether a kernel is DRAM-bound (don't bother with compute tricks), compute-bound (try wider SIMD, more accumulators), or mixed. The biggest wins:
| Kernel | Improvement | What Changed |
|---|---|---|
| Bitonic sort | 97% | Replaced O(n²) Shellsort with sorting network |
| Matmul | 56% | k×8 unroll, cache-friendly access |
| Conv2d 3×3 | 47% | 4× column unroll, prefetch, restrict |
| Edge detect | 41% | f32x4 to f32x8 upgrade |
The humbling part: I'm not an optimization expert. But it turns out you don't need to be one. You need a benchmark harness, a correctness check, and a system that's willing to try things you wouldn't think of.
The Numbers
Source: 12,000 lines of Rust
Tests: 475 end-to-end
Test method: compile Eä → link with C → run binary → compare stdout
CI: x86-64, AArch64 (native), Windows
LLVM backend: 18.1 via inkwell 0.8
Binding targets: Python, Rust, C++, PyTorch, CMake
Performance on a real workload (16M float32 elements):
| Kernel | vs baseline | Throughput |
|---|---|---|
| FMA (fused multiply-add) | 6.6× faster than NumPy | 37.0 GB/s |
| Dot product | matches BLAS | 36.6 GB/s |
| SAXPY | 2.1× faster than NumPy | 35.2 GB/s |
That's ~37 GB/s on a system with ~40 GB/s memory bandwidth. Near the hardware limit. There's not much room left, which means the code is doing roughly as little unnecessary work as possible.
What I'd Do Differently
Start with the binding generator. I built the compiler first and added bindings later. But the bindings are what make Eä useful. If I'd started by designing the ideal Python API and worked backward to the compiler, some early decisions would have been different.
Add ea inspect earlier. The instruction analysis tool that shows you vector/scalar instruction counts, FMA operations, load/store ratio, and performance hints. I added it late, but it would have caught optimization issues months earlier.
Write fewer features, sooner. Eä has kernel, foreach, for, while, structs, output annotations, conditional compilation, static assertions, and 30+ intrinsics. Most users need kernel, load, store, fma, and splat. I should have shipped a useful subset earlier and iterated based on real usage.
The Real Story
I'm not a compiler engineer. I don't have a CS degree. I'm the kind of person who has ideas and wants to see if they work.
What changed is the tooling. I built Eä with the help of AI models. Claude for the heavy lifting, my own judgment for the architecture and design decisions. The hard rules came from me (learned the painful way from the first attempt). The implementation speed came from having a capable coding assistant.
A year ago, this would have required a team. Now it's 12,000 lines and one person.
The interesting question isn't "can you build this?" anymore. It's what are you going to build next?
My advice: if you have an idea that feels too ambitious for one person, the calculus has changed. Try it. Set hard rules so the codebase stays manageable. Write tests for everything. And don't be afraid to throw away your first attempt.
I threw away 10K lines of bad compiler and started over. Best decision I made.
Eä is open-source under Apache 2.0. GitHub · Documentation · pip install ea-compiler