Make Python Fast in 60 Seconds
Eä is a compute kernel compiler. You write small, focused compute functions — Eä compiles them to native code and gives you Python functions that take NumPy arrays.
No C. No Cython. No Numba JIT warmup. Just fast code.
The pitch
pip install ea-compiler
Write a kernel (scale.ea):
export func scale(src: *f32, dst: *mut f32, factor: f32, n: i32) {
let mut i: i32 = 0
while i < n {
dst[i] = src[i] * factor
i = i + 1
}
}
Call it from Python:
import ea
import numpy as np
kernel = ea.load("scale.ea")
src = np.random.randn(1_000_000).astype(np.float32)
dst = np.empty_like(src)
kernel.scale(src, dst, factor=2.0)
That's it. ea.load() compiles your kernel to a native shared library, caches it, and gives you a callable Python function. No Rust, no LLVM, no build step — just pip install and go.
What makes Eä different?
- Explicit SIMD. You control the vector width. No auto-vectorization guessing.
- No runtime. Compiles to plain .so/.dll files. Zero overhead beyond the function call.
- No magic. If your code looks like it processes 8 floats at a time, it does. No silent scalar fallback.
- Instant bindings. Python, Rust, C++, PyTorch — generated from one source.
Next steps
Installation
From pip (recommended)
pip install ea-compiler
This gives you the ea compiler and the Python ea.load() API. The only additional dependency is NumPy.
Works on:
- Linux x86_64
- Linux aarch64 (ARM)
- Windows x86_64
Verify installation
import ea
print(ea.__version__) # e.g., "1.7.0"
print(ea.compiler_version()) # same, from the bundled binary
Building from source
For development or unsupported platforms, see the eacompute README for instructions on building the compiler from source. This requires Rust and LLVM 18.
Your First Kernel
Let's write a kernel that scales an array by a constant factor, then upgrade it to use SIMD.
Step 1: Scalar version
Create scale.ea:
export func scale(src: *f32, dst: *mut f32, factor: f32, n: i32) {
let mut i: i32 = 0
while i < n {
dst[i] = src[i] * factor
i = i + 1
}
}
Key things to notice:
- export makes the function callable from Python (C ABI)
- *f32 is an immutable pointer to float32 — your input array
- *mut f32 is a mutable pointer — your output array
- n: i32 is the array length — the caller provides this
- All types are explicit. No inference, no ambiguity.
Step 2: Call from Python
import ea
import numpy as np
kernel = ea.load("scale.ea")
src = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
dst = np.empty_like(src)
kernel.scale(src, dst, factor=2.0)
print(dst) # [2. 4. 6. 8.]
ea.load() compiles scale.ea the first time, then caches the result in __eacache__/. Subsequent calls load from cache instantly.
Step 3: SIMD version
Now let's process 8 floats at a time:
export func scale_simd(src: *f32, dst: *mut f32, factor: f32, n: i32) {
let s: f32x8 = splat(factor)
let mut i: i32 = 0
while i < n {
let v: f32x8 = load(src, i)
store(dst, i, v .* s)
i = i + 8
}
}
What changed:
- f32x8 — a vector of 8 floats (256-bit, uses AVX2 on x86)
- splat(factor) — broadcasts the scalar to all 8 lanes
- load(src, i) — loads 8 consecutive floats starting at index i
- v .* s — element-wise multiply (the dot prefix . means "vector operation")
- store(dst, i, ...) — writes 8 floats back
- We increment by 8, not 1
Important: This only works when n is a multiple of 8. For arbitrary lengths, see Kernels for tail-handling strategies.
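The multiple-of-8 requirement is easy to check on the Python side. A minimal sketch (plain NumPy, no Eä involved) that emulates the 8-wide chunking and shows it matches the scalar result when the length divides evenly:

```python
import numpy as np

def scale_simd_emulated(src, dst, factor):
    """Emulate the 8-wide kernel: one 8-element chunk per iteration."""
    n = len(src)
    assert n % 8 == 0, "scale_simd requires n to be a multiple of 8"
    for i in range(0, n, 8):
        dst[i:i+8] = src[i:i+8] * factor  # stands in for one f32x8 multiply

src = np.arange(16, dtype=np.float32)
dst = np.empty_like(src)
scale_simd_emulated(src, dst, 2.0)
assert np.array_equal(dst, src * 2.0)  # same answer as the scalar version
```

With a length like 10, the last two elements would fall outside any full chunk, which is exactly the tail problem the Kernels page addresses.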
Step 4: Compare performance
import ea
import numpy as np
import time
kernel = ea.load("scale.ea")
src = np.random.randn(10_000_000).astype(np.float32)
dst = np.empty_like(src)
# Eä scalar
start = time.perf_counter()
for _ in range(100):
kernel.scale(src, dst, factor=2.0)
ea_scalar = (time.perf_counter() - start) / 100
# Eä SIMD
start = time.perf_counter()
for _ in range(100):
kernel.scale_simd(src, dst, factor=2.0)
ea_simd = (time.perf_counter() - start) / 100
# NumPy
start = time.perf_counter()
for _ in range(100):
np.multiply(src, 2.0, out=dst)
numpy_time = (time.perf_counter() - start) / 100
print(f"Eä scalar: {ea_scalar*1000:.2f} ms")
print(f"Eä SIMD: {ea_simd*1000:.2f} ms")
print(f"NumPy: {numpy_time*1000:.2f} ms")
For this simple operation, all three will be similar — it's bandwidth-bound (one operation per element loaded). Eä shines on compute-bound workloads where there are multiple operations per element. See the Cookbook for real-world comparisons.
Why Ea
What Ea is
Ea is a compute kernel compiler. You write small, focused numerical routines in Ea's explicit syntax, compile them to native shared libraries (.so on Linux, .dll on Windows), and call them from Python, Rust, C++, or PyTorch via C ABI. Ea is not a general-purpose programming language. It has no standard library, no garbage collector, no runtime. It compiles kernels.
The problem
Python is slow for tight numerical loops. When you need to run a custom algorithm over millions of elements -- a stencil, a reduction with custom logic, a chain of fused multiply-adds -- you hit a wall. The usual options are:
- NumPy: fast for operations it supports, but custom multi-step algorithms require chaining calls that allocate intermediates and re-scan memory.
- Numba: JIT-compiles Python, but debugging is difficult, compilation is unpredictable, and SIMD vectorization is implicit (you hope the compiler figures it out).
- Cython: requires learning a hybrid language, managing build systems, and still gives you limited control over vectorization.
- Writing C extensions: full control, but high friction. Manual Python/C bridging, header files, build scripts.
Ea targets the gap: you want native-speed kernels with explicit SIMD, but you do not want to maintain a C build system.
The philosophy
Ea is built on one principle: explicit over implicit.
- If you write f32x8, you get 8-wide SIMD. The compiler will not silently fall back to scalar code.
- If hardware does not support an operation (e.g., scatter without AVX-512), the compiler errors. It does not emit slow scalar code behind your back.
- All types are explicit. No type inference. You see exactly what is happening.
- All memory is caller-provided. Ea kernels never allocate. Pointers come from the host language.
There are no hidden performance cliffs. The code you write is the code that runs.
When to use Ea
Ea excels at compute-bound workloads where you do significant work per element loaded from memory:
- Stencil operations: convolutions, blurs, edge detection -- each output element reads from multiple inputs.
- Fused multiply-add chains: polynomial evaluation, IIR filters, dot products.
- Custom reductions: computing statistics, finding patterns, accumulating results with non-trivial logic.
- Particle simulations: N-body interactions, force calculations.
- Image processing pipelines: per-pixel math with multiple operations fused into one pass.
The common thread: you load data, do many arithmetic operations on it, then store results. The CPU spends most of its time computing, not waiting for memory.
When NOT to use Ea
Ea is the wrong tool for:
- Bandwidth-bound workloads: if you are just adding two arrays element-wise, NumPy already saturates memory bandwidth. Ea cannot make memory faster.
- General programming: no strings, no file I/O, no networking, no data structures beyond structs.
- Prototyping: write your algorithm in Python first, profile it, then port the hot loop to Ea.
- GPU workloads: Ea targets CPUs (x86-64 with AVX2/AVX-512, AArch64 with NEON).
A good rule of thumb: if your inner loop does fewer than 4 arithmetic operations per element loaded, NumPy is probably fast enough.
The compilation model
kernel.ea --(ea --lib)--> kernel.so + kernel.ea.json
                                     |
                             ea bind --python
                                     |
                          kernel.py (generated wrapper)
One .ea file is one compilation unit. There are no imports, no modules. If you need to compose kernels, you do it at the C level -- each kernel is an independent shared library with a C ABI entry point.
The generated .ea.json metadata file describes the function signatures. The ea bind command reads it to generate idiomatic wrappers for your target language.
Quick taste
Here is a complete Ea kernel that scales an array of floats using 8-wide SIMD:
export kernel vscale(data: *f32, out: *mut f32, factor: f32)
over i in n step 8
tail scalar {
out[i] = data[i] * factor
}
{
let vf: f32x8 = splat(factor)
store(out, i, load(data, i) .* vf)
}
Compile and use from Python:
ea kernel.ea --lib
ea bind kernel.ea --python
import numpy as np
from kernel import vscale
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0], dtype=np.float32)
result = vscale(data, factor=3.0)
# result: [3.0, 6.0, 9.0, 12.0, 15.0]
The main body processes 8 elements at a time with SIMD. The tail scalar block handles any remainder elements one at a time. The n parameter (the array length) is automatically injected into the function signature from the over i in n clause.
Language Basics
This page covers Ea's scalar language features. For SIMD vector types and operations, see SIMD.
Scalar types
Ea has fixed-size numeric types. No type inference -- you always write the type explicitly.
| Type | Description |
|---|---|
| i8, i16, i32, i64 | Signed integers |
| u8, u16, u32, u64 | Unsigned integers |
| f32, f64 | Floating point |
| bool | Boolean (true / false) |
Variables
All variables must have an explicit type annotation. Variables are immutable by default.
let x: i32 = 5
let y: f32 = 3.14
let flag: bool = true
To make a variable mutable, use mut:
let mut counter: i32 = 0
counter = counter + 1
Constants
Compile-time constants use const:
const PI: f32 = 3.14159
const BATCH_SIZE: i32 = 256
const EPSILON: f64 = 1e-9
Constants can be used in static_assert for compile-time checks:
const STEP: i32 = 8
static_assert(STEP > 0, "step must be positive")
Arithmetic and comparison
Standard arithmetic operators work on all numeric types:
let a: i32 = 10 + 3 // 13
let b: i32 = 10 - 3 // 7
let c: i32 = 10 * 3 // 30
let d: i32 = 10 / 3 // 3 (integer division)
let e: i32 = 10 % 3 // 1 (remainder)
Comparison operators return bool:
let lt: bool = a < b
let gt: bool = a > b
let le: bool = a <= b
let ge: bool = a >= b
let eq: bool = a == b
let ne: bool = a != b
Logical operators
Ea uses words, not symbols, for logical operations:
let both: bool = a > 0 and b > 0
let either: bool = a > 0 or b > 0
let neither: bool = not (a > 0 or b > 0)
Control flow
if / else if / else
if x > 0 {
println(1)
} else if x == 0 {
println(0)
} else {
println(-1)
}
while loops
let mut i: i32 = 0
while i < n {
out[i] = data[i] * 2
i = i + 1
}
for loops
Counted loops with an explicit step:
for i in 0..n step 1 {
out[i] = data[i] * 2
}
The step is required. The range 0..n is half-open: it includes 0 but excludes n.
foreach loops
A simpler counted loop when the step is always 1:
foreach (i in 0..n) {
out[i] = data[i] * 2
}
Loop unrolling
Wrap a loop in unroll(N) to unroll it at compile time:
unroll(4) {
foreach (j in 0..4) {
out[base + j] = data[base + j] * factor
}
}
Functions
Functions are declared with func. All parameter and return types are explicit:
func square(x: f32) -> f32 {
return x * x
}
func add(a: i32, b: i32) -> i32 {
return a + b
}
Functions without a return type return nothing (void):
func fill(out: *mut i32, n: i32, val: i32) {
foreach (i in 0..n) {
out[i] = val
}
}
Exported functions
To make a function callable from C/Python/Rust, prefix it with export:
export func dot_product(a: *f32, b: *f32, n: i32) -> f32 {
let mut sum: f32 = 0.0
foreach (i in 0..n) {
sum = sum + a[i] * b[i]
}
return sum
}
Only export func (and export kernel) produce symbols visible from outside. Non-exported functions are internal helpers.
Pointers
Pointers are how kernels receive data from the host language. There are four pointer variants:
| Syntax | Meaning |
|---|---|
| *T | Read-only pointer |
| *mut T | Mutable pointer (can write through it) |
| *restrict T | Read-only, no aliasing (enables optimizations) |
| *restrict mut T | Mutable, no aliasing |
Pointer indexing
Read from a pointer with bracket indexing:
let val: f32 = data[i] // data is *f32
Write through a mutable pointer:
out[i] = val // out is *mut f32
Type casts
Explicit casts convert between numeric types:
let x: i32 = 42
let f: f32 = to_f32(x) // 42.0
let d: f64 = to_f64(x) // 42.0
let back: i32 = to_i32(f) // 42
let wide: i64 = to_i64(x) // 42
There are no implicit conversions. Mixing types without a cast is a compile error.
println
println is the only output primitive. It exists for debugging:
println(42)
println(3.14)
println(true)
println("hello")
It accepts integers, floats, bools, and string literals. It does not support format strings.
What does not exist
Ea is deliberately minimal. The following features do not exist and are not planned:
- No generics or templates
- No traits or interfaces
- No modules or imports
- No heap allocation
- No strings (except literal arguments to println)
- No semicolons (statements are newline-separated)
- No closures or lambdas
- No enums or pattern matching
- No exceptions or error handling
One file, one compilation unit. Compose at the C level.
SIMD
SIMD (Single Instruction, Multiple Data) lets you process multiple values in a single CPU instruction. Ea gives you direct control over SIMD vectors -- what you write is what the CPU executes.
Vector types
A vector type holds a fixed number of elements of the same scalar type. The name encodes both the element type and the lane count:
| Type | Elements | Bits | x86 requirement | ARM requirement |
|---|---|---|---|---|
| f32x4 | 4 x f32 | 128 | SSE | NEON |
| i32x4 | 4 x i32 | 128 | SSE | NEON |
| u8x16 | 16 x u8 | 128 | SSE | NEON |
| i8x16 | 16 x i8 | 128 | SSE | NEON |
| i16x8 | 8 x i16 | 128 | SSE | NEON |
| f64x2 | 2 x f64 | 128 | SSE2 | NEON |
| f32x8 | 8 x f32 | 256 | AVX2 | not available |
| i32x8 | 8 x i32 | 256 | AVX2 | not available |
| u8x32 | 32 x u8 | 256 | AVX2 | not available |
| f64x4 | 4 x f64 | 256 | AVX2 | not available |
| f32x16 | 16 x f32 | 512 | AVX-512 | not available |
If you use f32x8 on a machine without AVX2, or on ARM, the compiler will error. No silent scalar fallback.
To enable 512-bit vectors, compile with --avx512:
ea kernel.ea --lib --avx512
Dot operators
Element-wise vector operations use dot-prefixed operators. This distinguishes them from scalar operations and makes SIMD explicit in the source:
| Operator | Meaning |
|---|---|
| .+ | Element-wise add |
| .- | Element-wise subtract |
| .* | Element-wise multiply |
| ./ | Element-wise divide |
| .> | Element-wise greater than (returns bool vector) |
| .< | Element-wise less than |
| .>= | Element-wise greater or equal |
| .<= | Element-wise less or equal |
| .== | Element-wise equal |
| .!= | Element-wise not equal |
| .& | Element-wise bitwise AND |
| .\| | Element-wise bitwise OR |
| .^ | Element-wise bitwise XOR |
| .<< | Element-wise shift left |
| .>> | Element-wise shift right (logical for unsigned, arithmetic for signed) |
Example:
let a: f32x4 = load(data, i)
let b: f32x4 = load(data, i + 4)
let sum: f32x4 = a .+ b
let product: f32x4 = a .* b
let mask: f32x4 = a .> b
Creating vectors
splat
Broadcast a scalar value to all lanes:
let factor: f32 = 2.5
let vf: f32x8 = splat(factor)
// vf = [2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5]
Loading from memory
Load a vector from a pointer at an element offset:
let v: f32x8 = load(data, i) // loads data[i..i+8]
The offset i is in elements, not bytes. load(data, 8) loads elements 8 through 15.
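In NumPy terms, an element offset is just an array index. A sketch of what load(data, 8) reads on an f32 buffer (load_f32x8_emulated is a hypothetical helper for illustration, not part of the ea package):

```python
import numpy as np

data = np.arange(32, dtype=np.float32)

def load_f32x8_emulated(arr, i):
    """Element offset: reads arr[i..i+8], i.e. bytes i*4 through i*4+31."""
    return arr[i:i+8].copy()

v = load_f32x8_emulated(data, 8)   # elements 8 through 15
assert np.array_equal(v, np.arange(8, 16, dtype=np.float32))
```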
There are also typed load intrinsics when you want to be explicit about the element type:
let v: f32x8 = load_f32(data, i)
let v: f32x4 = load_f32x4(data, i)
Storing to memory
Write a vector to a mutable pointer:
store(out, i, result) // writes result to out[i..i+8]
Element access
Read individual lanes from a vector by index:
let v: f32x4 = load(data, 0)
let first: f32 = v[0]
let second: f32 = v[1]
Conditional selection
select picks lanes from two vectors based on a mask:
let mask: f32x4 = a .> b
let result: f32x4 = select(mask, a, b)
// where a > b, take a; otherwise take b
This compiles to a single blend instruction. There is no branching.
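The same per-lane selection can be expressed in NumPy with np.where, which is a convenient way to check a kernel's select logic from Python before porting it:

```python
import numpy as np

a = np.array([1.0, 5.0, 2.0, 8.0], dtype=np.float32)
b = np.array([3.0, 4.0, 6.0, 7.0], dtype=np.float32)

mask = a > b                   # per-lane comparison, like a .> b
result = np.where(mask, a, b)  # like select(mask, a, b)

# Selecting the larger lane everywhere is an element-wise max
assert np.array_equal(result, np.maximum(a, b))
```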
Scalar vs SIMD comparison
Here is the same operation -- scaling an array -- written both ways.
Scalar (processes one element per iteration):
export func scale(data: *f32, out: *mut f32, factor: f32, n: i32) {
foreach (i in 0..n) {
out[i] = data[i] * factor
}
}
SIMD (processes 8 elements per iteration):
export kernel vscale(data: *f32, out: *mut f32, factor: f32)
over i in n step 8
tail scalar {
out[i] = data[i] * factor
}
{
let vf: f32x8 = splat(factor)
store(out, i, load(data, i) .* vf)
}
The SIMD version does 8 multiplications in a single instruction. On a workload that is compute-bound (multiple operations per element), this translates to a proportional speedup.
Hardware targeting
By default, Ea targets AVX2 on x86-64 and NEON on AArch64. You can change this:
# Default (AVX2 on x86)
ea kernel.ea --lib
# Enable AVX-512
ea kernel.ea --lib --avx512
# Target a specific CPU
ea kernel.ea --lib --target=skylake
# Cross-compile for ARM
ea kernel.ea --lib --target-triple=aarch64-unknown-linux-gnu
The compiler rejects code that requires features the target does not have. If you use f32x8 and target ARM, you get a compile error, not slow code.
Kernels
The kernel construct is Ea's main abstraction for writing vectorized loops with automatic tail handling. It is syntactic sugar -- the compiler transforms it into a plain function with a while-loop before any further compilation.
Basic syntax
export kernel name(params)
over i in n step S
tail strategy { tail_body }
{
main_body
}
- name: the kernel's name, becomes the C ABI symbol
- params: function parameters (pointers, scalars)
- over i in n: the loop variable i iterates from 0 to n
- step S: how many elements the main body processes per iteration
- tail strategy: how to handle remainder elements when n is not a multiple of S
- main_body: the code that runs for each full chunk of S elements
The range variable (here n) is automatically injected as an i32 parameter into the function signature. You do not declare it in the parameter list.
How it desugars
A kernel like this:
export kernel scale(data: *f32, out: *mut f32, factor: f32)
over i in n step 4
tail scalar { out[i] = data[i] * factor }
{
out[i] = data[i] * factor
out[i + 1] = data[i + 1] * factor
out[i + 2] = data[i + 2] * factor
out[i + 3] = data[i + 3] * factor
}
becomes equivalent to:
export func scale(data: *f32, out: *mut f32, factor: f32, n: i32) {
let mut i: i32 = 0
while i + 4 <= n {
out[i] = data[i] * factor
out[i + 1] = data[i + 1] * factor
out[i + 2] = data[i + 2] * factor
out[i + 3] = data[i + 3] * factor
i = i + 4
}
// tail: process remainder one at a time
while i < n {
out[i] = data[i] * factor
i = i + 1
}
}
The generated C signature is void scale(const float*, float*, float, int) -- the n parameter appears last.
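The desugared shape translates directly to Python, which is a handy way to sanity-check tail behavior. A sketch with n = 10 and step 4 (so the tail must cover elements 8 and 9):

```python
import numpy as np

def scale_desugared(data, out, factor, n, step=4):
    i = 0
    while i + step <= n:                      # main loop: full chunks only
        out[i:i+step] = data[i:i+step] * factor
        i += step
    while i < n:                              # tail: remainder one at a time
        out[i] = data[i] * factor
        i += 1

data = np.arange(10, dtype=np.float32)
out = np.empty_like(data)
scale_desugared(data, out, 3.0, n=10)
assert np.array_equal(out, data * 3.0)  # tail covered elements 8 and 9
```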
Tail strategies
The tail handles remainder elements when n is not evenly divisible by the step.
tail scalar
Process remainder elements one at a time. The tail body runs in a loop with step 1:
export kernel vscale(data: *f32, out: *mut f32, factor: f32)
over i in n step 8
tail scalar {
out[i] = data[i] * factor
}
{
let vf: f32x8 = splat(factor)
store(out, i, load(data, i) .* vf)
}
The main body uses 8-wide SIMD. The tail body is scalar code that handles 0 to 7 leftover elements. This is the most common tail strategy.
tail mask
Use masked load/store for the remainder. The tail body runs once (not in a loop) and must handle all remaining elements using masked operations:
export kernel vscale(data: *f32, out: *mut f32, factor: f32)
over i in len step 4
tail mask {
let rem: i32 = len - i
let vf: f32x4 = splat(factor)
let v: f32x4 = load_masked(data, i, rem)
store_masked(out, i, v .* vf, rem)
}
{
let vf: f32x4 = splat(factor)
store(out, i, load(data, i) .* vf)
}
The rem variable tells the masked operations how many elements are valid. This avoids the scalar loop entirely but requires masked intrinsics.
tail pad
The caller guarantees the input length is a multiple of the step. No tail body is generated:
export kernel fill(out: *mut i32, val: i32)
over i in n step 4
tail pad
{
out[i] = val
out[i + 1] = val
out[i + 2] = val
out[i + 3] = val
}
This produces the most efficient code but shifts the responsibility to the caller. If n is not a multiple of the step, the kernel will skip the remaining elements.
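One way for the caller to uphold the tail pad contract is to pad the buffers up front. A host-side sketch using np.pad (this is ordinary NumPy, not part of the generated bindings):

```python
import numpy as np

def pad_to_multiple(arr, step):
    """Round the array length up to a multiple of `step`, zero-filling."""
    rem = len(arr) % step
    if rem == 0:
        return arr
    return np.pad(arr, (0, step - rem))

data = np.arange(10, dtype=np.float32)
padded = pad_to_multiple(data, 4)
assert len(padded) == 12                       # now safe for a step-4 kernel
assert padded[10] == 0.0 and padded[11] == 0.0  # zero-filled padding
```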
No tail clause
If you omit the tail clause entirely, the kernel only runs the main body. Remaining elements are not processed:
export kernel double_it(data: *i32, out: *mut i32)
over i in n step 1
{
out[i] = data[i] * 2
}
With step 1, there is no remainder, so omitting the tail is safe. With larger steps, you must ensure n is always a multiple of the step, or accept that trailing elements are skipped.
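To make the skipped-elements behavior concrete, here is a small Python sketch of which indices the main body touches when there is no tail clause, using step 4 and n = 10:

```python
def main_body_only_indices(n, step):
    """Indices processed by a kernel with no tail clause."""
    processed = []
    i = 0
    while i + step <= n:      # only full chunks run
        processed.extend(range(i, i + step))
        i += step
    return processed

# With n = 10 and step 4, elements 8 and 9 are never touched.
assert main_body_only_indices(10, 4) == list(range(8))
```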
Complete example
A SIMD dot product kernel that handles any input length:
export kernel dot(a: *f32, b: *f32, out: *mut f32)
over i in n step 8
tail scalar {
out[0] = out[0] + a[i] * b[i]
}
{
let va: f32x8 = load(a, i)
let vb: f32x8 = load(b, i)
let products: f32x8 = va .* vb
let sum: f32 = reduce_add(products)
out[0] = out[0] + sum
}
The main body loads 8-wide vectors, multiplies them, reduces to a scalar sum, and accumulates into out[0]. The scalar tail handles any remaining 0-7 elements.
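The accumulation pattern can be mirrored in NumPy to check the result for an awkward length. A sketch where per-chunk np.sum stands in for reduce_add:

```python
import numpy as np

def dot_emulated(a, b, step=8):
    """Chunked dot product: per-chunk sum (reduce_add), then a scalar tail."""
    n = len(a)
    acc = np.float32(0.0)
    i = 0
    while i + step <= n:
        acc += np.float32(np.sum(a[i:i+step] * b[i:i+step]))  # reduce_add
        i += step
    while i < n:                                              # scalar tail
        acc += a[i] * b[i]
        i += 1
    return acc

a = np.arange(11, dtype=np.float32)   # length 11: one full chunk + 3-element tail
b = np.ones(11, dtype=np.float32)
assert dot_emulated(a, b) == np.float32(55.0)
```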
Structs
Structs in Ea are plain data containers with C-compatible memory layout. They have no methods, no constructors, no impl blocks. They exist so you can pass structured data between Ea kernels and host languages.
Defining a struct
struct Particle {
x: f32,
y: f32,
mass: f32,
}
Fields can be any scalar type. The memory layout matches C struct layout, so a Particle in Ea is identical to:
typedef struct { float x; float y; float mass; } Particle;
Creating and accessing
Inside Ea functions, you create a struct with literal syntax and access fields with dot notation:
func main() {
let p: Particle = Particle { x: 1.0, y: 2.0, mass: 10.0 }
println(p.x) // 1
println(p.mass) // 10
}
Mutable structs support field assignment:
func main() {
let mut p: Particle = Particle { x: 0.0, y: 0.0, mass: 1.0 }
p.x = 3.5
p.y = 7.0
println(p.x) // 3.5
}
Struct pointers
In exported functions, structs are typically passed via pointer from the host language:
struct Point {
x: f32,
y: f32,
}
export func get_x(p: *Point) -> f32 {
return p.x
}
export func set_point(p: *mut Point, nx: f32, ny: f32) {
p.x = nx
p.y = ny
}
Read-only pointers (*Point) allow field reads. Mutable pointers (*mut Point) allow field writes.
Arrays of structs
Pointer indexing works with struct arrays. Each element is a full struct:
struct Vec2 {
x: f32,
y: f32,
}
export func sum_x(vecs: *Vec2, n: i32) -> f32 {
let mut total: f32 = 0.0
let mut i: i32 = 0
while i < n {
total = total + vecs[i].x
i = i + 1
}
return total
}
From C, this is called with a pointer to a contiguous array of structs:
typedef struct { float x; float y; } Vec2;
extern float sum_x(const Vec2*, int);
Vec2 vecs[] = { {1.0f, 10.0f}, {2.0f, 20.0f}, {3.0f, 30.0f} };
float result = sum_x(vecs, 3); // 6.0
Passing from Python
Since Ea structs match C layout, you can use NumPy structured arrays or ctypes:
import numpy as np
particle_dtype = np.dtype([
('x', np.float32),
('y', np.float32),
('mass', np.float32),
])
particles = np.zeros(1000, dtype=particle_dtype)
The generated Python bindings handle the pointer passing automatically.
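Before handing a structured array to a kernel, it can be worth verifying from Python that the dtype's layout matches the C struct. A quick check on the Particle layout (three packed f32 fields, 12 bytes):

```python
import numpy as np

particle_dtype = np.dtype([
    ('x', np.float32),
    ('y', np.float32),
    ('mass', np.float32),
])

# Layout matches the C struct: three contiguous f32 fields, 12 bytes total.
assert particle_dtype.itemsize == 12
assert particle_dtype.fields['mass'][1] == 8   # byte offset of `mass`

particles = np.zeros(4, dtype=particle_dtype)
particles['mass'] = 1.0
assert particles[0]['mass'] == 1.0
```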
Limitations
- No methods or impl blocks. Structs are data only.
- No nested structs.
- No generics. Write separate struct definitions for each concrete type.
- Struct fields must be scalar types.
Common Intrinsics
This page covers the most frequently used intrinsics. For the complete list, see the Intrinsics Reference.
splat
Broadcast a scalar to all lanes of a vector:
let factor: f32 = 2.5
let vf: f32x8 = splat(factor)
// vf = [2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5]
Works with all vector types. The return type is inferred from the variable's type annotation.
load / store
Load a vector from a pointer at an element offset. Store writes a vector back:
let v: f32x8 = load(data, i) // read 8 floats starting at data[i]
store(out, i, v) // write 8 floats starting at out[i]
Offsets are in elements, not bytes.
Typed loads
When you want to be explicit about the element type:
let v: f32x8 = load_f32(data, i)
let v: f32x4 = load_f32x4(data, i)
These are equivalent to plain load but make the element type visible in the source.
stream_store
Non-temporal store that bypasses the CPU cache. Use for write-only output buffers where you will not read the data back soon:
let result: f32x8 = a .* b
stream_store(out, i, result)
This avoids polluting the cache with output data, leaving more cache space for inputs. Only beneficial for large output arrays that will not be re-read immediately.
fma
Fused multiply-add: computes a * b + c in a single instruction with a single rounding (more accurate than separate multiply and add):
// Scalar
let result: f32 = fma(a, b, c)
// Vector
let va: f32x8 = load(a, i)
let vb: f32x8 = load(b, i)
let vc: f32x8 = load(c, i)
let result: f32x8 = fma(va, vb, vc)
Works on f32, f64, and all float vector types. Maps to the hardware FMA instruction.
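The single-rounding difference is observable from Python. In this sketch the f32 product of a and b loses its low bits when rounded separately, but survives when the multiply-add is done in one step (emulated here in f64, which holds the exact f32 product):

```python
import numpy as np

a = np.float32(1 + 2**-13)
b = np.float32(1 - 2**-13)
c = np.float32(-1.0)

# Unfused: the product (1 - 2**-26) rounds to 1.0 in f32 before the add.
unfused = np.float32(a * b) + c

# Fused (emulated via f64, which represents the f32 product exactly): one rounding.
fused = np.float32(float(a) * float(b) + float(c))

assert unfused == np.float32(0.0)   # low bits lost
assert fused != np.float32(0.0)     # the true answer, -2**-26, survives
```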
reduce_add
Sum all lanes of a vector down to a scalar:
let v: f32x8 = load(data, i)
let sum: f32 = reduce_add(v)
Works on all integer and float vector types. Useful for dot products, reductions, and histogram accumulation.
reduce_max / reduce_min
Find the maximum or minimum value across all lanes:
let v: f32x4 = load(data, i)
let biggest: f32 = reduce_max(v)
let smallest: f32 = reduce_min(v)
Works on integer and float vector types.
select
Per-lane conditional: where the mask is true, take from a; where false, take from b:
let mask: f32x8 = a .> b
let result: f32x8 = select(mask, a, b)
// element-wise max(a, b)
Compiles to a blend instruction. No branching. This is how you write branchless SIMD code.
sqrt / rsqrt
Square root and reciprocal square root (1/sqrt):
let x: f32 = 16.0
let root: f32 = sqrt(x) // 4.0
let v: f32x8 = load(data, i)
let roots: f32x8 = sqrt(v)
let inv_roots: f32x8 = rsqrt(v)
rsqrt uses the fast hardware approximation. Works on f32 and f32 vector types.
min / max
Element-wise minimum and maximum. Works on both scalars and vectors:
// Scalar
let smaller: f32 = min(a, b)
let larger: f32 = max(a, b)
// Vector
let va: f32x8 = load(data, i)
let vb: f32x8 = load(data, j)
let mins: f32x8 = min(va, vb)
let maxs: f32x8 = max(va, vb)
movemask
Extract comparison results to an integer bitmask. x86 only:
let mask: f32x8 = a .> b
let bits: i32 = movemask(mask)
// bit k is 1 if lane k of a > lane k of b
Useful for branching on SIMD comparison results or counting matching elements. Each bit in the result corresponds to one vector lane.
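The bit-per-lane packing can be emulated in Python, which also shows how counting matches works. A sketch with a 4-lane comparison:

```python
import numpy as np

a = np.array([1.0, 9.0, 2.0, 8.0], dtype=np.float32)
b = np.array([5.0, 5.0, 5.0, 5.0], dtype=np.float32)

mask = a > b                                     # [False, True, False, True]
bits = sum(1 << k for k, m in enumerate(mask) if m)

assert bits == 0b1010                  # bit k set where lane k matched
assert bin(bits).count("1") == 2       # popcount = number of matching lanes
```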
Masked memory operations
For tail handling, masked loads and stores read/write only the valid lanes:
let rem: i32 = n - i
let v: f32x4 = load_masked(data, i, rem) // load only 'rem' elements
store_masked(out, i, result, rem) // store only 'rem' elements
The rem parameter specifies how many elements (starting from lane 0) are valid. Lanes beyond rem are zero-filled on load and not written on store.
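These semantics are easy to emulate with NumPy slices; the helpers below are illustrative stand-ins for the intrinsics, not part of the real API:

```python
import numpy as np

def load_masked_emulated(arr, i, rem, width=4):
    """First `rem` lanes come from arr[i:]; remaining lanes are zero-filled."""
    v = np.zeros(width, dtype=arr.dtype)
    v[:rem] = arr[i:i+rem]
    return v

def store_masked_emulated(out, i, v, rem):
    """Write only the first `rem` lanes; lanes beyond rem are untouched."""
    out[i:i+rem] = v[:rem]

data = np.arange(1, 7, dtype=np.float32)   # 6 elements, step 4 -> rem = 2
v = load_masked_emulated(data, 4, rem=2)
assert np.array_equal(v, np.array([5.0, 6.0, 0.0, 0.0], dtype=np.float32))
```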
Summary table
| Intrinsic | Input | Output | Description |
|---|---|---|---|
| splat(s) | scalar | vector | Broadcast to all lanes |
| load(ptr, i) | pointer, offset | vector | Load vector from memory |
| store(ptr, i, v) | pointer, offset, vector | void | Write vector to memory |
| stream_store(ptr, i, v) | pointer, offset, vector | void | Non-temporal write |
| fma(a, b, c) | 3 values | same type | a * b + c fused |
| reduce_add(v) | vector | scalar | Sum all lanes |
| reduce_max(v) | vector | scalar | Max across lanes |
| reduce_min(v) | vector | scalar | Min across lanes |
| select(m, a, b) | mask, 2 vectors | vector | Per-lane conditional |
| sqrt(x) | scalar/vector | same type | Square root |
| rsqrt(x) | scalar/vector | same type | Reciprocal square root |
| min(a, b) | 2 values | same type | Element-wise minimum |
| max(a, b) | 2 values | same type | Element-wise maximum |
| movemask(m) | bool vector | i32 | Extract lane mask to bits |
Type System
Ea is statically typed with no implicit conversions. Every variable, parameter, and expression has a concrete type known at compile time.
Scalar Types
| Type | Size | Description |
|---|---|---|
| i8 | 1 byte | Signed 8-bit integer |
| u8 | 1 byte | Unsigned 8-bit integer |
| i16 | 2 bytes | Signed 16-bit integer |
| u16 | 2 bytes | Unsigned 16-bit integer |
| i32 | 4 bytes | Signed 32-bit integer |
| u32 | 4 bytes | Unsigned 32-bit integer |
| i64 | 8 bytes | Signed 64-bit integer |
| u64 | 8 bytes | Unsigned 64-bit integer |
| f32 | 4 bytes | 32-bit float (IEEE 754) |
| f64 | 8 bytes | 64-bit float (IEEE 754) |
| bool | 1 byte | Boolean (true or false) |
Integer literals default to i32. Float literals default to f32. Use explicit casts (to_f64(x), to_i64(x)) to convert between types.
Vector Types
Vector types hold multiple lanes of the same scalar type. Element-wise operations use dot-operators (.+, .-, .*, ./, etc.).
128-bit Vectors -- SSE (x86) / NEON (ARM)
| Type | Lanes | Element | Size |
|---|---|---|---|
| f32x4 | 4 | f32 | 16 bytes |
| f64x2 | 2 | f64 | 16 bytes |
| i32x4 | 4 | i32 | 16 bytes |
| i16x8 | 8 | i16 | 16 bytes |
| i8x16 | 16 | i8 | 16 bytes |
| u8x16 | 16 | u8 | 16 bytes |
| u16x8 | 8 | u16 | 16 bytes |
256-bit Vectors -- AVX2 (x86 only)
| Type | Lanes | Element | Size |
|---|---|---|---|
| f32x8 | 8 | f32 | 32 bytes |
| f64x4 | 4 | f64 | 32 bytes |
| i32x8 | 8 | i32 | 32 bytes |
| i16x16 | 16 | i16 | 32 bytes |
| i8x32 | 32 | i8 | 32 bytes |
| u8x32 | 32 | u8 | 32 bytes |
| u16x16 | 16 | u16 | 32 bytes |
These types produce a compile error on ARM targets.
512-bit Vectors -- AVX-512 (x86, --avx512 flag required)
| Type | Lanes | Element | Size |
|---|---|---|---|
| f32x16 | 16 | f32 | 64 bytes |
| f64x8 | 8 | f64 | 64 bytes |
| i32x16 | 16 | i32 | 64 bytes |
Using these types without --avx512 produces a compile error.
Pointer Types
Pointers represent caller-provided memory. Ea never allocates -- all memory comes from the host language.
| Syntax | Description |
|---|---|
| *T | Immutable pointer to T |
| *mut T | Mutable pointer to T |
| *restrict T | Immutable pointer, no-alias guarantee |
| *restrict mut T | Mutable pointer, no-alias guarantee |
The restrict qualifier tells the compiler that the pointer does not alias other pointers, enabling stronger optimizations. Use it when you can guarantee non-overlapping memory.
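From the Python side, NumPy's np.shares_memory is one way to check the non-overlap guarantee before passing buffers to a kernel declared with restrict pointers:

```python
import numpy as np

a = np.zeros(8, dtype=np.float32)
b = a[2:6]                    # a view into `a`: the buffers overlap
c = np.zeros(8, dtype=np.float32)

# restrict is only safe when the buffers are disjoint.
assert np.shares_memory(a, b)       # overlapping -> restrict would be wrong
assert not np.shares_memory(a, c)   # disjoint -> restrict is safe
```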
Struct Types
User-defined value types declared with struct:
struct Pixel {
r: f32,
g: f32,
b: f32,
a: f32,
}
Structs are passed by value. Access fields with dot syntax: pixel.r. Structs can contain scalar types, vector types, and other structs.
Type Rules
- No implicit conversions between types. Use to_f32(), to_i32(), etc.
- No generics or polymorphism. Write separate functions for each type.
- Vector dot-operators require both operands to have the same vector type.
- Comparisons on vectors (.==, .<, .>) produce boolean vectors, not scalar bools.
All Intrinsics
Complete reference for every built-in function in Ea.
Memory
load
Load a vector from a pointer at an element offset. The return type is inferred from context.
let v: f32x8 = load(ptr, i)
Typed Scalar Loads
Load a single scalar value from a pointer at an element offset.
| Intrinsic | Return Type |
|---|---|
| load_f32(ptr, i) | f32 |
| load_f64(ptr, i) | f64 |
| load_i32(ptr, i) | i32 |
| load_i16(ptr, i) | i16 |
| load_i8(ptr, i) | i8 |
| load_u8(ptr, i) | u8 |
| load_u16(ptr, i) | u16 |
| load_u32(ptr, i) | u32 |
| load_u64(ptr, i) | u64 |
let x: f32 = load_f32(ptr, i)
Typed Vector Loads
Load a full vector from a pointer at an element offset.
| Intrinsic | Return Type |
|---|---|
| load_f32x4(ptr, i) | f32x4 |
| load_f32x8(ptr, i) | f32x8 |
| load_f32x16(ptr, i) | f32x16 |
| load_i32x4(ptr, i) | i32x4 |
| load_i32x8(ptr, i) | i32x8 |
| load_i16x8(ptr, i) | i16x8 |
| load_i8x16(ptr, i) | i8x16 |
| load_u8x16(ptr, i) | u8x16 |
| load_u8x32(ptr, i) | u8x32 |
let v: f32x8 = load_f32x8(data, i * 8)
store
Write a vector to a pointer at an element offset.
store(out, i, result)
stream_store
Non-temporal store that bypasses the CPU cache. Use for write-only output that will not be read back soon.
stream_store(out, i, result)
load_masked
Masked vector load. Only the first rem lanes are loaded; lanes beyond rem are zero-filled.
let v: f32x8 = load_masked(ptr, i, rem)
store_masked
Masked vector store. Only lanes where the mask is true are written.
store_masked(out, i, value, mask);
gather
Load elements from scattered memory addresses using an index vector. x86 only -- not available on ARM.
let v: f32x8 = gather(ptr, indices);
scatter
Store elements to scattered memory addresses using an index vector. AVX-512 only (--avx512 flag required).
scatter(ptr, indices, values);
prefetch
Issue a prefetch hint to bring data into cache.
prefetch(ptr, i);
Math
sqrt
Square root. Works on scalar f32/f64 and all float vector types.
let y: f32 = sqrt(x);
let v: f32x8 = sqrt(vec);
rsqrt
Reciprocal square root (approximate). Scalar f32 and f32 vector types.
let y: f32 = rsqrt(x);
let v: f32x8 = rsqrt(vec);
exp
Exponential function. Float types.
let y: f32 = exp(x);
fma
Fused multiply-add: computes a * b + c in a single operation with one rounding step. Works on scalar f32 and all float vector types.
let y: f32 = fma(a, b, c);
let v: f32x8 = fma(va, vb, vc);
min
Element-wise minimum. Works on scalar (i32, f32, f64) and vector types.
let m: i32 = min(a, b);
let v: f32x8 = min(va, vb);
max
Element-wise maximum. Works on scalar (i32, f32, f64) and vector types.
let m: i32 = max(a, b);
let v: f32x8 = max(va, vb);
Reduction
Reduce a vector to a single scalar value.
reduce_add
Sum all lanes.
let sum: f32 = reduce_add(v); // f32x8 -> f32
let sum: i32 = reduce_add(iv); // i32x8 -> i32
reduce_max
Maximum across all lanes.
let m: f32 = reduce_max(v);
reduce_min
Minimum across all lanes.
let m: f32 = reduce_min(v);
reduce_add_fast
Unordered float reduction. Faster than reduce_add but does not guarantee summation order, so results may differ slightly due to floating-point rounding. Float vectors only.
let sum: f32 = reduce_add_fast(v);
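The ordering caveat is easy to demonstrate from Python: float addition is not associative, so summing the same values in a different order can round differently. `reduce_add` fixes one summation order; `reduce_add_fast` lets the compiler pick any order.

```python
# Float addition is not associative: the same three values summed in
# two different orders round to two different results.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c    # one summation order
right = a + (b + c)   # another summation order
print(left == right)  # False -- the two orders round differently
```

For most kernels the difference is a few ULPs; use `reduce_add` when bit-exact reproducibility matters.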
Vector
splat
Broadcast a scalar value to all lanes of a vector. The vector type is inferred from context.
let v: f32x8 = splat(1.0);
shuffle
Reorder lanes of a vector according to an index tuple. The indices are compile-time constants.
let reversed: f32x4 = shuffle(v, (3, 2, 1, 0));
select
Per-lane conditional select. Where the mask is true, take from a; where false, take from b.
let result: f32x8 = select(mask, a, b);
movemask
Extract a comparison result bitmask from a boolean vector to a scalar i32. Each bit corresponds to the sign bit of one lane. x86 only -- not available on ARM.
let bits: i32 = movemask(cmp_result);
Conversion
Scalar Casts
| Intrinsic | Description |
|---|---|
| `to_f32(x)` | Convert to `f32` |
| `to_f64(x)` | Convert to `f64` |
| `to_i32(x)` | Convert to `i32` |
| `to_i64(x)` | Convert to `i64` |
let f: f32 = to_f32(i);
let n: i32 = to_i32(x);
Widening Conversions
Widen narrow integer lanes to wider float or integer lanes. Only the first N lanes of the input are consumed.
| Intrinsic | Input | Output |
|---|---|---|
| `widen_i8_f32x4(v)` | `i8x16` | `f32x4` |
| `widen_u8_f32x4(v)` | `u8x16` | `f32x4` |
| `widen_i8_f32x8(v)` | `i8x16` | `f32x8` |
| `widen_u8_f32x8(v)` | `u8x16` | `f32x8` |
| `widen_u8_i32x4(v)` | `u8x16` | `i32x4` |
| `widen_u8_i32x8(v)` | `u8x16` | `i32x8` |
let pixels: f32x8 = widen_u8_f32x8(raw_bytes);
Narrowing Conversions
Convert wider lanes to narrower lanes, with clamping and rounding.
| Intrinsic | Input | Output |
|---|---|---|
| `narrow_f32x4_i8(v)` | `f32x4` | `i8` (4 bytes) |
let packed = narrow_f32x4_i8(float_pixels);
Multiply-Add Byte Pairs
Multiply unsigned bytes by signed bytes and add adjacent pairs. x86 only.
| Intrinsic | Signature | Description |
|---|---|---|
| `maddubs_i16(a, b)` | `(u8x16, i8x16) -> i16x8` | Multiply and add adjacent pairs to 16-bit |
| `maddubs_i32(a, b)` | `(u8x16, i8x16) -> i32x4` | Multiply and add adjacent quads to 32-bit |
let products: i16x8 = maddubs_i16(unsigned_bytes, signed_weights);
Debug
println
Print a value to stdout. Accepts scalars (i32, i64, u8, u16, u32, u64, f32, f64, bool), string literals, and vector types. Lowers to C printf. No format strings.
println(42);
println(3.14);
println("hello");
println(my_vector);
CLI Reference
The ea binary provides three commands: compile (default), bind, and inspect.
Compile
ea <file.ea> [flags]
Compile an Ea source file to a native object file (.o) by default.
Flags
| Flag | Effect |
|---|---|
| `-o <name>` | Link the object file into an executable via `cc` |
| `--lib` | Produce a shared library (`.so`/`.dll`) and metadata (`.ea.json`) |
| `--opt-level=N` | Optimization level 0--3 (default: 3) |
| `--avx512` | Enable AVX-512 vector types and intrinsics. Errors on ARM targets |
| `--target=CPU` | LLVM CPU name, e.g. `skylake`, `znver3`, `native` (default: `native`) |
| `--target-triple=T` | Cross-compile to a different architecture, e.g. `aarch64-unknown-linux-gnu` |
| `--emit-llvm` | Write LLVM IR to a `.ll` file and print it to stdout |
| `--emit-asm` | Write assembly to a `.s` file |
| `--header` | Generate a C header (`.h`) for the exported functions |
| `--emit-ast` | Print the parsed AST. Does not require LLVM |
| `--emit-tokens` | Print the token stream. Does not require LLVM |
| `--help` / `-h` | Print usage information |
| `--version` / `-V` | Print compiler version |
Examples
# Compile to object file
ea kernel.ea
# Compile and link to executable
ea kernel.ea -o kernel
# Build shared library for Python/Rust/C++ consumption
ea kernel.ea --lib
# Cross-compile for ARM
ea kernel.ea --lib --target-triple=aarch64-unknown-linux-gnu
# Emit LLVM IR for debugging
ea kernel.ea --emit-llvm
# Compile with AVX-512 support
ea kernel.ea --lib --avx512
# Generate C header
ea kernel.ea --header
Compiler status output goes to stderr, so --emit-llvm stdout is clean for piping.
Bind
ea bind <file.ea> --python [--rust] [--cpp] [--pytorch] [--cmake]
Generate language bindings from a compiled kernel. At least one language flag is required.
Requires the .ea.json metadata file produced by ea <file.ea> --lib. Run the --lib compile first.
Language Flags
| Flag | Output | Description |
|---|---|---|
| `--python` | `<name>.py` | Python wrapper using ctypes |
| `--rust` | `<name>.rs` | Rust FFI bindings |
| `--cpp` | `<name>.hpp` | C++ header with inline wrappers |
| `--pytorch` | `<name>_torch.py` | PyTorch custom op wrapper |
| `--cmake` | `CMakeLists.txt` | CMake build file for C++ integration |
Example
# Full workflow: compile, then generate Python bindings
ea kernel.ea --lib
ea bind kernel.ea --python
Inspect
ea inspect <file.ea> [target flags]
Post-optimization analysis of the compiled kernel. Shows instruction mix, loop structure, vector width usage, and register pressure. Accepts the same target flags as compile (--target, --target-triple, --avx512).
Example
ea inspect kernel.ea
ea inspect kernel.ea --avx512
ea inspect kernel.ea --target-triple=aarch64-unknown-linux-gnu
Print Target
ea --print-target
Print the resolved native CPU name for the current machine. Useful for understanding which CPU features the compiler will target by default.
Python API
The ea Python package (pip install ea-compiler) provides a high-level interface for compiling and loading Ea kernels.
Functions
ea.load
ea.load(path, *, target="native", opt_level=3, avx512=False) -> KernelModule
Compile an .ea file to a shared library and load it. Returns a KernelModule object with each exported function available as a method.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `path` | `str` | required | Path to the `.ea` source file |
| `target` | `str` | `"native"` | LLVM CPU name (e.g. `"skylake"`, `"native"`) |
| `opt_level` | `int` | `3` | Optimization level 0--3 |
| `avx512` | `bool` | `False` | Enable AVX-512 types and intrinsics |
import ea
import numpy as np
k = ea.load("kernel.ea")
a = np.array([1.0, 2.0, 3.0], dtype=np.float32)
b = np.zeros(3, dtype=np.float32)
k.my_func(a, b, len(a))
ea.compile
ea.compile(path, *, emit_asm=False, emit_llvm=False, target="native", opt_level=3, avx512=False, lib=True) -> Path
Compile an .ea file without loading it. Returns the path to the output file. Useful when you need the .so or .ea.json for another tool.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `path` | `str` | required | Path to the `.ea` source file |
| `emit_asm` | `bool` | `False` | Also write a `.s` assembly file |
| `emit_llvm` | `bool` | `False` | Also write a `.ll` LLVM IR file |
| `target` | `str` | `"native"` | LLVM CPU name |
| `opt_level` | `int` | `3` | Optimization level 0--3 |
| `avx512` | `bool` | `False` | Enable AVX-512 |
| `lib` | `bool` | `True` | Produce `.so` + `.ea.json` |
ea.clear_cache
ea.clear_cache(path=None)
Clear the compilation cache. If path is given, clear only the cache for that .ea file. If None, clear the entire cache directory.
ea.compiler_version
ea.compiler_version() -> str
Return the version string of the ea compiler binary.
ea.__version__
ea.__version__ -> str
The version of the ea Python package.
Exceptions
ea.CompileError
Raised when compilation fails. Inherits from RuntimeError.
| Attribute | Type | Description |
|---|---|---|
| `stderr` | `str` | Full compiler error output |
| `exit_code` | `int` | Compiler process exit code |
try:
    k = ea.load("broken.ea")
except ea.CompileError as e:
    print(e.stderr)
    print(e.exit_code)
Caching
Compiled shared libraries are cached in a __eacache__/ directory next to the .ea source file. The cache key includes the CPU name and compiler version:
__eacache__/{cpu}-{version}/kernel.so
The cache is invalidated by file modification time (mtime). If the source file is newer than the cached library, ea.load() recompiles automatically.
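The mtime comparison is simple enough to sketch in a few lines. This is an illustrative model of the check, not the actual `ea.load()` implementation, and the file names are hypothetical:

```python
import os
import tempfile

def cache_is_stale(src_path: str, lib_path: str) -> bool:
    """Recompile if the cached library is missing or older than the source."""
    if not os.path.exists(lib_path):
        return True
    return os.path.getmtime(src_path) > os.path.getmtime(lib_path)

# Demo with two temp files standing in for kernel.ea and the cached .so:
d = tempfile.mkdtemp()
src = os.path.join(d, "kernel.ea")
lib = os.path.join(d, "cached_kernel.so")
open(src, "w").close()
print(cache_is_stale(src, lib))   # True: nothing cached yet
open(lib, "w").close()
os.utime(src, (0, 100))           # source written at t=100
os.utime(lib, (0, 200))           # library built at t=200
print(cache_is_stale(src, lib))   # False: library newer than source
os.utime(src, (0, 300))           # source edited after the build
print(cache_is_stale(src, lib))   # True: recompile needed
```

Note that mtime-based invalidation does not detect changes that preserve the timestamp; touch the source file to force a rebuild, or call `ea.clear_cache()`.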
Length Collapsing
When a function parameter named n, len, length, count, size, or num appears immediately after a pointer parameter and has an integer type, the Python binding automatically fills it from the array's length. You do not need to pass it explicitly.
// Ea source
export func scale(data: *mut f32, n: i32, factor: f32) { ... }
# Python: n is auto-filled from len(data)
k.scale(data, factor=2.0)
Output Allocation
Parameters annotated with out and a [cap: ...] clause are automatically allocated by the Python binding and returned as the function's result.
// Ea source
export func transform(input: *f32, n: i32, out result: *mut f32 [cap: n]) { ... }
# Python: result is allocated and returned
output = k.transform(input_array)
If [count: path] is also specified, the returned array is trimmed to the actual output length.
Thread Safety
Loaded kernel modules and their functions are safe for concurrent use from multiple threads. The compiled code is stateless -- all memory is caller-provided via arguments.
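A common pattern that relies on this guarantee is splitting one large array across a thread pool, with each thread calling the kernel on a disjoint slice. The sketch below uses a plain Python function as a stand-in for a loaded kernel method (with a real module you would call `k.scale` from `ea.load("scale.ea")` instead):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a loaded Ea kernel: stateless, all memory
# is caller-provided, so concurrent calls on disjoint slices are safe.
def scale(src, dst, factor):
    dst[:] = src * factor

src = np.arange(1_000_000, dtype=np.float32)
dst = np.empty_like(src)

chunks = 4
bounds = np.linspace(0, len(src), chunks + 1, dtype=int)
with ThreadPoolExecutor(max_workers=chunks) as pool:
    # Each worker writes only its own slice of dst.
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        pool.submit(scale, src[lo:hi], dst[lo:hi], 2.0)

print(np.allclose(dst, src * 2.0))  # True
```

The safety argument is the one stated above: the kernel keeps no internal state, so the only thing you must guarantee yourself is that threads do not write overlapping output regions.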
Binding Annotations
Ea generates language bindings from compiled kernel metadata. This page documents the annotation syntax, the .ea.json metadata format, and the available binding generators.
Output Annotations
Mark a parameter as caller-allocated output with the out keyword and a capacity clause:
export func filter(
    input: *f32,
    n: i32,
    out result: *mut f32 [cap: n, count: out_count],
    out out_count: *mut i32 [cap: 1],
) { ... }
Syntax
out name: *mut T [cap: <expr>]
out name: *mut T [cap: <expr>, count: <path>]
| Clause | Required | Description |
|---|---|---|
| `cap` | Yes | Number of elements to allocate. Can reference other parameters (e.g. `n`, `width * height`) |
| `count` | No | Parameter or expression giving the actual output length. The binding trims the returned array to this length |
In generated bindings, out parameters are not part of the function signature. They are allocated internally and returned as the function's result.
Length Collapsing
When a pointer parameter is followed immediately by an integer parameter whose name matches one of the recognized patterns, the binding generators automatically fill the integer from the array's length.
Recognized Names
n, len, length, count, size, num
Rules
- The integer parameter must appear immediately after a pointer parameter
- The integer parameter must have an integer type (`i32`, `i64`, `u32`, etc.)
- The pointer and integer are collapsed into a single array argument in the binding
// These two parameters collapse into one array argument:
export func sum(data: *f32, n: i32) -> f32 { ... }
# Python: just pass the array, n is filled automatically
result = k.sum(my_array)
Metadata Format (.ea.json)
Compiling with --lib produces a .ea.json file alongside the shared library. This file describes the exported API and is consumed by ea bind.
{
  "library": "libkernel.so",
  "exports": [
    {
      "name": "scale",
      "args": [
        {"name": "data", "type": "*mut f32"},
        {"name": "n", "type": "i32"},
        {"name": "factor", "type": "f32"}
      ],
      "return_type": null
    }
  ],
  "structs": [
    {
      "name": "Point",
      "fields": [
        {"name": "x", "type": "f32"},
        {"name": "y", "type": "f32"}
      ]
    }
  ]
}
Output-annotated parameters include additional fields:
{
  "name": "result",
  "type": "*mut f32",
  "output": true,
  "cap": "n",
  "count": "out_count"
}
Binding Generators
Generate bindings with ea bind <file.ea> and one or more language flags. The .ea.json file must exist (run ea <file.ea> --lib first).
| Flag | Output File | Description |
|---|---|---|
| `--python` | `<name>.py` | Python wrapper using ctypes. Arrays as NumPy ndarrays |
| `--rust` | `<name>.rs` | Rust `extern "C"` declarations with safe wrapper functions |
| `--cpp` | `<name>.hpp` | C++ header-only bindings with `std::span` parameters |
| `--pytorch` | `<name>_torch.py` | PyTorch custom op with `torch.Tensor` parameters |
| `--cmake` | `CMakeLists.txt` | CMake project for linking the shared library in C++ |
All generated bindings use C ABI function pointers loaded from the shared library at runtime.
ARM / NEON
Ea supports AArch64 with NEON vector instructions. This page documents the differences from x86 and how to write portable kernels.
Vector Width
ARM NEON provides 128-bit vector registers. Only 128-bit vector types are available:
| Supported | Not Supported |
|---|---|
| `f32x4` | `f32x8`, `f32x16` |
| `f64x2` | `f64x4`, `f64x8` |
| `i32x4` | `i32x8`, `i32x16` |
| `i16x8` | `i16x16` |
| `i8x16` | `i8x32` |
| `u8x16` | `u8x32` |
| `u16x8` | `u16x16` |
Using a 256-bit or 512-bit vector type on an ARM target produces a compile error. This is intentional -- Ea does not silently fall back to scalar code.
Unavailable Intrinsics
The following intrinsics are x86-only and produce a compile error on ARM:
| Intrinsic | Reason |
|---|---|
| `movemask(v)` | No ARM equivalent for extracting lane sign bits to a bitmask |
| `gather(ptr, indices)` | No hardware gather support in NEON |
| `scatter(ptr, indices, values)` | AVX-512 only |
| `maddubs_i16(a, b)` | x86 `PMADDUBSW` instruction, no NEON equivalent |
| `maddubs_i32(a, b)` | x86 specific |
All other intrinsics (loads, stores, math, reductions, splat, select, shuffle, conversions) work on ARM.
Cross-Compilation
Compile for ARM from an x86 host:
ea kernel.ea --lib --target-triple=aarch64-unknown-linux-gnu
When compiling natively on an ARM machine, no special flags are needed:
ea kernel.ea --lib
The --avx512 flag is rejected on ARM targets with a compile error.
Writing Portable Kernels
Strategy 1: Use 128-bit types everywhere
Use f32x4, i32x4, etc. These work on both x86 (SSE) and ARM (NEON).
kernel scale(data: *mut f32, factor: f32) range(n) step(4) {
    let v: f32x4 = load(data, i);
    let f: f32x4 = splat(factor);
    store(data, i, v .* f);
}
This sacrifices throughput on x86 (which could use f32x8 with AVX2) but runs everywhere.
Strategy 2: Separate kernel files
Write platform-specific kernels in separate files and load the right one at runtime:
# kernel_x86.ea -- uses f32x8 for AVX2
# kernel_arm.ea -- uses f32x4 for NEON
import platform
if platform.machine() == "aarch64":
    k = ea.load("kernel_arm.ea")
else:
    k = ea.load("kernel_x86.ea")
Both files export the same function signatures, so the calling code does not change.
Strategy 3: Use the kernel construct
The kernel construct with step(N) lets you pick the vector width per file while keeping the loop logic identical. Write two .ea files with different step sizes and vector types, but the same exported function name and parameters.
Ea vs NumPy
When does writing a kernel in Ea actually beat NumPy? The answer comes down to one thing: arithmetic intensity -- how many operations you perform per byte loaded from memory.
The memory bandwidth wall
Modern CPUs can process arithmetic far faster than they can load data from DRAM. A typical dual-channel DDR4 system delivers ~30-40 GB/s. NumPy's ufuncs are already compiled C with SIMD -- for simple operations, they saturate this bandwidth just like hand-written SIMD would.
Rule of thumb: if your operation does fewer than 2 arithmetic ops per element loaded, it is bandwidth-bound. Ea will match NumPy but not beat it.
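Bandwidth sets a hard floor on runtime that no amount of SIMD can break. A back-of-envelope model (illustrative numbers only; measure on your own hardware):

```python
# Lower bound on runtime for a streaming kernel: bytes moved / bandwidth.
GB = 1e9
bandwidth = 35 * GB            # assumed DRAM bandwidth, from the text above

n = 100_000_000                # 100M float32 elements
bytes_per_pass = n * 4 * 2     # each pass reads src and writes dst

one_pass = bytes_per_pass / bandwidth        # fused single-pass kernel
three_pass = 3 * bytes_per_pass / bandwidth  # three chained NumPy ufuncs

print(f"{one_pass * 1e3:.0f} ms vs {three_pass * 1e3:.0f} ms")
```

The model says nothing about arithmetic: below ~2 ops per element, the floor is the whole story, and a hand-written kernel can only tie NumPy, not beat it.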
Bandwidth-bound: Ea matches, no win
Array scaling
export func scale(src: *f32, dst: *mut f32, factor: f32, n: i32) {
    let f: f32x8 = splat(factor)
    let mut i: i32 = 0
    while i < n {
        let v: f32x8 = load(src, i)
        store(dst, i, v .* f)
        i = i + 8
    }
}
dst = src * factor # NumPy: one SIMD multiply per element, same speed
One multiply per element loaded. Both Ea and NumPy hit ~35 GB/s on typical hardware. No winner.
Simple element-wise ops
Any operation that loads an element, does one thing, and stores the result is bandwidth-bound:
- `dst[i] = src[i] + offset`
- `dst[i] = abs(src[i])`
- `dst[i] = src_a[i] + src_b[i]`
NumPy already handles these at memory bandwidth speed. Writing an Ea kernel gains you nothing.
Compute-bound: Ea wins
Fused scale + bias + clamp
NumPy must make three separate passes over memory:
dst = np.clip(src * scale + bias, 0.0, 1.0) # 3 temporaries, 3 passes
Each pass loads the full array, computes one operation, and writes a temporary. For a 100 MB array, that is 600 MB of memory traffic.
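NumPy's `out=` arguments can eliminate the temporaries, but not the passes -- each ufunc call still traverses the whole array. A quick sketch:

```python
import numpy as np

src = np.random.rand(1_000_000).astype(np.float32)
dst = np.empty_like(src)
scale, bias = 2.0, -0.5

# No temporary allocations, but still three full memory traversals:
np.multiply(src, scale, out=dst)
np.add(dst, bias, out=dst)
np.clip(dst, 0.0, 1.0, out=dst)

# Same result as the one-liner with temporaries:
assert np.allclose(dst, np.clip(src * scale + bias, 0.0, 1.0))
```

Fusing the three operations into one traversal is exactly what NumPy cannot express and a compiled kernel can.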
Ea fuses everything into a single pass:
export func fused_scale_bias_clamp(src: *f32, dst: *mut f32, scale: f32, bias: f32, n: i32) {
    let s: f32x8 = splat(scale)
    let b: f32x8 = splat(bias)
    let zero: f32x8 = splat(0.0)
    let one: f32x8 = splat(1.0)
    let mut i: i32 = 0
    while i < n {
        let v: f32x8 = load(src, i)
        let result: f32x8 = fma(v, s, b)
        let clamped: f32x8 = min(max(result, zero), one)
        store(dst, i, clamped)
        i = i + 8
    }
}
One load, one FMA, two comparisons, one store. The data stays in registers the entire time. This is 3-5x faster than the NumPy version on large arrays because it reads memory once instead of three times.
Stencil operations
Convolutions, Sobel filters, and any operation that reads multiple neighboring elements per output pixel have high arithmetic intensity. A 3x3 Sobel kernel reads 9 values and performs 9 multiplications plus 8 additions per output -- well above the compute-bound threshold.
See Image Processing for stencil patterns.
Custom reductions with branching
NumPy cannot express per-element branching in vectorized form. Operations like "accumulate values, but skip negatives and double values above a threshold" require Python-level loops or awkward np.where chains.
Ea gives you SIMD comparisons and masked operations in a single loop body, keeping the pipeline full.
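To make the awkwardness concrete, here is one such rule -- "skip negatives, double values above a threshold, sum the rest" -- expressed as a nested `np.where` chain (the values and threshold are made up for illustration):

```python
import numpy as np

x = np.array([-1.0, 0.5, 3.0, -2.0, 5.0], dtype=np.float32)
threshold = 2.0

# Per-element branching via nested where: skip negatives (contribute 0),
# double anything above the threshold, keep the rest as-is.
contrib = np.where(x < 0, 0.0, np.where(x > threshold, 2.0 * x, x))
total = contrib.sum()
print(total)  # 0.5 + 2*3.0 + 2*5.0 = 16.5
```

Each `np.where` materializes a full-size array and makes its own pass; in an Ea kernel the same logic is two vector comparisons and a `select` inside one loop.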
Dot products and FMA chains
Any reduction that multiplies and accumulates benefits from FMA (fused multiply-add) and register-level accumulation:
let acc: f32x8 = fma(a, b, acc) // a * b + acc in one instruction
NumPy's np.dot is fast for large matrices (it calls BLAS), but for custom reductions, per-row operations, or non-standard accumulation patterns, Ea's explicit FMA beats NumPy's element-wise approach.
See ML Preprocessing for dot product and similarity patterns.
Decision checklist
Before writing an Ea kernel, ask:
- How many ops per element? If just one (scale, offset, abs), stay with NumPy.
- Does NumPy need multiple passes? If your expression chains 3+ operations, Ea's fusion wins.
- Is there a stencil or neighbor access? High arithmetic intensity -- Ea wins.
- Is there branching logic per element? NumPy cannot vectorize this -- Ea wins.
- Is it a standard BLAS operation? Use NumPy/SciPy -- they call optimized libraries.
Real-world packages
These packages demonstrate Ea beating NumPy on compute-bound workloads:
- easobel -- Sobel edge detection (stencil, ~9 ops/pixel)
- eastat -- CSV parsing (branching SIMD scan)
- eavec -- Vector similarity search (FMA-heavy dot products)
Image Processing
Image processing is one of Ea's strongest use cases. Stencil operations (convolution, edge detection, blur) read multiple neighbors per output pixel, giving high arithmetic intensity that keeps the CPU compute-bound rather than memory-bound.
Sobel edge detection
The Sobel filter computes horizontal and vertical gradients using 3x3 stencils. Each output pixel reads 9 input values and performs 9 multiplications plus additions -- well above the threshold where Ea beats NumPy.
The pattern: for each pixel, load the 3x3 neighborhood, multiply by Sobel coefficients, and sum:
export func sobel_x(
    src: *f32, dst: *mut f32,
    width: i32, height: i32
) {
    let neg1: f32x8 = splat(-1.0)
    let pos1: f32x8 = splat(1.0)
    let neg2: f32x8 = splat(-2.0)
    let pos2: f32x8 = splat(2.0)
    let mut y: i32 = 1
    while y < height - 1 {
        let mut x: i32 = 0
        while x < width - 2 {
            let row_above: i32 = (y - 1) * width + x
            let row_center: i32 = y * width + x
            let row_below: i32 = (y + 1) * width + x
            let tl: f32x8 = load(src, row_above)
            let tr: f32x8 = load(src, row_above + 2)
            let ml: f32x8 = load(src, row_center)
            let mr: f32x8 = load(src, row_center + 2)
            let bl: f32x8 = load(src, row_below)
            let br: f32x8 = load(src, row_below + 2)
            let gx: f32x8 = (tr .* pos1) .+ (mr .* pos2) .+ (br .* pos1)
                .+ (tl .* neg1) .+ (ml .* neg2) .+ (bl .* neg1)
            store(dst, row_center + 1, gx)
            x = x + 8
        }
        y = y + 1
    }
}
For a production-ready Sobel implementation with both Gx/Gy gradients and magnitude computation, see easobel.
Pixel pipeline: u8 to f32 and back
Images from disk arrive as u8 (0-255). SIMD math works best on f32. The typical pattern:
- Load u8 pixels
- Widen to f32 (0.0 to 255.0)
- Process in f32 (normalize, filter, blend)
- Narrow back to u8
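As a reference for the kernels below, the same four steps in plain NumPy -- each step is a separate pass over the image, which is exactly what the Ea version fuses away. The brightness factor is an arbitrary example:

```python
import numpy as np

raw = np.array([0, 64, 128, 255], dtype=np.uint8)  # pixels as loaded

f = raw.astype(np.float32) / 255.0        # widen u8 -> f32 and normalize
f = np.clip(f * 1.2, 0.0, 1.0)            # process in f32 (e.g. brighten)
out = np.clip(f * 255.0, 0.0, 255.0).astype(np.uint8)  # clamp, narrow to u8

print(out)  # brightened pixels, back in u8
```

Note the explicit `np.clip` before the cast back to `uint8`: NumPy's `astype` wraps on overflow rather than saturating, so the clamp plays the role of the saturating narrow in the Ea kernel below.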
Widening: u8 to f32
export func normalize_u8_to_f32(src: *u8, dst: *mut f32, n: i32) {
    let scale: f32x4 = splat(0.00392156862)
    let mut i: i32 = 0
    while i < n {
        let pixels: f32x4 = widen_u8_f32x4(src, i)
        let normalized: f32x4 = pixels .* scale
        store(dst, i, normalized)
        i = i + 4
    }
}
widen_u8_f32x4(ptr, offset) loads 4 bytes from src + offset, zero-extends each to 32 bits, and converts to float. The result is a f32x4 with values in 0.0 to 255.0. Multiply by 1/255 to get the 0.0-1.0 range.
Narrowing: f32 to u8
export func f32_to_u8(src: *f32, dst: *mut u8, n: i32) {
    let s255: f32x4 = splat(255.0)
    let zero: f32x4 = splat(0.0)
    let mut i: i32 = 0
    while i < n {
        let v: f32x4 = load(src, i)
        let clamped: f32x4 = min(max(v, zero), s255)
        narrow_f32x4_i8(dst, i, clamped)
        i = i + 4
    }
}
narrow_f32x4_i8(ptr, offset, vec) converts 4 floats to integers, saturates to 0-255, and stores 4 bytes. Always clamp before narrowing to avoid overflow.
Putting it together
A full pixel pipeline (load u8, process in f32, store u8) processes 4 pixels per iteration on both x86 and ARM. Use f32x4 for the pipeline to keep it portable -- f32x8 works only on x86 with AVX2.
Why image stencils are compute-bound
A 3x3 convolution on a single-channel image performs 9 multiplications and 8 additions per output pixel, but only produces 1 output value. That is 17 arithmetic operations per output float -- far above the ~2 ops/element threshold where Ea's operation fusion matters.
For multi-channel images (RGB, RGBA), the arithmetic intensity is even higher because you process 3-4 channels per pixel position.
Compare this to simple brightness adjustment (pixel * 1.1), which is 1 op per element -- bandwidth-bound, and NumPy handles it just as fast. See Ea vs NumPy for more on this distinction.
Tips
- Border handling: the examples above skip border pixels (starting at y=1, ending at height-1). For production kernels, handle borders separately with scalar code or clamped indexing.
- Separable filters: Gaussian blur and similar filters can be split into horizontal and vertical passes, reducing a 3x3 stencil from 9 to 6 operations. Each pass is still compute-bound.
- ARM portability: use `f32x4` and `i32x4` for kernels that need to run on both x86 and ARM. The 128-bit types work on both architectures.
Text Processing
Text processing benefits from SIMD when you can skip large regions of uninteresting bytes. The key pattern is chunk-skip: load 32 bytes at a time, check if any byte matches a target character, and skip the entire chunk if none match.
The chunk-skip pattern
Most bytes in a text file are not the character you are looking for. A newline scanner over a 1 MB file might find 10,000 newlines among 1,000,000 bytes -- 99% of chunks contain no match and can be skipped with a single comparison and branch.
export func count_newlines(data: *u8, n: i32) -> i32 {
    let newline: u8x32 = splat_u8x32(10)
    let mut count: i32 = 0
    let mut i: i32 = 0
    while i < n - 31 {
        let chunk: u8x32 = load_u8x32(data, i)
        let matches: u8x32 = chunk .== newline
        let mask: i32 = movemask(matches)
        if mask != 0 {
            let mut j: i32 = 0
            while j < 32 {
                if load_u8(data, i + j) == 10 {
                    count = count + 1
                }
                j = j + 1
            }
        }
        i = i + 32
    }
    let mut k: i32 = i
    while k < n {
        if load_u8(data, k) == 10 {
            count = count + 1
        }
        k = k + 1
    }
    count
}
The structure:
- Load 32 bytes as `u8x32`
- Compare with `splat_u8x32(target)` using `.==` to get a match vector
- `movemask()` collapses the vector comparison to a single `i32` bitmask
- If mask is 0: no matches in this chunk, skip ahead 32 bytes (fast path)
- If mask is nonzero: scan the 32 bytes individually (slow path, but rare)
- Scalar tail: handle remaining bytes that do not fill a full 32-byte chunk
The fast path processes 32 bytes in about 3 instructions. On typical text, 90-99% of chunks take the fast path.
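When porting or testing a kernel like this, a NumPy one-liner makes a handy correctness reference (it is also the best pure-NumPy version -- one full pass, no chunk skipping):

```python
import numpy as np

# Reference implementation for validating count_newlines.
text = b"alpha\nbeta\ngamma\n" * 1000
data = np.frombuffer(text, dtype=np.uint8)

reference = int((data == 10).sum())  # 10 == ord("\n")
print(reference)  # 3000
assert reference == text.count(b"\n")
```

The NumPy version always reads every byte and materializes a full-size boolean array; the chunk-skip kernel wins by touching most 32-byte windows only once, with a single compare-and-branch.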
Why not extract individual bit positions?
Ea does not have shift operators (<<, >>), so you cannot extract individual bit positions from the movemask result. Instead, when a chunk contains matches, fall back to byte-by-byte scanning within that 32-byte window. This is still fast because:
- Hot chunks are rare (most text is not your target character)
- The 32-byte scan is a tight loop that fits in L1 cache
- The overall speedup comes from skipping cold chunks, not from optimizing hot ones
ARM portability
movemask is an x86-only intrinsic (it maps to vpmovmskb). On ARM/NEON, there is no equivalent single instruction. For portable kernels, write separate x86 and ARM versions:
- x86: use `u8x32` + `movemask` as shown above
- ARM: use `u8x16` with a scalar fallback, or structure the algorithm to avoid needing a bitmask
See ARM / NEON for details on architecture-specific intrinsics.
CSV parsing
CSV parsing combines the chunk-skip pattern with state tracking. You need to find commas and newlines while respecting quoted fields -- a comma inside quotes is not a delimiter.
The approach:
- Scan for quote characters (`"`) using chunk-skip to track quote state
- Scan for delimiters (`,` and `\n`) using chunk-skip, filtering by quote state
- Parse numeric fields from the delimited regions
This is compute-bound because each byte potentially involves comparison against multiple target characters plus state logic -- exactly the kind of branching work that NumPy cannot vectorize.
For a complete CSV statistics package built on this pattern, see eastat.
Tips
- Buffer alignment: SIMD loads from unaligned addresses work but may be slower on older hardware. If you control the buffer, align to 32 bytes.
- Tail handling: always include a scalar loop for the last `n % 32` bytes. Forgetting the tail is the most common bug in SIMD text processing.
- Multi-character search: to find any of several characters (e.g., `<`, `>`, `&` for HTML), do multiple `.==` comparisons and combine with `.|` before the movemask.
ML Preprocessing
ML preprocessing pipelines often apply the same sequence of operations to every element in large arrays: normalize, scale, compute similarities. These are natural fits for Ea kernels because they fuse multiple operations into single-pass loops.
Normalize: zero-mean, unit-variance
The standard normalization (x - mean) / std requires two operations per element. NumPy computes them as separate passes:
normalized = (data - mean) / std # 2 passes, 2 temporaries
Ea fuses this into one pass using FMA. Rewrite (x - mean) / std as x * (1/std) + (-mean/std):
export func normalize(
    src: *f32, dst: *mut f32,
    inv_std: f32, neg_mean_div_std: f32,
    n: i32
) {
    let s: f32x8 = splat(inv_std)
    let b: f32x8 = splat(neg_mean_div_std)
    let mut i: i32 = 0
    while i < n {
        let v: f32x8 = load(src, i)
        let result: f32x8 = fma(v, s, b)
        store(dst, i, result)
        i = i + 8
    }
}
The caller precomputes inv_std = 1.0 / std and neg_mean_div_std = -mean / std in Python. The kernel then does a single FMA per 8 elements -- one instruction that multiplies and adds simultaneously.
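The rewrite is pure algebra -- `(x - mean) / std == x * (1/std) + (-mean/std)` -- and is easy to sanity-check from the Python side before trusting the kernel:

```python
import numpy as np

x = np.random.randn(1024).astype(np.float32)
mean, std = float(x.mean()), float(x.std())

# Precomputed on the Python side, as described above:
inv_std = 1.0 / std
neg_mean_div_std = -mean / std

direct = (x - mean) / std                 # textbook two-op form
fused = x * inv_std + neg_mean_div_std    # what the kernel's FMA computes

assert np.allclose(direct, fused, atol=1e-5)
```

The two forms differ only in float rounding (one division per array versus one precomputed reciprocal), which is why the tolerance is loose rather than exact.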
Dot product: dual-accumulator pattern
A naive dot product uses one accumulator:
let mut acc: f32x8 = splat(0.0)
while i < n {
    let a: f32x8 = load(x, i)
    let b: f32x8 = load(y, i)
    acc = fma(a, b, acc)
    i = i + 8
}
This leaves performance on the table. FMA has a latency of 4-5 cycles but a throughput of 1 per cycle. With one accumulator, each FMA waits for the previous one to finish.
Dual accumulators hide this latency by interleaving independent FMA chains:
export func dot_product(x: *f32, y: *f32, out: *mut f32, n: i32) {
    let mut acc0: f32x8 = splat(0.0)
    let mut acc1: f32x8 = splat(0.0)
    let mut i: i32 = 0
    while i < n - 15 {
        let a0: f32x8 = load(x, i)
        let b0: f32x8 = load(y, i)
        acc0 = fma(a0, b0, acc0)
        let a1: f32x8 = load(x, i + 8)
        let b1: f32x8 = load(y, i + 8)
        acc1 = fma(a1, b1, acc1)
        i = i + 16
    }
    while i < n {
        let a: f32x8 = load(x, i)
        let b: f32x8 = load(y, i)
        acc0 = fma(a, b, acc0)
        i = i + 8
    }
    let combined: f32x8 = acc0 .+ acc1
    let result: f32 = reduce_add(combined)
    store(out, 0, splat(result))
}
The CPU can execute fma(a0, b0, acc0) and fma(a1, b1, acc1) in parallel because they use different accumulators. This typically doubles throughput on modern CPUs.
Cosine similarity
Cosine similarity needs three reductions in one pass: dot(a, b), norm(a), and norm(b). Computing these separately means three passes over the data. Ea fuses all three:
export func cosine_similarity(a: *f32, b: *f32, out: *mut f32, n: i32) {
    let mut dot_acc: f32x8 = splat(0.0)
    let mut norm_a_acc: f32x8 = splat(0.0)
    let mut norm_b_acc: f32x8 = splat(0.0)
    let mut i: i32 = 0
    while i < n {
        let va: f32x8 = load(a, i)
        let vb: f32x8 = load(b, i)
        dot_acc = fma(va, vb, dot_acc)
        norm_a_acc = fma(va, va, norm_a_acc)
        norm_b_acc = fma(vb, vb, norm_b_acc)
        i = i + 8
    }
    let dot_val: f32 = reduce_add(dot_acc)
    let norm_a_val: f32 = reduce_add(norm_a_acc)
    let norm_b_val: f32 = reduce_add(norm_b_acc)
    let result: f32 = dot_val / sqrt(norm_a_val * norm_b_val)
    store(out, 0, splat(result))
}
Three FMAs per loop iteration, all operating on the same loaded vectors. The data passes through cache once. NumPy would need np.dot(a, b), np.linalg.norm(a), and np.linalg.norm(b) -- three separate passes.
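The three-pass NumPy version is worth keeping around anyway, as a correctness reference for the fused kernel:

```python
import numpy as np

a = np.random.randn(1024).astype(np.float32)
b = np.random.randn(1024).astype(np.float32)

# Three separate passes over the data:
dot = float(np.dot(a, b))
norm_a = float(np.linalg.norm(a))
norm_b = float(np.linalg.norm(b))
cosine = dot / (norm_a * norm_b)

# Same quantity via the fused kernel's formula, dot / sqrt(|a|^2 * |b|^2):
fused_form = dot / np.sqrt(float(np.dot(a, a)) * float(np.dot(b, b)))
assert np.isclose(cosine, fused_form)
```

The two formulas agree up to float rounding; the fused kernel simply computes all three accumulations in the single loop shown above.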
Batch operations
For ML workloads, you often apply the same operation to many rows. The Python side handles the loop over rows, calling the Ea kernel for each:
import ea
lib = ea.load("similarity.ea")
for i in range(num_queries):
    for j in range(num_database):
        lib.cosine_similarity(query[i], database[j], result, dim)
The kernel handles the inner loop (over vector dimensions) with SIMD. The outer loop (over queries or rows) stays in Python. This is the right split -- the inner loop is where SIMD matters.
Real-world package
eavec implements vector similarity search using the dual-accumulator FMA pattern. It computes cosine similarity across a database of vectors, returning the top-k most similar results.
Beating NumPy's BLAS at Constant-Q Transform with 80 Lines of Ea
A SIMD kernel compiler, a DFT nobody asked for, and an honest benchmark.
The Problem
Audio spectrum visualizers typically use FFT. FFT gives you linearly-spaced frequency bins — great for math, terrible for music. Human pitch perception is logarithmic: the distance from C2 to C3 (one octave) matters as much as C5 to C6. But FFT allocates equal resolution across the entire spectrum, wasting detail in the bass and over-resolving the treble.
The Constant-Q Transform (CQT) solves this. Each frequency bin gets its own window length — long windows for low frequencies (more cycles needed to resolve pitch), short windows for high frequencies. The result: 84 bins spanning 7 octaves, one per semitone, C2 through B8.
The catch: FFT can't do CQT. You need a DFT-per-bin, which is O(n*k) instead of O(n log n). The standard NumPy approach is to build a complex kernel matrix and let BLAS handle it with a single matrix-vector multiply.
We wanted to see if a hand-written SIMD kernel in Ea could beat that.
What is Ea
Ea is a compute kernel compiler. You write .ea files in a C-like language with explicit SIMD vector types, compile them to native shared libraries, and call them from Python with NumPy arrays. No C toolchain, no Cython, no JIT warmup.
export func scale(src: *f32, dst: *mut f32, factor: f32, n: i32) {
let s: f32x8 = splat(factor)
let mut i: i32 = 0
while i < n {
let v: f32x8 = load(src, i)
store(dst, i, v .* s)
i = i + 8
}
}
Ea doesn't have sin or cos intrinsics. This is a deliberate design choice — trig is a policy decision (how many polynomial terms? what accuracy? what range?) that the language refuses to hide behind a simple-looking function call. If you need trig, you precompute tables in Python or write an explicit FMA polynomial chain.
This turns out to be the key insight for CQT.
The Design
A CQT with 84 bins (7 octaves, 12 semitones each) starting at C2 (65.4 Hz) at 44.1 kHz sample rate has these properties:
| Bin | Note | Frequency | Window Length |
|---|---|---|---|
| 0 | C2 | 65.4 Hz | 11,339 samples |
| 33 | A4 | 440 Hz | 1,684 samples |
| 83 | B8 | 7,903 Hz | 94 samples |
The quality factor Q = 16.8 (constant across all bins — that's what "constant-Q" means). Total work per frame: 200,476 FMA operations across all bins.
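These numbers fall out of the bin geometry directly. A quick sketch of the arithmetic (the exact F_MIN value and Q formula are assumptions consistent with the table above):

```python
import numpy as np

SAMPLE_RATE = 44100
F_MIN = 65.406                 # C2
N_BINS = 84                    # 7 octaves x 12 semitones
Q = 1 / (2 ** (1 / 12) - 1)    # ~16.8, constant across all bins

freqs = F_MIN * 2 ** (np.arange(N_BINS) / 12)        # one bin per semitone
lengths = np.ceil(Q * SAMPLE_RATE / freqs).astype(np.int32)

print(round(Q, 1))         # 16.8
print(int(lengths[0]))     # ~11,340 samples for C2
print(int(lengths[-1]))    # ~94 samples for B8
print(int(lengths.sum()))  # ~200k elements touched per frame, per component
```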
The approach: precompute cos/sin twiddle factor tables in Python (one-time cost, 1.6 MB), then let Ea handle the per-frame FMA loop. We bake the Hann window directly into the twiddle factors during precomputation:
# Python: precompute once
for k in range(n_bins):
n_k = int(lengths[k])
i = np.arange(n_k, dtype=np.float64)
window = 0.5 * (1 - np.cos(2 * np.pi * i / n_k))
angle = 2 * np.pi * freqs[k] * i / SAMPLE_RATE
cos_parts.append((window * np.cos(angle)).astype(np.float32))
sin_parts.append((window * -np.sin(angle)).astype(np.float32))
This means the Ea kernel needs zero trig and zero windowing logic. The inner loop is pure FMA.
The Kernel
Version 1: Direct DFT with Dual Accumulators
The first kernel processes one frequency bin at a time. For each bin, it reads the audio segment, multiplies against precomputed cos and sin twiddle factors, accumulates, and computes the magnitude. Smoothing (exponential decay for the "falling bars" effect) is fused into the output — no separate kernel call, no intermediate array.
Two independent FMA accumulator chains (r0/r1, i0/i1) hide pipeline latency on superscalar CPUs. This is the same technique used in high-performance reduction kernels.
export func cqt_fused(
audio: *f32,
cos_table: *f32,
sin_table: *f32,
offsets: *i32, // per-bin start offset into twiddle tables
lengths: *i32, // per-bin window length
prev_mags: *f32,
out_mags: *mut f32,
alpha: f32, // decay factor
n_bins: i32,
max_window_len: i32
) {
let mut k: i32 = 0
while k < n_bins {
let off: i32 = offsets[k]
let win_len: i32 = lengths[k]
let audio_start: i32 = max_window_len - win_len
let mut r0: f32x8 = splat(0.0)
let mut r1: f32x8 = splat(0.0)
let mut i0: f32x8 = splat(0.0)
let mut i1: f32x8 = splat(0.0)
let mut i: i32 = 0
while i + 16 <= win_len {
prefetch(cos_table, off + i + 64)
prefetch(sin_table, off + i + 64)
let s0: f32x8 = load(audio, audio_start + i)
let s1: f32x8 = load(audio, audio_start + i + 8)
let c0: f32x8 = load(cos_table, off + i)
let c1: f32x8 = load(cos_table, off + i + 8)
let sn0: f32x8 = load(sin_table, off + i)
let sn1: f32x8 = load(sin_table, off + i + 8)
r0 = fma(s0, c0, r0)
r1 = fma(s1, c1, r1)
i0 = fma(s0, sn0, i0)
i1 = fma(s1, sn1, i1)
i = i + 16
}
// 8-element and scalar tails omitted for brevity
let real: f32 = reduce_add(r0 .+ r1)
let imag: f32 = reduce_add(i0 .+ i1)
let mag: f32 = sqrt(real * real + imag * imag)
// Fused smooth decay
let decayed: f32 = prev_mags[k] * alpha
if mag > decayed {
out_mags[k] = mag
} else {
out_mags[k] = decayed
}
k = k + 1
}
}
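A plain NumPy reference with the same semantics is useful for correctness checks (a sketch, assuming per-bin Python lists of windowed twiddle factors as built in the precomputation step):

```python
import numpy as np

def cqt_reference(audio, cos_parts, sin_parts, prev_mags, alpha):
    """NumPy reference for the fused kernel: per-bin dot products
    against precomputed twiddles, magnitude, fused decay smoothing."""
    max_len = max(len(c) for c in cos_parts)
    out = np.empty(len(cos_parts), dtype=np.float32)
    for k, (c, s) in enumerate(zip(cos_parts, sin_parts)):
        seg = audio[max_len - len(c):max_len]   # right-aligned window
        real = float(np.dot(seg, c))
        imag = float(np.dot(seg, s))
        mag = float(np.hypot(real, imag))
        out[k] = max(mag, prev_mags[k] * alpha)  # fused smoothing
    return out
```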
The generated assembly for the hot loop is clean — the Ea compiler folds twiddle loads directly into vfmadd231ps memory operands, so the audio loads (vmovups) are the only explicit loads:
.LBB0_4:
prefetcht0 (%rsi,%r12,4) ; prefetch cos_table
prefetcht0 (%rdx,%r12,4) ; prefetch sin_table
vmovups (%rdi,%r12,4), %ymm5 ; load audio[i..i+7]
vmovups (%rdi,%rbp,4), %ymm6 ; load audio[i+8..i+15]
vfmadd231ps (%rsi,%r12,4), %ymm5, %ymm3 ; r0 += audio * cos
vfmadd231ps (%rsi,%rbp,4), %ymm6, %ymm4 ; r1 += audio * cos
vfmadd231ps (%rdx,%r12,4), %ymm5, %ymm2 ; i0 += audio * sin
vfmadd231ps (%rdx,%rbp,4), %ymm6, %ymm1 ; i1 += audio * sin
Four FMAs per iteration, each operating on 8 floats. 64 floating-point multiply-adds per loop cycle.
First Benchmark: The Lie
Our first benchmark compared this against a Python for-loop CQT:
def numpy_cqt(audio, freqs, start):
    magnitudes = np.empty(N_BINS, dtype=np.float32)
    for k in range(N_BINS):
        n_k = int(np.ceil(Q * SAMPLE_RATE / freqs[k]))
        i = np.arange(n_k)
        segment = audio[start:start + n_k]
        window = 0.5 * (1 - np.cos(2 * np.pi * i / n_k))
        kernel = window * np.exp(-2j * np.pi * freqs[k] * i / SAMPLE_RATE)
        magnitudes[k] = np.abs(np.sum(segment * kernel))
    return magnitudes
Result: 153x faster. We almost shipped this number.
The problem: this "NumPy CQT" is a Python for-loop with per-bin allocations. Nobody would write production CQT this way. The honest NumPy approach is to precompute a (84, 11339) complex kernel matrix and do a single np.dot:
# Precompute once: (n_bins, max_window_len) complex64 matrix
kernel_matrix = np.zeros((N_BINS, max_window_len), dtype=np.complex64)
for k in range(N_BINS):
# ... fill in windowed complex exponentials, zero-pad shorter bins
# Per frame: single BLAS call
magnitudes = np.abs(kernel_matrix @ audio)
This calls directly into OpenBLAS/MKL — decades of hand-tuned assembly for matrix-vector multiply.
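Filled in, that precomputation might look like this (a sketch using the bin geometry from earlier; the constants are assumptions consistent with the article's setup):

```python
import numpy as np

SAMPLE_RATE = 44100
N_BINS = 84
Q = 1 / (2 ** (1 / 12) - 1)
freqs = 65.406 * 2 ** (np.arange(N_BINS) / 12)
lengths = np.ceil(Q * SAMPLE_RATE / freqs).astype(np.int64)
max_window_len = int(lengths[0])

# Precompute once: dense (n_bins, max_window_len) complex matrix.
kernel_matrix = np.zeros((N_BINS, max_window_len), dtype=np.complex64)
for k in range(N_BINS):
    n_k = int(lengths[k])
    i = np.arange(n_k)
    window = 0.5 * (1 - np.cos(2 * np.pi * i / n_k))
    # Right-align each bin so every window ends at the newest sample;
    # shorter bins leave zero-padding on the left.
    kernel_matrix[k, max_window_len - n_k:] = (
        window * np.exp(-2j * np.pi * freqs[k] * i / SAMPLE_RATE)
    ).astype(np.complex64)

# Per frame: a single BLAS matrix-vector multiply.
audio = np.random.randn(max_window_len).astype(np.float32)
magnitudes = np.abs(kernel_matrix @ audio)
```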
Honest Benchmark v1
Methodology:
- All competitors precompute everything they can (tables, windows, indices)
- Only per-frame work is timed: transform + magnitude + smoothing
- Same smoothing (max(current, previous * alpha)) applied to all
With N=500 iterations and inadequate warmup:
| | Time |
|---|---|
| Ea CQT | 0.098 ms |
| NumPy matmul | 0.085 ms |
NumPy was winning. The 153x headline collapsed to Ea being 1.1x slower.
Fixing the Benchmark
The N=500 result was noisy. WSL2 scheduling and insufficient warmup made the numbers unreliable. We moved to a min-of-trials methodology: 10 independent trials of 2,000 iterations each, reporting both min (best-case, no OS interference) and median (real-world scheduling).
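A min-of-trials harness of that shape might look like this (a sketch, not the article's actual benchmark script):

```python
import time
import numpy as np

def bench(fn, iters=2000, trials=10, warmup=200):
    """Min-of-trials timing: run `trials` independent batches of
    `iters` calls each, after a warmup, and report (min, median)
    in microseconds per call."""
    for _ in range(warmup):
        fn()
    per_call_us = []
    for _ in range(trials):
        t0 = time.perf_counter()
        for _ in range(iters):
            fn()
        per_call_us.append((time.perf_counter() - t0) / iters * 1e6)
    return min(per_call_us), float(np.median(per_call_us))
```

The min approximates best-case performance with no OS interference; the median shows what real-world scheduling does to it.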
With proper warmup (200+ iterations before timing):
| | Min | Median |
|---|---|---|
| Ea CQT | 50 us | 55 us |
| NumPy matmul | 95 us | 190 us |
Ea was already 1.9x faster at best, 3.5x at median. The initial "1.1x slower" was just noise.
Optimization Attempts
Quad Accumulators
Theory: 4 independent FMA chains (8 accumulator registers) should hide more pipeline latency than 2 chains.
Result: slower (100 us vs 83 us). With 8 YMM accumulator registers + 12 data registers for loads, we exceed x86-64's 16 YMM register limit. The compiler spills to stack memory, killing the very latency hiding we wanted.
Interleaved Loads
Theory: load cos, immediately FMA, then load sin and FMA. Give the memory subsystem more time to fetch by interleaving loads with compute.
Result: slower (191 us vs 53 us). The original pattern — batch all loads, then batch all FMAs — is better for out-of-order execution. The CPU's reorder buffer can schedule loads far ahead of their consumers when they're grouped together.
Merged Twiddle Table
Theory: interleave cos and sin data for each bin into a single contiguous memory region [cos_bin0 | sin_bin0 | cos_bin1 | ...] to halve cache line traffic.
Result: no change (53.8 us vs 53.1 us). The hardware prefetcher already handles two sequential streams efficiently. The second stream (sin_table) gets prefetched in parallel with the first (cos_table) because the access pattern is identical — sequential scan with the same stride.
In-Place Smoothing
Instead of writing to out_mags and copying back to prev_mags in Python, the kernel reads and writes the same buffer:
export func cqt_inplace(
audio: *f32,
cos_table: *f32,
sin_table: *f32,
offsets: *i32,
lengths: *i32,
mags: *mut f32, // read previous, write smoothed result
alpha: f32,
n_bins: i32,
max_window_len: i32
)
Result: ~5% faster (47.8 us vs 50.2 us). Modest win from eliminating a 336-byte numpy copy per frame plus one Python function call.
Final Numbers
Methodology: 10 trials of 2,000 iterations each, 200-iteration warmup per trial, min-of-trials reported. All precomputation excluded. All competitors include smoothing. NumPy uses pre-allocated output buffers (out= parameter) to avoid allocation overhead.
| | Min | Median |
|---|---|---|
| Ea CQT (fused SIMD) | ~50 us | ~55 us |
| NumPy CQT (BLAS matmul) | ~95 us | ~190 us |
| NumPy FFT (wrong result) | ~106 us | ~118 us |
- Ea vs BLAS: 1.9x faster (min), 3.5x faster (median)
- Ea vs FFT: 2.1x faster — and FFT gives the wrong answer
- Throughput: 7.8 GFLOP/s
Why Ea Wins
1. Ea reads 4.8x less data. The BLAS matmul uses a dense (84, 11339) complex64 matrix — 7.4 MB per frame, including all the zero-padded regions where shorter bins have no data. Ea's variable-length inner loops skip zeros entirely, touching only 200,476 real elements (1.6 MB).
2. Real vs complex arithmetic. BLAS operates on complex64 (pairs of float32). Every BLAS FMA processes a complex multiply-add — 4 real multiplies and 2 real adds per element. Ea works directly in float32, doing 2 real FMAs per element (one for the cos component, one for sin).
3. Fused pipeline.
Ea does window + transform + magnitude + smoothing in one function call. NumPy requires 4 separate calls: matmul, abs, maximum, copyto. Each call crosses the Python/C boundary, allocates or writes to buffers, and makes a separate pass over the output data.
4. Lower scheduling jitter. One ctypes call has less interrupt surface than four NumPy calls. This explains the median gap (3.5x) being larger than the min gap (1.9x) — Ea's single kernel call is less likely to be interrupted mid-computation.
Why Ea Doesn't Win More
The dual-accumulator inner loop is already close to optimal for AVX2. The generated assembly has 4 vfmadd231ps instructions per iteration with memory operands — the compiler is folding loads into FMAs. There's no instruction waste.
The bottleneck is memory bandwidth. At 50 us for 1.6 MB of twiddle table reads, we're pulling ~32 GB/s — respectable for L3 cache, but below the theoretical memory bandwidth. The per-bin overhead (loading offsets/lengths, setting up accumulators, horizontal reduction) fragments what could be one continuous stream into 84 separate sweeps.
BLAS doesn't have this problem. Its matmul is one uninterrupted sweep through a dense matrix, which modern CPUs and prefetchers are specifically optimized for. BLAS trades compute (touching zeros) for access pattern regularity — and on some runs, when the OS cooperates perfectly, that trade nearly pays off.
The Code
Three files. The full kernel (cqt.ea, 80 lines), the Python visualizer, and the benchmark. The kernel uses the language as intended: explicit types, explicit vector widths, explicit memory access. No hidden costs.
import ea
import numpy as np
kernel = ea.load("cqt.ea")
# Precompute tables once
freqs = F_MIN * 2 ** (np.arange(84) / 12)
lengths = np.ceil(Q * SAMPLE_RATE / freqs).astype(np.int32)
# ... build cos/sin twiddle tables with baked-in Hann window ...
# Per frame: one call, ~50 us
kernel.cqt_fused(
audio_buffer, cos_table, sin_table,
offsets, lengths, prev_mags, out_mags,
decay_alpha, n_bins, max_window_len,
)
The full source is on GitHub.
What We Learned
The first benchmark said 153x. That was a lie — comparing optimized SIMD against a Python for-loop. The second benchmark said 1.1x slower. That was noise — insufficient warmup on a noisy WSL2 scheduler. The real number is 1.9x faster, earned by reading less data, using simpler arithmetic, and fusing the pipeline.
Three optimization attempts failed (quad accumulators, interleaved loads, merged tables) and one gave a modest 5% win (in-place smoothing). The lesson: when the compiler is already generating clean assembly, micro-optimizations rarely help. The wins came from the algorithm design — variable-length windows, baked-in windowing, fused smoothing — not from squeezing the inner loop.
Ea didn't win by being a faster compiler. It won by making it easy to write the right algorithm. A kernel that skips zeros, fuses four operations, and makes one function call will beat a kernel that touches every element of a zero-padded matrix and makes four function calls — even when the latter is backed by BLAS.
How 30 Lines of Eä Beat NumPy by 6×
And why your framework is probably slower than a for-loop.
The Pitch
Here's a fused multiply-add: out[i] = a[i] * b[i] + c[i]. Sixteen million times.
NumPy does it in 46 milliseconds. Eä does it in 7 milliseconds. That's 6.6× faster.
Here's the kicker: the Eä kernel is 30 lines. The NumPy version is one line. And the one-liner loses because it's actually two lines pretending to be one.
The NumPy Version
out = a * b + c
Simple. Elegant. Two full scans of 64 MB of data.
NumPy computes a * b first, writes the 64 MB result to a temporary array, then reads it back to add c. That's 384 MB of memory traffic for 256 MB of live data (a, b, c, out): every element of the intermediate gets stored and then loaded again.
On a modern CPU at ~35 GB/s memory bandwidth, even a perfect single pass over those 256 MB has a floor of about 7 milliseconds. NumPy hits 46 ms because it also pays for the extra temporary traffic, Python dispatch overhead, array allocation, and — crucially — it can't fuse the multiply and add into a single instruction.
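Pre-allocating with out= removes the allocation but not the extra pass — a sketch:

```python
import numpy as np

n = 1_000_000  # small here; the article uses 16M elements
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
c = np.random.rand(n).astype(np.float32)
out = np.empty_like(a)

# Still two full scans of the data, even with no fresh allocations:
np.multiply(a, b, out=out)   # pass 1: out = a * b
np.add(out, c, out=out)      # pass 2: out = out + c
```

The multiply result still makes a round trip through memory before the add can touch it; only a fused kernel avoids that.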
The Eä Version
export func fma_f32x8(
a: *restrict f32,
b: *restrict f32,
c: *restrict f32,
out: *mut f32,
n: i32
) {
let mut acc0: f32x8 = splat(0.0)
let mut acc1: f32x8 = splat(0.0)
let mut i: i32 = 0
while i + 16 <= n {
let va0: f32x8 = load(a, i)
let vb0: f32x8 = load(b, i)
let vc0: f32x8 = load(c, i)
store(out, i, fma(va0, vb0, vc0))
let va1: f32x8 = load(a, i + 8)
let vb1: f32x8 = load(b, i + 8)
let vc1: f32x8 = load(c, i + 8)
store(out, i + 8, fma(va1, vb1, vc1))
i = i + 16
}
while i < n {
out[i] = a[i] * b[i] + c[i]
i = i + 1
}
}
One pass. Load a, b, c, fuse multiply-add in a single vfmadd instruction, store result. Each element is touched once. Total memory traffic: 256 MB (same 4 arrays), but only one scan instead of two.
The *restrict tells LLVM the arrays don't alias, enabling aggressive optimization. The f32x8 processes 8 floats per instruction on AVX2. The 2× unroll (processing 16 elements per iteration) hides memory latency.
That's it. No magic. Just not doing twice the work.
"But I'll Just Use Ray"
We benchmarked that too. For science.
| Method | Time | Throughput | vs NumPy |
|---|---|---|---|
| NumPy (multiply + add) | 46,000 µs | 5.6 GB/s | baseline |
| Eä (1 thread) | 6,900 µs | 37.0 GB/s | 6.6× |
| Eä (2 threads) | 6,500 µs | 39.1 GB/s | 7.0× |
| Dask (2 chunks) | 56,000 µs | 4.6 GB/s | 0.8× |
| Ray (2 workers) | 89,000 µs | 2.9 GB/s | 0.5× |
Ray is twice as slow as NumPy. For FMA. On two cores. On the same machine.
Why? Serialization. Ray pickles your 64 MB arrays, sends them to worker processes, unpickles them, runs NumPy (which does two passes), pickles the results, and sends them back. The actual compute takes 46 ms. The overhead takes another 43 ms.
Dask is better — it chunks lazily and uses NumPy under the hood — but it still can't fuse operations or control SIMD width. It's paying for an abstraction layer over code that's already fast enough.
The uncomfortable truth: for single-machine numerical work, a for-loop that touches memory once will beat a distributed framework that touches memory twice.
The Dot Product Story
We weren't always winning. Our first benchmark had Eä's dot product at 0.27× of BLAS. Embarrassing.
// The naive version — DO NOT ship this
export func dot_naive(a: *f32, b: *f32, n: i32) -> f32 {
let mut acc: f32 = 0.0
let mut i: i32 = 0
while i < n {
acc = acc + a[i] * b[i]
i = i + 1
}
return acc
}
Single scalar accumulator. Each iteration depends on the previous one. The CPU stalls waiting for the addition to complete before it can start the next multiply. Classic dependency chain bottleneck.
BLAS uses the trick every HPC programmer knows: multiple independent accumulators.
// The optimized version — ships in autoresearch/kernels/dot_product/
export func dot_f32x8(a: *restrict f32, b: *restrict f32, len: i32) -> f32 {
let mut acc0: f32x8 = splat(0.0)
let mut acc1: f32x8 = splat(0.0)
let mut i: i32 = 0
while i + 32 <= len {
acc0 = fma(load(a, i), load(b, i), acc0)
acc1 = fma(load(a, i + 8), load(b, i + 8), acc1)
acc0 = fma(load(a, i + 16), load(b, i + 16), acc0)
acc1 = fma(load(a, i + 24), load(b, i + 24), acc1)
i = i + 32
}
// ... tail handling ...
return reduce_add_fast(acc0 .+ acc1)
}
Two f32x8 accumulators (acc0, acc1) with 4× unroll. The CPU has 16 FMA operations in flight that don't depend on each other. The result:
| Method | Time | GB/s | vs BLAS |
|---|---|---|---|
| NumPy BLAS sdot | 3,535 µs | 35.9 | baseline |
| Eä naive (scalar) | 13,222 µs | 9.7 | 0.27× |
| Eä f32x4 (1 acc) | 4,474 µs | 28.6 | 0.79× |
| Eä f32x8 (dual acc, 4× unroll) | 3,500 µs | 36.6 | 1.01× |
From 0.27× to 1.01× — by changing the loop structure, not the algorithm. Both versions compute the same dot product. The fast one just asks the CPU to think about 32 elements instead of 1.
What You're Actually Seeing
This isn't "Eä is fast." Eä is a thin wrapper around LLVM. The FMA kernel compiles to about 12 assembly instructions in the inner loop. Eä's job is to make writing those 12 instructions feel like writing 30 lines of readable code instead of 200 lines of intrinsics.
The real insight is about NumPy's cost model:
| Operation | NumPy | Eä |
|---|---|---|
a * b + c | 2 passes over data | 1 pass (fused FMA) |
| Temporary arrays | 1 allocation (64 MB) | 0 allocations |
| SIMD width | "whatever the compiler picks" | explicit f32x8 |
| Memory round-trips | 2 (multiply result → RAM → add) | 0 |
NumPy is fast per operation. But real workloads aren't one operation — they're pipelines. And every np. call is a full scan of your data.
When Eä Doesn't Win
Bandwidth-bound operations with a single arithmetic op per element. Scaling an array:
out = src * 2.5
NumPy: 5,631 µs. Eä: 6,138 µs. NumPy wins (by 8%).
Both saturate memory bandwidth at ~22 GB/s. There's nothing to fuse. One multiply per element loaded. No amount of SIMD cleverness makes DRAM faster.
This is the honest dividing line: if your inner loop does fewer than ~4 operations per element, NumPy is fast enough. If it does more, you're leaving performance on the table.
Getting Started
pip install ea-compiler
import ea
kernel = ea.load("fma.ea")
import numpy as np
a = np.random.rand(16_000_000).astype(np.float32)
b = np.random.rand(16_000_000).astype(np.float32)
c = np.random.rand(16_000_000).astype(np.float32)
out = np.zeros(16_000_000, dtype=np.float32)
kernel.fma_f32x8(a, b, c, out) # 7 ms instead of 46 ms
No Cython. No Numba. No C compiler. No JIT warmup. Write a .ea file, call ea.load(), pass NumPy arrays.
The compiler handles SIMD, the binding handles types, and the hardware handles the rest.
All benchmarks measured on a 2-core machine. Your numbers will be different, probably better (more cores = more bandwidth). The ratios hold. Source code and methodology on GitHub.
Eä is open-source (Apache 2.0). The compiler is ~12,000 lines of Rust with 475+ tests. Documentation. GitHub.
I Wrote a SIMD Compiler in 12K Lines of Rust
And then an LLM optimized the kernels better than I could.
The Elevator Pitch
Eä is a compiler for SIMD kernels. You write a small .ea file, run one command, and call it from Python like a normal function. But it runs at native vectorized speed.
import ea
kernel = ea.load("fma.ea")
result = kernel.fma_f32x8(a, b, c, out) # 6.6× faster than NumPy
No ctypes. No header files. No build system. The compiler generates the shared library and the Python wrapper. Also Rust, C++, PyTorch, and CMake bindings. It targets x86-64 (AVX2/AVX-512) and AArch64 (NEON).
The whole compiler is 12,000 lines of Rust. 475 tests. One person. I'm not a compiler engineer. But I had a very specific problem, and it turns out that's enough.
Why
I had a problem that kept repeating. I'd write something in Python, profile it, find a hot loop, and think: "this needs to be fast." And I knew that fast meant C. Fast meant SIMD. I don't have deep experience with either, but I knew that's where the performance lives.
So I'd fumble through some C code (with an LLM helping me write it, which honestly made it worse because now the code worked but I didn't fully understand why), fight with ctypes, spend an afternoon debugging pointer arithmetic that the AI got right the first time but I broke while "cleaning up," and eventually get a 5× speedup. Then next week, different project, same dance.
I didn't mind the hard part. Figuring out what the kernel should do, thinking about memory access patterns, deciding on vector widths. That's the interesting problem. What I minded was the plumbing. The header files. The build system. The ctypes declarations. The dtype validation. All of it boilerplate, all of it error-prone, none of it the actual work.
So I thought: what if a compiler could handle the plumbing? You write the kernel in a simple language, something that looks like the pseudocode you'd sketch on a whiteboard, and the compiler handles everything else. Compile to a shared library. Auto-generate the Python wrapper. One command. No Makefile.
The "no glue code" part turned out to be the product. Not the SIMD. Not the compiler. The fact that you go from .ea file to working Python function in one command, with types checked, lengths inferred, and output buffers allocated. That's what makes people actually use it instead of just reading about it.
I didn't know how to build a compiler. But I had the idea, and I wanted to see if it would work.
The First Attempt (10K Lines of Pain)
I should tell you about the compiler I wrote before this one.
It also targeted LLVM. It also generated SIMD code. The codegen was 10,000 lines in a single file. The parser was hand-written but handled way too many features. There were no hard limits on file size, no style rules, no test discipline. I kept adding things (generics, a module system, type inference) because why not?
The codebase became unmaintainable in about three months. I couldn't change anything without breaking something else. The codegen file had functions that called functions that called functions eight levels deep, and half of them were handling edge cases for features nobody used.
I threw it away.
The lesson: a SIMD kernel compiler doesn't need generics. It doesn't need modules. It doesn't need type inference. It needs to compile load, store, fma, and splat correctly, generate clean bindings, and stay small enough that one person can hold it in their head.
The Hard Rules
For the second attempt, I wrote the rules before I wrote the code:
- No file exceeds 500 lines. Split before you hit the limit.
- Every feature proven by end-to-end test. If it's not tested, it doesn't exist.
- No fake functions. If hardware doesn't support an operation, the compiler errors. No silent fallbacks.
- No premature features. Don't build what isn't needed yet.
- Delete, don't comment. Dead code gets removed.
The 500-line rule was the hardest to follow and the most valuable. It forced me to split the type checker into 7 files, the codegen into 10 files, and the parser into 4 files. Each file does one thing. When I need to change how store is type-checked, I open intrinsics_memory.rs (309 lines) and the answer is right there. No grepping through 10K lines of spaghetti.
Was it frustrating? Constantly. I'd be in the middle of adding a feature, hit 480 lines, and have to stop and refactor before I could finish. But every time I did, the code got better. The refactor always revealed something. A responsibility that should have been split earlier. A function that was doing two things.
The "no premature features" rule was the other hard one. I kept wanting to add generics. Or a module system. Or traits. And every time, I'd ask myself: does this serve the goal of compiling SIMD kernels to callable shared libraries? The answer was always no. Eä is monomorphic by design. You write f32x8, you get f32x8. No hidden specialization, no surprise codegen, no combinatorial explosion of type instances.
It's not a limitation. It's the point.
The Architecture
.ea → Lexer (logos) → Parser → Desugar → Type Check → Codegen (LLVM 18) → .o / .so
→ .ea.json → ea bind
The most important insight: the desugarer, not the codegen, is the pass that matters most.
Eä has a kernel construct that looks like this:
export kernel vscale(data: *f32, out: *mut f32, factor: f32)
over i in n step 8
tail scalar { out[i] = data[i] * factor }
{
store(out, i, load(data, i) .* splat(factor))
}
The desugarer turns this into a plain function with a while-loop and a tail loop. After desugaring, there are no kernels in the AST. Just functions with loops. The type checker and codegen never see kernel. They only see func.
This means every downstream pass is simpler. The type checker doesn't need special kernel logic. The codegen doesn't need to handle iteration. The desugarer handles all of it: injecting the n parameter, generating the loop variable, building the main loop with step, building the tail loop with the chosen strategy.
The desugar pass is 340 lines. It eliminates an entire class of complexity from the remaining 11,660 lines.
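For the kernel above, the desugared result is, in effect, a plain function like this (a sketch of the equivalent form, not literal compiler output):

```
export func vscale(data: *f32, out: *mut f32, factor: f32, n: i32) {
    let mut i: i32 = 0
    while i + 8 <= n {
        store(out, i, load(data, i) .* splat(factor))
        i = i + 8
    }
    // tail scalar strategy from the kernel header
    while i < n {
        out[i] = data[i] * factor
        i = i + 1
    }
}
```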
The Binding Generators
This is the part people don't expect. The compiler generates .ea.json metadata describing each exported function's signature. Then ea bind reads the JSON and generates idiomatic wrappers:
ea bind kernel.ea --python --rust --cpp --pytorch --cmake
The Python generator does something clever: length collapsing. If your kernel takes (data: *f32, n: i32), the generated Python function takes just data and fills n from data.size automatically. Output parameters marked with out get auto-allocated. The generated code checks dtypes, casts pointers, and handles all the ctypes plumbing that used to take me an afternoon.
Five binding generators, each 200-460 lines. No serde. The JSON parser is hand-written (65 lines). The generated code is clean enough that you can read it, modify it, and learn from it.
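The effect of length collapsing is easiest to see as a sketch of what a generated wrapper does for the scale kernel (hypothetical shape — _StubLib stands in for the real shared library so the sketch is self-contained):

```python
import ctypes
import numpy as np

class _StubLib:
    """Stand-in for the compiled .so; applies the same semantics
    in NumPy so the wrapper shape below can be exercised."""
    def scale(self, src_p, dst_p, factor, n):
        src = np.ctypeslib.as_array(src_p, shape=(n.value,))
        dst = np.ctypeslib.as_array(dst_p, shape=(n.value,))
        dst[:] = src * factor.value

_lib = _StubLib()

def scale(src, dst, factor):
    # Generated-wrapper shape: validate dtypes, collapse n from src.size,
    # then do the ctypes plumbing the user never sees.
    assert src.dtype == np.float32 and dst.dtype == np.float32
    assert src.size == dst.size
    n = src.size
    _lib.scale(
        src.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
        dst.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
        ctypes.c_float(factor),
        ctypes.c_int32(n),
    )
```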
Error Messages (The Quiet Win)
I spent more time on error messages than on codegen.
The compiler has "did you mean?" suggestions with Levenshtein distance, dot-operator hints ("cannot use '+' on vectors, use '.+'"), let mut suggestions, type conversion hints, multi-character underlines showing the full expression span, and clear messages that never leak internal compiler state.
kernel.ea:5:12 error[type]: cannot use '+' on vectors. Use '.+' for element-wise vector operations
return a + b
^^^^^
Nobody notices good error messages. But everyone notices bad ones. The difference between a user who gives up and a user who fixes their code is often one helpful error message.
The Autoresearch System (The Fun Part)
This is where it gets interesting.
Inspired by Andrej Karpathy's autoresearch concept, I built an automated optimization loop: an LLM reads the kernel source, the benchmark results, and the history of what's been tried, then proposes a modified kernel. The system compiles it, benchmarks it across multiple data sizes (to catch cache-fitting illusions), verifies correctness against a C reference, and accepts or rejects the change. Then it iterates.
This is where I had the most fun building Eä.
The first time I ran it on the FMA kernel, it found a 10% improvement in 30 iterations. I thought the kernel was already as good as it could get. The LLM found that 12× unrolling with stream stores beat 4× unrolling with regular stores at DRAM scale. I wouldn't have tried that. It sounds like overkill.
Then I let it run on the matrix multiplication kernel. 56% improvement. It switched from ijk to ikj loop order with 8× k-unrolling. I've heard of loop tiling. I couldn't have told you when to apply it. The LLM didn't need to "know." It just tried it and the benchmark said yes.
The thing that surprised me most: you think you have an optimal kernel. You let the LLM iterate 5 times and it finds 20% improvement. Okay, fine, maybe it wasn't optimal. So you let it iterate 50 times on the already-improved kernel. And it still finds improvements. The search space for kernel optimization is bigger than your intuition.
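The loop itself can be sketched like this (the interface names — propose, compile_fn, verify, bench — are hypothetical stand-ins for the real system's components; propose is the LLM call):

```python
def autoresearch(source, propose, compile_fn, verify, bench, iters=30):
    """Accept/reject optimization loop: propose a variant, compile it,
    verify correctness, benchmark it, keep it only if it wins."""
    best_src = source
    best_score = bench(compile_fn(source))
    history = []
    for _ in range(iters):
        candidate = propose(best_src, history)
        lib = compile_fn(candidate)
        if lib is None or not verify(lib):
            history.append((candidate, None))   # rejected: broken kernel
            continue
        score = bench(lib)
        history.append((candidate, score))
        if score > best_score:                  # keep only measured wins
            best_src, best_score = candidate, score
    return best_src, best_score
```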
27 benchmark kernels, all scored on largest-size (real-world) data with GB/s bandwidth metrics. The system includes bottleneck classification that tells the LLM whether a kernel is DRAM-bound (don't bother with compute tricks), compute-bound (try wider SIMD, more accumulators), or mixed. The biggest wins:
| Kernel | Improvement | What Changed |
|---|---|---|
| Bitonic sort | 97% | Replaced O(n²) Shellsort with sorting network |
| Matmul | 56% | k×8 unroll, cache-friendly access |
| Conv2d 3×3 | 47% | 4× column unroll, prefetch, restrict |
| Edge detect | 41% | f32x4 to f32x8 upgrade |
The humbling part: I'm not an optimization expert. But it turns out you don't need to be one. You need a benchmark harness, a correctness check, and a system that's willing to try things you wouldn't think of.
The Numbers
Source: 12,000 lines of Rust
Tests: 475 end-to-end
Test method: compile Eä → link with C → run binary → compare stdout
CI: x86-64, AArch64 (native), Windows
LLVM backend: 18.1 via inkwell 0.8
Binding targets: Python, Rust, C++, PyTorch, CMake
Performance on a real workload (16M float32 elements):
FMA (fused multiply-add): 6.6× faster than NumPy 37.0 GB/s
Dot product: matches BLAS 36.6 GB/s
SAXPY: 2.1× faster than NumPy 35.2 GB/s
That's ~37 GB/s on a system with ~40 GB/s memory bandwidth. Near the hardware limit. There's not much room left, which means the code is doing roughly as little unnecessary work as possible.
What I'd Do Differently
Start with the binding generator. I built the compiler first and added bindings later. But the bindings are what make Eä useful. If I'd started by designing the ideal Python API and worked backward to the compiler, some early decisions would have been different.
Add ea inspect earlier. The instruction analysis tool that shows you vector/scalar instruction counts, FMA operations, load/store ratio, and performance hints. I added it late, but it would have caught optimization issues months earlier.
Write fewer features, sooner. Eä has kernel, foreach, for, while, structs, output annotations, conditional compilation, static assertions, and 30+ intrinsics. Most users need kernel, load, store, fma, and splat. I should have shipped a useful subset earlier and iterated based on real usage.
The Real Story
I'm not a compiler engineer. I don't have a CS degree. I'm the kind of person who has ideas and wants to see if they work.
What changed is the tooling. I built Eä with the help of AI models. Claude for the heavy lifting, my own judgment for the architecture and design decisions. The hard rules came from me (learned the painful way from the first attempt). The implementation speed came from having a capable coding assistant.
A year ago, this would have required a team. Now it's 12,000 lines and one person.
The interesting question isn't "can you build this?" anymore. It's what are you going to build next?
My advice: if you have an idea that feels too ambitious for one person, the calculus has changed. Try it. Set hard rules so the codebase stays manageable. Write tests for everything. And don't be afraid to throw away your first attempt.
I threw away 10K lines of bad compiler and started over. Best decision I made.
Eä is open-source under Apache 2.0. GitHub · Documentation · pip install ea-compiler