Make Python Fast in 60 Seconds

Eä is a compute kernel compiler. You write small, focused compute functions — Eä compiles them to native code and gives you Python functions that take NumPy arrays.

No C. No Cython. No Numba JIT warmup. Just fast code.

The pitch

pip install ea-compiler

Write a kernel (scale.ea):

export func scale(src: *f32, dst: *mut f32, factor: f32, n: i32) {
    let mut i: i32 = 0
    while i < n {
        dst[i] = src[i] * factor
        i = i + 1
    }
}

Call it from Python:

import ea
import numpy as np

kernel = ea.load("scale.ea")
src = np.random.randn(1_000_000).astype(np.float32)
dst = np.empty_like(src)
kernel.scale(src, dst, factor=2.0)

That's it. ea.load() compiles your kernel to a native shared library, caches it, and gives you a callable Python function. No Rust, no LLVM, no build step — just pip install and go.

What makes Eä different?

  • Explicit SIMD. You control the vector width. No auto-vectorization guessing.
  • No runtime. Compiles to plain .so/.dll files. Zero overhead beyond function call.
  • No magic. If your code looks like it processes 8 floats at a time, it does. No silent scalar fallback.
  • Instant bindings. Python, Rust, C++, PyTorch — generated from one source.

Next steps

Installation

pip install ea-compiler

This gives you the ea compiler and the Python ea.load() API. No other dependencies needed (besides NumPy).

Works on:

  • Linux x86_64
  • Linux aarch64 (ARM)
  • Windows x86_64

Verify installation

import ea
print(ea.__version__)           # e.g., "1.7.0"
print(ea.compiler_version())    # same, from the bundled binary

Building from source

For development or unsupported platforms, see the eacompute README for instructions on building the compiler from source. This requires Rust and LLVM 18.

Your First Kernel

Let's write a kernel that scales an array by a constant factor, then upgrade it to use SIMD.

Step 1: Scalar version

Create scale.ea:

export func scale(src: *f32, dst: *mut f32, factor: f32, n: i32) {
    let mut i: i32 = 0
    while i < n {
        dst[i] = src[i] * factor
        i = i + 1
    }
}

Key things to notice:

  • export makes the function callable from Python (C ABI)
  • *f32 is an immutable pointer to float32 — your input array
  • *mut f32 is a mutable pointer — your output array
  • n: i32 is the array length — the caller provides this
  • All types are explicit. No inference, no ambiguity.

Step 2: Call from Python

import ea
import numpy as np

kernel = ea.load("scale.ea")

src = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
dst = np.empty_like(src)
kernel.scale(src, dst, factor=2.0)
print(dst)  # [2. 4. 6. 8.]

ea.load() compiles scale.ea the first time, then caches the result in __eacache__/. Subsequent calls load from cache instantly.

Step 3: SIMD version

Now let's process 8 floats at a time:

export func scale_simd(src: *f32, dst: *mut f32, factor: f32, n: i32) {
    let s: f32x8 = splat(factor)
    let mut i: i32 = 0
    while i < n {
        let v: f32x8 = load(src, i)
        store(dst, i, v .* s)
        i = i + 8
    }
}

What changed:

  • f32x8 — a vector of 8 floats (256-bit, uses AVX2 on x86)
  • splat(factor) — broadcasts the scalar to all 8 lanes
  • load(src, i) — loads 8 consecutive floats starting at index i
  • v .* s — element-wise multiply (the dot prefix . means "vector operation")
  • store(dst, i, ...) — writes 8 floats back
  • We increment by 8, not 1

Important: This only works when n is a multiple of 8. For arbitrary lengths, see Kernels for tail-handling strategies.

Step 4: Compare performance

import ea
import numpy as np
import time

kernel = ea.load("scale.ea")
src = np.random.randn(10_000_000).astype(np.float32)
dst = np.empty_like(src)

# Eä scalar
start = time.perf_counter()
for _ in range(100):
    kernel.scale(src, dst, factor=2.0)
ea_scalar = (time.perf_counter() - start) / 100

# Eä SIMD
start = time.perf_counter()
for _ in range(100):
    kernel.scale_simd(src, dst, factor=2.0)
ea_simd = (time.perf_counter() - start) / 100

# NumPy
start = time.perf_counter()
for _ in range(100):
    np.multiply(src, 2.0, out=dst)
numpy_time = (time.perf_counter() - start) / 100

print(f"Eä scalar: {ea_scalar*1000:.2f} ms")
print(f"Eä SIMD:   {ea_simd*1000:.2f} ms")
print(f"NumPy:     {numpy_time*1000:.2f} ms")

For this simple operation, all three will be similar — it's bandwidth-bound (one operation per element loaded). Eä shines on compute-bound workloads where there are multiple operations per element. See the Cookbook for real-world comparisons.

Why Ea

What Ea is

Ea is a compute kernel compiler. You write small, focused numerical routines in Ea's explicit syntax, compile them to native shared libraries (.so on Linux, .dll on Windows), and call them from Python, Rust, C++, or PyTorch via C ABI. Ea is not a general-purpose programming language. It has no standard library, no garbage collector, no runtime. It compiles kernels.

The problem

Python is slow for tight numerical loops. When you need to run a custom algorithm over millions of elements -- a stencil, a reduction with custom logic, a chain of fused multiply-adds -- you hit a wall. The usual options are:

  • NumPy: fast for operations it supports, but custom multi-step algorithms require chaining calls that allocate intermediates and re-scan memory.
  • Numba: JIT-compiles Python, but debugging is difficult, compilation is unpredictable, and SIMD vectorization is implicit (you hope the compiler figures it out).
  • Cython: requires learning a hybrid language, managing build systems, and still gives you limited control over vectorization.
  • Writing C extensions: full control, but high friction. Manual Python/C bridging, header files, build scripts.

Ea targets the gap: you want native-speed kernels with explicit SIMD, but you do not want to maintain a C build system.

The philosophy

Ea is built on one principle: explicit over implicit.

  • If you write f32x8, you get 8-wide SIMD. The compiler will not silently fall back to scalar code.
  • If hardware does not support an operation (e.g., scatter without AVX-512), the compiler errors. It does not emit slow scalar code behind your back.
  • All types are explicit. No type inference. You see exactly what is happening.
  • All memory is caller-provided. Ea kernels never allocate. Pointers come from the host language.

There are no hidden performance cliffs. The code you write is the code that runs.

When to use Ea

Ea excels at compute-bound workloads where you do significant work per element loaded from memory:

  • Stencil operations: convolutions, blurs, edge detection -- each output element reads from multiple inputs.
  • Fused multiply-add chains: polynomial evaluation, IIR filters, dot products.
  • Custom reductions: computing statistics, finding patterns, accumulating results with non-trivial logic.
  • Particle simulations: N-body interactions, force calculations.
  • Image processing pipelines: per-pixel math with multiple operations fused into one pass.

The common thread: you load data, do many arithmetic operations on it, then store results. The CPU spends most of its time computing, not waiting for memory.

When NOT to use Ea

Ea is the wrong tool for:

  • Bandwidth-bound workloads: if you are just adding two arrays element-wise, NumPy already saturates memory bandwidth. Ea cannot make memory faster.
  • General programming: no strings, no file I/O, no networking, no data structures beyond structs.
  • Prototyping: write your algorithm in Python first, profile it, then port the hot loop to Ea.
  • GPU workloads: Ea targets CPUs (x86-64 with AVX2/AVX-512, AArch64 with NEON).

A good rule of thumb: if your inner loop does fewer than 4 arithmetic operations per element loaded, NumPy is probably fast enough.

The compilation model

kernel.ea  -->  ea --lib  -->  kernel.so + kernel.ea.json
                                   |
                          ea bind --python
                                   |
                              kernel.py  (generated wrapper)

One .ea file is one compilation unit. There are no imports, no modules. If you need to compose kernels, you do it at the C level -- each kernel is an independent shared library with a C ABI entry point.

The generated .ea.json metadata file describes the function signatures. The ea bind command reads it to generate idiomatic wrappers for your target language.

Quick taste

Here is a complete Ea kernel that scales an array of floats using 8-wide SIMD:

export kernel vscale(data: *f32, out: *mut f32, factor: f32)
    over i in n step 8
    tail scalar {
        out[i] = data[i] * factor
    }
{
    let vf: f32x8 = splat(factor)
    store(out, i, load(data, i) .* vf)
}

Compile and use from Python:

ea kernel.ea --lib
ea bind kernel.ea --python
import numpy as np
from kernel import vscale

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0], dtype=np.float32)
result = vscale(data, factor=3.0)
# result: [3.0, 6.0, 9.0, 12.0, 15.0]

The main body processes 8 elements at a time with SIMD. The tail scalar block handles any remainder elements one at a time. The n parameter (the array length) is automatically injected into the function signature from the over i in n clause.

Language Basics

This page covers Ea's scalar language features. For SIMD vector types and operations, see SIMD.

Scalar types

Ea has fixed-size numeric types. No type inference -- you always write the type explicitly.

TypeDescription
i8, i16, i32, i64Signed integers
u8, u16, u32, u64Unsigned integers
f32, f64Floating point
boolBoolean (true / false)

Variables

All variables must have an explicit type annotation. Variables are immutable by default.

let x: i32 = 5
let y: f32 = 3.14
let flag: bool = true

To make a variable mutable, use mut:

let mut counter: i32 = 0
counter = counter + 1

Constants

Compile-time constants use const:

const PI: f32 = 3.14159
const BATCH_SIZE: i32 = 256
const EPSILON: f64 = 1e-9

Constants can be used in static_assert for compile-time checks:

const STEP: i32 = 8
static_assert(STEP > 0, "step must be positive")

Arithmetic and comparison

Standard arithmetic operators work on all numeric types:

let a: i32 = 10 + 3    // 13
let b: i32 = 10 - 3    // 7
let c: i32 = 10 * 3    // 30
let d: i32 = 10 / 3    // 3 (integer division)
let e: i32 = 10 % 3    // 1 (remainder)

Comparison operators return bool:

let lt: bool = a < b
let gt: bool = a > b
let le: bool = a <= b
let ge: bool = a >= b
let eq: bool = a == b
let ne: bool = a != b

Logical operators

Ea uses words, not symbols, for logical operations:

let both: bool = a > 0 and b > 0
let either: bool = a > 0 or b > 0
let neither: bool = not (a > 0 or b > 0)

Control flow

if / else if / else

if x > 0 {
    println(1)
} else if x == 0 {
    println(0)
} else {
    println(-1)
}

while loops

let mut i: i32 = 0
while i < n {
    out[i] = data[i] * 2
    i = i + 1
}

for loops

Counted loops with an explicit step:

for i in 0..n step 1 {
    out[i] = data[i] * 2
}

The step is required. The range 0..n is half-open: it includes 0 but excludes n.

foreach loops

A simpler counted loop when the step is always 1:

foreach (i in 0..n) {
    out[i] = data[i] * 2
}

Loop unrolling

Wrap a loop in unroll(N) to unroll it at compile time:

unroll(4) {
    foreach (j in 0..4) {
        out[base + j] = data[base + j] * factor
    }
}

Functions

Functions are declared with func. All parameter and return types are explicit:

func square(x: f32) -> f32 {
    return x * x
}

func add(a: i32, b: i32) -> i32 {
    return a + b
}

Functions without a return type return nothing (void):

func fill(out: *mut i32, n: i32, val: i32) {
    foreach (i in 0..n) {
        out[i] = val
    }
}

Exported functions

To make a function callable from C/Python/Rust, prefix it with export:

export func dot_product(a: *f32, b: *f32, n: i32) -> f32 {
    let mut sum: f32 = 0.0
    foreach (i in 0..n) {
        sum = sum + a[i] * b[i]
    }
    return sum
}

Only export func (and export kernel) produce symbols visible from outside. Non-exported functions are internal helpers.

Pointers

Pointers are how kernels receive data from the host language. There are four pointer variants:

SyntaxMeaning
*TRead-only pointer
*mut TMutable pointer (can write through it)
*restrict TRead-only, no aliasing (enables optimizations)
*restrict mut TMutable, no aliasing

Pointer indexing

Read from a pointer with bracket indexing:

let val: f32 = data[i]    // data is *f32

Write through a mutable pointer:

out[i] = val              // out is *mut f32

Type casts

Explicit casts convert between numeric types:

let x: i32 = 42
let f: f32 = to_f32(x)       // 42.0
let d: f64 = to_f64(x)       // 42.0
let back: i32 = to_i32(f)    // 42
let wide: i64 = to_i64(x)    // 42

There are no implicit conversions. Mixing types without a cast is a compile error.

println

println is the only output primitive. It exists for debugging:

println(42)
println(3.14)
println(true)
println("hello")

It accepts integers, floats, bools, and string literals. It does not support format strings.

What does not exist

Ea is deliberately minimal. The following features do not exist and are not planned:

  • No generics or templates
  • No traits or interfaces
  • No modules or imports
  • No heap allocation
  • No strings (except literal arguments to println)
  • No semicolons (statements are newline-separated)
  • No closures or lambdas
  • No enums or pattern matching
  • No exceptions or error handling

One file, one compilation unit. Compose at the C level.

SIMD

SIMD (Single Instruction, Multiple Data) lets you process multiple values in a single CPU instruction. Ea gives you direct control over SIMD vectors -- what you write is what the CPU executes.

Vector types

A vector type holds a fixed number of elements of the same scalar type. The name encodes both the element type and the lane count:

TypeElementsBitsx86 requirementARM requirement
f32x44 x f32128SSENEON
i32x44 x i32128SSENEON
u8x1616 x u8128SSENEON
i8x1616 x i8128SSENEON
i16x88 x i16128SSENEON
f64x22 x f64128SSE2NEON
f32x88 x f32256AVX2not available
i32x88 x i32256AVX2not available
u8x3232 x u8256AVX2not available
f64x44 x f64256AVX2not available
f32x1616 x f32512AVX-512not available

If you use f32x8 on a machine without AVX2, or on ARM, the compiler will error. No silent scalar fallback.

To enable 512-bit vectors, compile with --avx512:

ea kernel.ea --lib --avx512

Dot operators

Element-wise vector operations use dot-prefixed operators. This distinguishes them from scalar operations and makes SIMD explicit in the source:

OperatorMeaning
.+Element-wise add
.-Element-wise subtract
.*Element-wise multiply
./Element-wise divide
.>Element-wise greater than (returns bool vector)
.<Element-wise less than
.>=Element-wise greater or equal
.<=Element-wise less or equal
.==Element-wise equal
.!=Element-wise not equal
.&Element-wise bitwise AND
.|Element-wise bitwise OR
.^Element-wise bitwise XOR
.<<Element-wise shift left
.>>Element-wise shift right (logical for unsigned, arithmetic for signed)

Example:

let a: f32x4 = load(data, i)
let b: f32x4 = load(data, i + 4)
let sum: f32x4 = a .+ b
let product: f32x4 = a .* b
let mask: f32x4 = a .> b

Creating vectors

splat

Broadcast a scalar value to all lanes:

let factor: f32 = 2.5
let vf: f32x8 = splat(factor)
// vf = [2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5]

Loading from memory

Load a vector from a pointer at an element offset:

let v: f32x8 = load(data, i)     // loads data[i..i+8]

The offset i is in elements, not bytes. load(data, 8) loads elements 8 through 15.

There are also typed load intrinsics when you want to be explicit about the element type:

let v: f32x8 = load_f32(data, i)
let v: f32x4 = load_f32x4(data, i)

Storing to memory

Write a vector to a mutable pointer:

store(out, i, result)     // writes result to out[i..i+8]

Element access

Read individual lanes from a vector by index:

let v: f32x4 = load(data, 0)
let first: f32 = v[0]
let second: f32 = v[1]

Conditional selection

select picks lanes from two vectors based on a mask:

let mask: f32x4 = a .> b
let result: f32x4 = select(mask, a, b)
// where a > b, take a; otherwise take b

This compiles to a single blend instruction. There is no branching.

Scalar vs SIMD comparison

Here is the same operation -- scaling an array -- written both ways.

Scalar (processes one element per iteration):

export func scale(data: *f32, out: *mut f32, factor: f32, n: i32) {
    foreach (i in 0..n) {
        out[i] = data[i] * factor
    }
}

SIMD (processes 8 elements per iteration):

export kernel vscale(data: *f32, out: *mut f32, factor: f32)
    over i in n step 8
    tail scalar {
        out[i] = data[i] * factor
    }
{
    let vf: f32x8 = splat(factor)
    store(out, i, load(data, i) .* vf)
}

The SIMD version does 8 multiplications in a single instruction. On a workload that is compute-bound (multiple operations per element), this translates to a proportional speedup.

Hardware targeting

By default, Ea targets AVX2 on x86-64 and NEON on AArch64. You can change this:

# Default (AVX2 on x86)
ea kernel.ea --lib

# Enable AVX-512
ea kernel.ea --lib --avx512

# Target a specific CPU
ea kernel.ea --lib --target=skylake

# Cross-compile for ARM
ea kernel.ea --lib --target-triple=aarch64-unknown-linux-gnu

The compiler rejects code that requires features the target does not have. If you use f32x8 and target ARM, you get a compile error, not slow code.

Kernels

The kernel construct is Ea's main abstraction for writing vectorized loops with automatic tail handling. It is syntactic sugar -- the compiler transforms it into a plain function with a while-loop before any further compilation.

Basic syntax

export kernel name(params)
    over i in n step S
    tail strategy { tail_body }
{
    main_body
}
  • name: the kernel's name, becomes the C ABI symbol
  • params: function parameters (pointers, scalars)
  • over i in n: the loop variable i iterates from 0 to n
  • step S: how many elements the main body processes per iteration
  • tail strategy: how to handle remainder elements when n is not a multiple of S
  • main_body: the code that runs for each full chunk of S elements

The range variable (here n) is automatically injected as an i32 parameter into the function signature. You do not declare it in the parameter list.

How it desugars

A kernel like this:

export kernel scale(data: *f32, out: *mut f32, factor: f32)
    over i in n step 4
    tail scalar { out[i] = data[i] * factor }
{
    out[i] = data[i] * factor
    out[i + 1] = data[i + 1] * factor
    out[i + 2] = data[i + 2] * factor
    out[i + 3] = data[i + 3] * factor
}

becomes equivalent to:

export func scale(data: *f32, out: *mut f32, factor: f32, n: i32) {
    let mut i: i32 = 0
    while i + 4 <= n {
        out[i] = data[i] * factor
        out[i + 1] = data[i + 1] * factor
        out[i + 2] = data[i + 2] * factor
        out[i + 3] = data[i + 3] * factor
        i = i + 4
    }
    // tail: process remainder one at a time
    while i < n {
        out[i] = data[i] * factor
        i = i + 1
    }
}

The generated C signature is void scale(float*, float*, float, int) -- the n parameter appears last.

Tail strategies

The tail handles remainder elements when n is not evenly divisible by the step.

tail scalar

Process remainder elements one at a time. The tail body runs in a loop with step 1:

export kernel vscale(data: *f32, out: *mut f32, factor: f32)
    over i in n step 8
    tail scalar {
        out[i] = data[i] * factor
    }
{
    let vf: f32x8 = splat(factor)
    store(out, i, load(data, i) .* vf)
}

The main body uses 8-wide SIMD. The tail body is scalar code that handles 0 to 7 leftover elements. This is the most common tail strategy.

tail mask

Use masked load/store for the remainder. The tail body runs once (not in a loop) and must handle all remaining elements using masked operations:

export kernel vscale(data: *f32, out: *mut f32, factor: f32)
    over i in len step 4
    tail mask {
        let rem: i32 = len - i
        let vf: f32x4 = splat(factor)
        let v: f32x4 = load_masked(data, i, rem)
        store_masked(out, i, v .* vf, rem)
    }
{
    let vf: f32x4 = splat(factor)
    store(out, i, load(data, i) .* vf)
}

The rem variable tells the masked operations how many elements are valid. This avoids the scalar loop entirely but requires masked intrinsics.

tail pad

The caller guarantees the input length is a multiple of the step. No tail body is generated:

export kernel fill(out: *mut i32, val: i32)
    over i in n step 4
    tail pad
{
    out[i] = val
    out[i + 1] = val
    out[i + 2] = val
    out[i + 3] = val
}

This produces the most efficient code but shifts the responsibility to the caller. If n is not a multiple of the step, the kernel will skip the remaining elements.

No tail clause

If you omit the tail clause entirely, the kernel only runs the main body. Remaining elements are not processed:

export kernel double_it(data: *i32, out: *mut i32)
    over i in n step 1
{
    out[i] = data[i] * 2
}

With step 1, there is no remainder, so omitting the tail is safe. With larger steps, you must ensure n is always a multiple of the step, or accept that trailing elements are skipped.

Complete example

A SIMD dot product kernel that handles any input length:

export kernel dot(a: *f32, b: *f32, out: *mut f32)
    over i in n step 8
    tail scalar {
        out[0] = out[0] + a[i] * b[i]
    }
{
    let va: f32x8 = load(a, i)
    let vb: f32x8 = load(b, i)
    let products: f32x8 = va .* vb
    let sum: f32 = reduce_add(products)
    out[0] = out[0] + sum
}

The main body loads 8-wide vectors, multiplies them, reduces to a scalar sum, and accumulates into out[0]. The scalar tail handles any remaining 0-7 elements.

Structs

Structs in Ea are plain data containers with C-compatible memory layout. They have no methods, no constructors, no impl blocks. They exist so you can pass structured data between Ea kernels and host languages.

Defining a struct

struct Particle {
    x: f32,
    y: f32,
    mass: f32,
}

Fields can be any scalar type. The memory layout matches C struct layout, so a Particle in Ea is identical to:

typedef struct { float x; float y; float mass; } Particle;

Creating and accessing

Inside Ea functions, you create a struct with literal syntax and access fields with dot notation:

func main() {
    let p: Particle = Particle { x: 1.0, y: 2.0, mass: 10.0 }
    println(p.x)      // 1
    println(p.mass)    // 10
}

Mutable structs support field assignment:

func main() {
    let mut p: Particle = Particle { x: 0.0, y: 0.0, mass: 1.0 }
    p.x = 3.5
    p.y = 7.0
    println(p.x)    // 3.5
}

Struct pointers

In exported functions, structs are typically passed via pointer from the host language:

struct Point {
    x: f32,
    y: f32,
}

export func get_x(p: *Point) -> f32 {
    return p.x
}

export func set_point(p: *mut Point, nx: f32, ny: f32) {
    p.x = nx
    p.y = ny
}

Read-only pointers (*Point) allow field reads. Mutable pointers (*mut Point) allow field writes.

Arrays of structs

Pointer indexing works with struct arrays. Each element is a full struct:

struct Vec2 {
    x: f32,
    y: f32,
}

export func sum_x(vecs: *Vec2, n: i32) -> f32 {
    let mut total: f32 = 0.0
    let mut i: i32 = 0
    while i < n {
        total = total + vecs[i].x
        i = i + 1
    }
    return total
}

From C, this is called with a pointer to a contiguous array of structs:

typedef struct { float x; float y; } Vec2;
extern float sum_x(const Vec2*, int);

Vec2 vecs[] = { {1.0f, 10.0f}, {2.0f, 20.0f}, {3.0f, 30.0f} };
float result = sum_x(vecs, 3);  // 6.0

Passing from Python

Since Ea structs match C layout, you can use NumPy structured arrays or ctypes:

import numpy as np

particle_dtype = np.dtype([
    ('x', np.float32),
    ('y', np.float32),
    ('mass', np.float32),
])

particles = np.zeros(1000, dtype=particle_dtype)

The generated Python bindings handle the pointer passing automatically.

Limitations

  • No methods or impl blocks. Structs are data only.
  • No nested structs.
  • No generics. Write separate struct definitions for each concrete type.
  • Struct fields must be scalar types.

Common Intrinsics

This page covers the most frequently used intrinsics. For the complete list, see the Intrinsics Reference.

splat

Broadcast a scalar to all lanes of a vector:

let factor: f32 = 2.5
let vf: f32x8 = splat(factor)
// vf = [2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5]

Works with all vector types. The return type is inferred from the variable's type annotation.

load / store

Load a vector from a pointer at an element offset. Store writes a vector back:

let v: f32x8 = load(data, i)       // read 8 floats starting at data[i]
store(out, i, v)                    // write 8 floats starting at out[i]

Offsets are in elements, not bytes.

Typed loads

When you want to be explicit about the element type:

let v: f32x8 = load_f32(data, i)
let v: f32x4 = load_f32x4(data, i)

These are equivalent to plain load but make the element type visible in the source.

stream_store

Non-temporal store that bypasses the CPU cache. Use for write-only output buffers where you will not read the data back soon:

let result: f32x8 = a .* b
stream_store(out, i, result)

This avoids polluting the cache with output data, leaving more cache space for inputs. Only beneficial for large output arrays that will not be re-read immediately.

fma

Fused multiply-add: computes a * b + c in a single instruction with a single rounding (more accurate than separate multiply and add):

// Scalar
let result: f32 = fma(a, b, c)

// Vector
let va: f32x8 = load(a, i)
let vb: f32x8 = load(b, i)
let vc: f32x8 = load(c, i)
let result: f32x8 = fma(va, vb, vc)

Works on f32, f64, and all float vector types. Maps to the hardware FMA instruction.

reduce_add

Sum all lanes of a vector down to a scalar:

let v: f32x8 = load(data, i)
let sum: f32 = reduce_add(v)

Works on all integer and float vector types. Useful for dot products, reductions, and histogram accumulation.

reduce_max / reduce_min

Find the maximum or minimum value across all lanes:

let v: f32x4 = load(data, i)
let biggest: f32 = reduce_max(v)
let smallest: f32 = reduce_min(v)

Works on integer and float vector types.

select

Per-lane conditional: where the mask is true, take from a; where false, take from b:

let mask: f32x8 = a .> b
let result: f32x8 = select(mask, a, b)
// element-wise max(a, b)

Compiles to a blend instruction. No branching. This is how you write branchless SIMD code.

sqrt / rsqrt

Square root and reciprocal square root (1/sqrt):

let x: f32 = 16.0
let root: f32 = sqrt(x)       // 4.0

let v: f32x8 = load(data, i)
let roots: f32x8 = sqrt(v)
let inv_roots: f32x8 = rsqrt(v)

rsqrt uses the fast hardware approximation. Works on f32, f64, and float vector types.

min / max

Element-wise minimum and maximum. Works on both scalars and vectors:

// Scalar
let smaller: f32 = min(a, b)
let larger: f32 = max(a, b)

// Vector
let va: f32x8 = load(data, i)
let vb: f32x8 = load(data, j)
let mins: f32x8 = min(va, vb)
let maxs: f32x8 = max(va, vb)

movemask

Extract comparison results to an integer bitmask. x86 only:

let mask: f32x8 = a .> b
let bits: i32 = movemask(mask)
// bit k is 1 if lane k of a > lane k of b

Useful for branching on SIMD comparison results or counting matching elements. Each bit in the result corresponds to one vector lane.

Masked memory operations

For tail handling, masked loads and stores read/write only the valid lanes:

let rem: i32 = n - i
let v: f32x4 = load_masked(data, i, rem)    // load only 'rem' elements
store_masked(out, i, result, rem)            // store only 'rem' elements

The rem parameter specifies how many elements (starting from lane 0) are valid. Lanes beyond rem are zero-filled on load and not written on store.

Summary table

IntrinsicInputOutputDescription
splat(s)scalarvectorBroadcast to all lanes
load(ptr, i)pointer, offsetvectorLoad vector from memory
store(ptr, i, v)pointer, offset, vectorvoidWrite vector to memory
stream_store(ptr, i, v)pointer, offset, vectorvoidNon-temporal write
fma(a, b, c)3 valuessame typea * b + c fused
reduce_add(v)vectorscalarSum all lanes
reduce_max(v)vectorscalarMax across lanes
reduce_min(v)vectorscalarMin across lanes
select(m, a, b)mask, 2 vectorsvectorPer-lane conditional
sqrt(x)scalar/vectorsame typeSquare root
rsqrt(x)scalar/vectorsame typeReciprocal square root
min(a, b)2 valuessame typeElement-wise minimum
max(a, b)2 valuessame typeElement-wise maximum
movemask(m)bool vectori32Extract lane mask to bits

Type System

Ea is statically typed with no implicit conversions. Every variable, parameter, and expression has a concrete type known at compile time.

Scalar Types

TypeSizeDescription
i81 byteSigned 8-bit integer
u81 byteUnsigned 8-bit integer
i162 bytesSigned 16-bit integer
u162 bytesUnsigned 16-bit integer
i324 bytesSigned 32-bit integer
u324 bytesUnsigned 32-bit integer
i648 bytesSigned 64-bit integer
u648 bytesUnsigned 64-bit integer
f324 bytes32-bit float (IEEE 754)
f648 bytes64-bit float (IEEE 754)
bool1 byteBoolean (true or false)

Integer literals default to i32. Float literals default to f32. Use explicit casts (to_f64(x), to_i64(x)) to convert between types.

Vector Types

Vector types hold multiple lanes of the same scalar type. Element-wise operations use dot-operators (.+, .-, .*, ./, etc.).

128-bit Vectors -- SSE (x86) / NEON (ARM)

TypeLanesElementSize
f32x44f3216 bytes
f64x22f6416 bytes
i32x44i3216 bytes
i16x88i1616 bytes
i8x1616i816 bytes
u8x1616u816 bytes
u16x88u1616 bytes

256-bit Vectors -- AVX2 (x86 only)

TypeLanesElementSize
f32x88f3232 bytes
f64x44f6432 bytes
i32x88i3232 bytes
i16x1616i1632 bytes
i8x3232i832 bytes
u8x3232u832 bytes
u16x1616u1632 bytes

These types produce a compile error on ARM targets.

512-bit Vectors -- AVX-512 (x86, --avx512 flag required)

TypeLanesElementSize
f32x1616f3264 bytes
f64x88f6464 bytes
i32x1616i3264 bytes

Using these types without --avx512 produces a compile error.

Pointer Types

Pointers represent caller-provided memory. Ea never allocates -- all memory comes from the host language.

SyntaxDescription
*TImmutable pointer to T
*mut TMutable pointer to T
*restrict TImmutable pointer, no-alias guarantee
*restrict mut TMutable pointer, no-alias guarantee

The restrict qualifier tells the compiler that the pointer does not alias other pointers, enabling stronger optimizations. Use it when you can guarantee non-overlapping memory.

Struct Types

User-defined value types declared with struct:

struct Pixel {
    r: f32,
    g: f32,
    b: f32,
    a: f32,
}

Structs are passed by value. Access fields with dot syntax: pixel.r. Structs can contain scalar types, vector types, and other structs.

Type Rules

  • No implicit conversions between types. Use to_f32(), to_i32(), etc.
  • No generics or polymorphism. Write separate functions for each type.
  • Vector dot-operators require both operands to have the same vector type.
  • Comparisons on vectors (.==, .<, .>) produce boolean vectors, not scalar bools.

All Intrinsics

Complete reference for every built-in function in Ea.

Memory

load

Load a vector from a pointer at a byte offset. The return type is inferred from context.

let v: f32x8 = load(ptr, i);

Typed Scalar Loads

Load a single scalar value from a pointer at a byte offset.

IntrinsicReturn Type
load_f32(ptr, i)f32
load_f64(ptr, i)f64
load_i32(ptr, i)i32
load_i16(ptr, i)i16
load_i8(ptr, i)i8
load_u8(ptr, i)u8
load_u16(ptr, i)u16
load_u32(ptr, i)u32
load_u64(ptr, i)u64
let x: f32 = load_f32(ptr, i);

Typed Vector Loads

Load a full vector from a pointer at a byte offset.

IntrinsicReturn Type
load_f32x4(ptr, i)f32x4
load_f32x8(ptr, i)f32x8
load_f32x16(ptr, i)f32x16
load_i32x4(ptr, i)i32x4
load_i32x8(ptr, i)i32x8
load_i16x8(ptr, i)i16x8
load_i8x16(ptr, i)i8x16
load_u8x16(ptr, i)u8x16
load_u8x32(ptr, i)u8x32
let v: f32x8 = load_f32x8(data, i * 32);

store

Write a vector to a pointer at a byte offset.

store(out, i, result);

stream_store

Non-temporal store that bypasses the CPU cache. Use for write-only output that will not be read back soon.

stream_store(out, i, result);

load_masked

Masked vector load. Lanes where the mask is false are not loaded.

let v: f32x8 = load_masked(ptr, i, mask);

store_masked

Masked vector store. Only lanes where the mask is true are written.

store_masked(out, i, value, mask);

gather

Load elements from scattered memory addresses using an index vector. x86 only -- not available on ARM.

let v: f32x8 = gather(ptr, indices);

scatter

Store elements to scattered memory addresses using an index vector. AVX-512 only (--avx512 flag required).

scatter(ptr, indices, values);

prefetch

Issue a prefetch hint to bring data into cache.

prefetch(ptr, i);

Math

sqrt

Square root. Works on scalar f32/f64 and all float vector types.

let y: f32 = sqrt(x);
let v: f32x8 = sqrt(vec);

rsqrt

Reciprocal square root (approximate). Scalar f32 and f32 vector types.

let y: f32 = rsqrt(x);
let v: f32x8 = rsqrt(vec);

exp

Exponential function. Float types.

let y: f32 = exp(x);

fma

Fused multiply-add: computes a * b + c in a single operation with one rounding step. Works on scalar f32 and all float vector types.

let y: f32 = fma(a, b, c);
let v: f32x8 = fma(va, vb, vc);

min

Element-wise minimum. Works on scalar (i32, f32, f64) and vector types.

let m: i32 = min(a, b);
let v: f32x8 = min(va, vb);

max

Element-wise maximum. Works on scalar (i32, f32, f64) and vector types.

let m: i32 = max(a, b);
let v: f32x8 = max(va, vb);

Reduction

Reduce a vector to a single scalar value.

reduce_add

Sum all lanes.

let sum: f32 = reduce_add(v);   // f32x8 -> f32
let sum: i32 = reduce_add(iv);  // i32x8 -> i32

reduce_max

Maximum across all lanes.

let m: f32 = reduce_max(v);

reduce_min

Minimum across all lanes.

let m: f32 = reduce_min(v);

reduce_add_fast

Unordered float reduction. Faster than reduce_add but does not guarantee summation order, so results may differ slightly due to floating-point rounding. Float vectors only.

let sum: f32 = reduce_add_fast(v);

Vector

splat

Broadcast a scalar value to all lanes of a vector. The vector type is inferred from context.

let v: f32x8 = splat(1.0);

shuffle

Reorder lanes of a vector according to an index tuple. The indices are compile-time constants.

let reversed: f32x4 = shuffle(v, (3, 2, 1, 0));

select

Per-lane conditional select. Where the mask is true, take from a; where false, take from b.

let result: f32x8 = select(mask, a, b);

movemask

Extract a comparison result bitmask from a boolean vector to a scalar i32. Each bit corresponds to the sign bit of one lane. x86 only -- not available on ARM.

let bits: i32 = movemask(cmp_result);

Conversion

Scalar Casts

IntrinsicDescription
to_f32(x)Convert to f32
to_f64(x)Convert to f64
to_i32(x)Convert to i32
to_i64(x)Convert to i64
let f: f32 = to_f32(i);
let n: i32 = to_i32(x);

Widening Conversions

Widen narrow integer lanes to wider float or integer lanes. Only the first N lanes of the input are consumed.

IntrinsicInputOutput
widen_i8_f32x4(v)i8x16f32x4
widen_u8_f32x4(v)u8x16f32x4
widen_i8_f32x8(v)i8x16f32x8
widen_u8_f32x8(v)u8x16f32x8
widen_u8_i32x4(v)u8x16i32x4
widen_u8_i32x8(v)u8x16i32x8
let pixels: f32x8 = widen_u8_f32x8(raw_bytes);

Narrowing Conversions

Convert wider lanes to narrower lanes, with clamping and rounding.

IntrinsicInputOutput
narrow_f32x4_i8(v)f32x4i8 (4 bytes)
let packed = narrow_f32x4_i8(float_pixels);

Multiply-Add Byte Pairs

Multiply unsigned bytes by signed bytes and add adjacent pairs. x86 only.

IntrinsicSignatureDescription
maddubs_i16(a, b)(u8x16, i8x16) -> i16x8Multiply and add adjacent pairs to 16-bit
maddubs_i32(a, b)(u8x16, i8x16) -> i32x4Multiply and add adjacent quads to 32-bit
let products: i16x8 = maddubs_i16(unsigned_bytes, signed_weights);

Debug

println

Print a value to stdout. Accepts scalars (i32, i64, u8, u16, u32, u64, f32, f64, bool), string literals, and vector types. Lowers to C printf. No format strings.

println(42);
println(3.14);
println("hello");
println(my_vector);

CLI Reference

The ea binary provides three commands: compile (default), bind, and inspect.

Compile

ea <file.ea> [flags]

Compile an Ea source file to a native object file (.o) by default.

Flags

FlagEffect
-o <name>Link the object file into an executable via cc
--libProduce a shared library (.so/.dll) and metadata (.ea.json)
--opt-level=NOptimization level 0--3 (default: 3)
--avx512Enable AVX-512 vector types and intrinsics. Errors on ARM targets
--target=CPULLVM CPU name, e.g. skylake, znver3, native (default: native)
--target-triple=TCross-compile to a different architecture, e.g. aarch64-unknown-linux-gnu
--emit-llvmWrite LLVM IR to a .ll file and print it to stdout
--emit-asmWrite assembly to a .s file
--headerGenerate a C header (.h) for the exported functions
--emit-astPrint the parsed AST. Does not require LLVM
--emit-tokensPrint the token stream. Does not require LLVM
--help / -hPrint usage information
--version / -VPrint compiler version

Examples

# Compile to object file
ea kernel.ea

# Compile and link to executable
ea kernel.ea -o kernel

# Build shared library for Python/Rust/C++ consumption
ea kernel.ea --lib

# Cross-compile for ARM
ea kernel.ea --lib --target-triple=aarch64-unknown-linux-gnu

# Emit LLVM IR for debugging
ea kernel.ea --emit-llvm

# Compile with AVX-512 support
ea kernel.ea --lib --avx512

# Generate C header
ea kernel.ea --header

Compiler status output goes to stderr, so --emit-llvm stdout is clean for piping.

Bind

ea bind <file.ea> --python [--rust] [--cpp] [--pytorch] [--cmake]

Generate language bindings from a compiled kernel. At least one language flag is required.

Requires the .ea.json metadata file produced by ea <file.ea> --lib. Run the --lib compile first.

Language Flags

FlagOutputDescription
--python<name>.pyPython wrapper using ctypes
--rust<name>.rsRust FFI bindings
--cpp<name>.hppC++ header with inline wrappers
--pytorch<name>_torch.pyPyTorch custom op wrapper
--cmakeCMakeLists.txtCMake build file for C++ integration

Example

# Full workflow: compile, then generate Python bindings
ea kernel.ea --lib
ea bind kernel.ea --python

Inspect

ea inspect <file.ea> [target flags]

Post-optimization analysis of the compiled kernel. Shows instruction mix, loop structure, vector width usage, and register pressure. Accepts the same target flags as compile (--target, --target-triple, --avx512).

Example

ea inspect kernel.ea
ea inspect kernel.ea --avx512
ea inspect kernel.ea --target-triple=aarch64-unknown-linux-gnu
ea --print-target

Print the resolved native CPU name for the current machine. Useful for understanding which CPU features the compiler will target by default.

Python API

The ea Python package (pip install ea-compiler) provides a high-level interface for compiling and loading Ea kernels.

Functions

ea.load

ea.load(path, *, target="native", opt_level=3, avx512=False) -> KernelModule

Compile an .ea file to a shared library and load it. Returns a KernelModule object with each exported function available as a method.

ParameterTypeDefaultDescription
pathstrrequiredPath to the .ea source file
targetstr"native"LLVM CPU name (e.g. "skylake", "native")
opt_levelint3Optimization level 0--3
avx512boolFalseEnable AVX-512 types and intrinsics
import ea
import numpy as np

k = ea.load("kernel.ea")
a = np.array([1.0, 2.0, 3.0], dtype=np.float32)
b = np.zeros(3, dtype=np.float32)
k.my_func(a, b, len(a))

ea.compile

ea.compile(path, *, emit_asm=False, emit_llvm=False, target="native", opt_level=3, avx512=False, lib=True) -> Path

Compile an .ea file without loading it. Returns the path to the output file. Useful when you need the .so or .ea.json for another tool.

ParameterTypeDefaultDescription
pathstrrequiredPath to the .ea source file
emit_asmboolFalseAlso write a .s assembly file
emit_llvmboolFalseAlso write a .ll LLVM IR file
targetstr"native"LLVM CPU name
opt_levelint3Optimization level 0--3
avx512boolFalseEnable AVX-512
libboolTrueProduce .so + .ea.json

ea.clear_cache

ea.clear_cache(path=None)

Clear the compilation cache. If path is given, clear only the cache for that .ea file. If None, clear the entire cache directory.

ea.compiler_version

ea.compiler_version() -> str

Return the version string of the ea compiler binary.

ea.__version__

ea.__version__ -> str

The version of the ea Python package.

Exceptions

ea.CompileError

Raised when compilation fails. Inherits from RuntimeError.

AttributeTypeDescription
stderrstrFull compiler error output
exit_codeintCompiler process exit code
try:
    k = ea.load("broken.ea")
except ea.CompileError as e:
    print(e.stderr)
    print(e.exit_code)

Caching

Compiled shared libraries are cached in a __eacache__/ directory next to the .ea source file. The cache key includes the CPU name and compiler version:

__eacache__/{cpu}-{version}/kernel.so

The cache is invalidated by file modification time (mtime). If the source file is newer than the cached library, ea.load() recompiles automatically.

Length Collapsing

When a function parameter named n, len, length, count, size, or num appears immediately after a pointer parameter and has an integer type, the Python binding automatically fills it from the array's length. You do not need to pass it explicitly.

// Ea source
export func scale(data: *mut f32, n: i32, factor: f32) { ... }
# Python: n is auto-filled from len(data)
k.scale(data, factor=2.0)

Output Allocation

Parameters annotated with out and a [cap: ...] clause are automatically allocated by the Python binding and returned as the function's result.

// Ea source
export func transform(input: *f32, n: i32, out result: *mut f32 [cap: n]) { ... }
# Python: result is allocated and returned
output = k.transform(input_array)

If [count: path] is also specified, the returned array is trimmed to the actual output length.

Thread Safety

Loaded kernel modules and their functions are safe for concurrent use from multiple threads. The compiled code is stateless -- all memory is caller-provided via arguments.

Binding Annotations

Ea generates language bindings from compiled kernel metadata. This page documents the annotation syntax, the .ea.json metadata format, and the available binding generators.

Output Annotations

Mark a parameter as caller-allocated output with the out keyword and a capacity clause:

export func filter(
    input: *f32,
    n: i32,
    out result: *mut f32 [cap: n, count: out_count],
    out out_count: *mut i32 [cap: 1],
) { ... }

Syntax

out name: *mut T [cap: <expr>]
out name: *mut T [cap: <expr>, count: <path>]
ClauseRequiredDescription
capYesNumber of elements to allocate. Can reference other parameters (e.g. n, width * height)
countNoParameter or expression giving the actual output length. The binding trims the returned array to this length

In generated bindings, out parameters are not part of the function signature. They are allocated internally and returned as the function's result.

Length Collapsing

When a pointer parameter is followed immediately by an integer parameter whose name matches one of the recognized patterns, the binding generators automatically fill the integer from the array's length.

Recognized Names

n, len, length, count, size, num

Rules

  1. The integer parameter must appear immediately after a pointer parameter
  2. The integer parameter must have an integer type (i32, i64, u32, etc.)
  3. The pointer and integer are collapsed into a single array argument in the binding
// These two parameters collapse into one array argument:
export func sum(data: *f32, n: i32) -> f32 { ... }
# Python: just pass the array, n is filled automatically
result = k.sum(my_array)

Metadata Format (.ea.json)

Compiling with --lib produces a .ea.json file alongside the shared library. This file describes the exported API and is consumed by ea bind.

{
  "library": "libkernel.so",
  "exports": [
    {
      "name": "scale",
      "args": [
        {"name": "data", "type": "*mut f32"},
        {"name": "n", "type": "i32"},
        {"name": "factor", "type": "f32"}
      ],
      "return_type": null
    }
  ],
  "structs": [
    {
      "name": "Point",
      "fields": [
        {"name": "x", "type": "f32"},
        {"name": "y", "type": "f32"}
      ]
    }
  ]
}

Output-annotated parameters include additional fields:

{
  "name": "result",
  "type": "*mut f32",
  "output": true,
  "cap": "n",
  "count": "out_count"
}

Binding Generators

Generate bindings with ea bind <file.ea> and one or more language flags. The .ea.json file must exist (run ea <file.ea> --lib first).

FlagOutput FileDescription
--python<name>.pyPython wrapper using ctypes. Arrays as NumPy ndarrays.
--rust<name>.rsRust extern "C" declarations with safe wrapper functions
--cpp<name>.hppC++ header-only bindings with std::span parameters
--pytorch<name>_torch.pyPyTorch custom op with torch.Tensor parameters
--cmakeCMakeLists.txtCMake project for linking the shared library in C++

All generated bindings use C ABI function pointers loaded from the shared library at runtime.

ARM / NEON

Ea supports AArch64 with NEON vector instructions. This page documents the differences from x86 and how to write portable kernels.

Vector Width

ARM NEON provides 128-bit vector registers. Only 128-bit vector types are available:

SupportedNot Supported
f32x4f32x8, f32x16
f64x2f64x4, f64x8
i32x4i32x8, i32x16
i16x8i16x16
i8x16i8x32
u8x16u8x32
u16x8u16x16

Using a 256-bit or 512-bit vector type on an ARM target produces a compile error. This is intentional -- Ea does not silently fall back to scalar code.

Unavailable Intrinsics

The following intrinsics are x86-only and produce a compile error on ARM:

IntrinsicReason
movemask(v)No ARM equivalent for extracting lane sign bits to a bitmask
gather(ptr, indices)No hardware gather support in NEON
scatter(ptr, indices, values)AVX-512 only
maddubs_i16(a, b)x86 PMADDUBSW instruction, no NEON equivalent
maddubs_i32(a, b)x86 specific

All other intrinsics (loads, stores, math, reductions, splat, select, shuffle, conversions) work on ARM.

Cross-Compilation

Compile for ARM from an x86 host:

ea kernel.ea --lib --target-triple=aarch64-unknown-linux-gnu

When compiling natively on an ARM machine, no special flags are needed:

ea kernel.ea --lib

The --avx512 flag is rejected on ARM targets with a compile error.

Writing Portable Kernels

Strategy 1: Use 128-bit types everywhere

Use f32x4, i32x4, etc. These work on both x86 (SSE) and ARM (NEON).

kernel scale(data: *mut f32, factor: f32) range(n) step(4) {
    let v: f32x4 = load(data, i);
    let f: f32x4 = splat(factor);
    store(data, i, v .* f);
}

This sacrifices throughput on x86 (which could use f32x8 with AVX2) but runs everywhere.

Strategy 2: Separate kernel files

Write platform-specific kernels in separate files and load the right one at runtime:

# kernel_x86.ea  -- uses f32x8 for AVX2
# kernel_arm.ea  -- uses f32x4 for NEON
import platform
if platform.machine() == "aarch64":
    k = ea.load("kernel_arm.ea")
else:
    k = ea.load("kernel_x86.ea")

Both files export the same function signatures, so the calling code does not change.

Strategy 3: Use the kernel construct

The kernel construct with step(N) lets you pick the vector width per file while keeping the loop logic identical. Write two .ea files with different step sizes and vector types, but the same exported function name and parameters.

Ea vs NumPy

When does writing a kernel in Ea actually beat NumPy? The answer comes down to one thing: arithmetic intensity -- how many operations you perform per byte loaded from memory.

The memory bandwidth wall

Modern CPUs can process arithmetic far faster than they can load data from DRAM. A single DDR4 channel delivers ~30-40 GB/s. NumPy's ufuncs are already compiled C with SIMD -- for simple operations, they saturate this bandwidth just like hand-written SIMD would.

Rule of thumb: if your operation does fewer than 2 arithmetic ops per element loaded, it is bandwidth-bound. Ea will match NumPy but not beat it.

Bandwidth-bound: Ea matches, no win

Array scaling

export func scale(src: *f32, dst: *mut f32, factor: f32, n: i32) {
    let f: f32x8 = splat(factor)
    let mut i: i32 = 0
    while i < n {
        let v: f32x8 = load(src, i)
        store(dst, i, v .* f)
        i = i + 8
    }
}
dst = src * factor  # NumPy: one SIMD multiply per element, same speed

One multiply per element loaded. Both Ea and NumPy hit ~35 GB/s on typical hardware. No winner.

Simple element-wise ops

Any operation that loads an element, does one thing, and stores the result is bandwidth-bound:

  • dst[i] = src[i] + offset
  • dst[i] = abs(src[i])
  • dst[i] = src_a[i] + src_b[i]

NumPy already handles these at memory bandwidth speed. Writing an Ea kernel gains you nothing.

Compute-bound: Ea wins

Fused scale + bias + clamp

NumPy must make three separate passes over memory:

dst = np.clip(src * scale + bias, 0.0, 1.0)  # 3 temporaries, 3 passes

Each pass loads the full array, computes one operation, and writes a temporary. For a 100 MB array, that is 600 MB of memory traffic.

Ea fuses everything into a single pass:

export func fused_scale_bias_clamp(src: *f32, dst: *mut f32, scale: f32, bias: f32, n: i32) {
    let s: f32x8 = splat(scale)
    let b: f32x8 = splat(bias)
    let zero: f32x8 = splat(0.0)
    let one: f32x8 = splat(1.0)
    let mut i: i32 = 0
    while i < n {
        let v: f32x8 = load(src, i)
        let result: f32x8 = fma(v, s, b)
        let clamped: f32x8 = min(max(result, zero), one)
        store(dst, i, clamped)
        i = i + 8
    }
}

One load, one FMA, two comparisons, one store. The data stays in registers the entire time. This is 3-5x faster than the NumPy version on large arrays because it reads memory once instead of three times.

Stencil operations

Convolutions, Sobel filters, and any operation that reads multiple neighboring elements per output pixel have high arithmetic intensity. A 3x3 Sobel kernel reads 9 values and performs 9 multiplications plus 8 additions per output -- well above the compute-bound threshold.

See Image Processing for stencil patterns.

Custom reductions with branching

NumPy cannot express per-element branching in vectorized form. Operations like "accumulate values, but skip negatives and double values above a threshold" require Python-level loops or awkward np.where chains.

Ea gives you SIMD comparisons and masked operations in a single loop body, keeping the pipeline full.

Dot products and FMA chains

Any reduction that multiplies and accumulates benefits from FMA (fused multiply-add) and register-level accumulation:

let acc: f32x8 = fma(a, b, acc)  // a * b + acc in one instruction

NumPy's np.dot is fast for large matrices (it calls BLAS), but for custom reductions, per-row operations, or non-standard accumulation patterns, Ea's explicit FMA beats NumPy's element-wise approach.

See ML Preprocessing for dot product and similarity patterns.

Decision checklist

Before writing an Ea kernel, ask:

  1. How many ops per element? If just one (scale, offset, abs), stay with NumPy.
  2. Does NumPy need multiple passes? If your expression chains 3+ operations, Ea's fusion wins.
  3. Is there a stencil or neighbor access? High arithmetic intensity -- Ea wins.
  4. Is there branching logic per element? NumPy cannot vectorize this -- Ea wins.
  5. Is it a standard BLAS operation? Use NumPy/SciPy -- they call optimized libraries.

Real-world packages

These packages demonstrate Ea beating NumPy on compute-bound workloads:

  • easobel -- Sobel edge detection (stencil, ~9 ops/pixel)
  • eastat -- CSV parsing (branching SIMD scan)
  • eavec -- Vector similarity search (FMA-heavy dot products)

Image Processing

Image processing is one of Ea's strongest use cases. Stencil operations (convolution, edge detection, blur) read multiple neighbors per output pixel, giving high arithmetic intensity that keeps the CPU compute-bound rather than memory-bound.

Sobel edge detection

The Sobel filter computes horizontal and vertical gradients using 3x3 stencils. Each output pixel reads 9 input values and performs 9 multiplications plus additions -- well above the threshold where Ea beats NumPy.

The pattern: for each pixel, load the 3x3 neighborhood, multiply by Sobel coefficients, and sum:

export func sobel_x(
    src: *f32, dst: *mut f32,
    width: i32, height: i32
) {
    let neg1: f32x8 = splat(-1.0)
    let pos1: f32x8 = splat(1.0)
    let neg2: f32x8 = splat(-2.0)
    let pos2: f32x8 = splat(2.0)
    let mut y: i32 = 1
    while y < height - 1 {
        let mut x: i32 = 0
        while x < width - 2 {
            let row_above: i32 = (y - 1) * width + x
            let row_center: i32 = y * width + x
            let row_below: i32 = (y + 1) * width + x

            let tl: f32x8 = load(src, row_above)
            let tr: f32x8 = load(src, row_above + 2)
            let ml: f32x8 = load(src, row_center)
            let mr: f32x8 = load(src, row_center + 2)
            let bl: f32x8 = load(src, row_below)
            let br: f32x8 = load(src, row_below + 2)

            let gx: f32x8 = (tr .* pos1) .+ (mr .* pos2) .+ (br .* pos1)
                .+ (tl .* neg1) .+ (ml .* neg2) .+ (bl .* neg1)
            store(dst, row_center + 1, gx)
            x = x + 8
        }
        y = y + 1
    }
}

For a production-ready Sobel implementation with both Gx/Gy gradients and magnitude computation, see easobel.

Pixel pipeline: u8 to f32 and back

Images from disk arrive as u8 (0-255). SIMD math works best on f32. The typical pattern:

  1. Load u8 pixels
  2. Widen to f32 (0.0 to 255.0)
  3. Process in f32 (normalize, filter, blend)
  4. Narrow back to u8

Widening: u8 to f32

export func normalize_u8_to_f32(src: *u8, dst: *mut f32, n: i32) {
    let scale: f32x4 = splat(0.00392156862)
    let mut i: i32 = 0
    while i < n {
        let pixels: f32x4 = widen_u8_f32x4(src, i)
        let normalized: f32x4 = pixels .* scale
        store(dst, i, normalized)
        i = i + 4
    }
}

widen_u8_f32x4(ptr, offset) loads 4 bytes from src + offset, zero-extends each to 32 bits, and converts to float. The result is a f32x4 with values in 0.0 to 255.0. Multiply by 1/255 to get the 0.0-1.0 range.

Narrowing: f32 to u8

export func f32_to_u8(src: *f32, dst: *mut u8, n: i32) {
    let s255: f32x4 = splat(255.0)
    let zero: f32x4 = splat(0.0)
    let mut i: i32 = 0
    while i < n {
        let v: f32x4 = load(src, i)
        let clamped: f32x4 = min(max(v, zero), s255)
        narrow_f32x4_i8(dst, i, clamped)
        i = i + 4
    }
}

narrow_f32x4_i8(ptr, offset, vec) converts 4 floats to integers, saturates to 0-255, and stores 4 bytes. Always clamp before narrowing to avoid overflow.

Putting it together

A full pixel pipeline (load u8, process in f32, store u8) processes 4 pixels per iteration on both x86 and ARM. Use f32x4 for the pipeline to keep it portable -- f32x8 works only on x86 with AVX2.

Why image stencils are compute-bound

A 3x3 convolution on a single-channel image performs 9 multiplications and 8 additions per output pixel, but only produces 1 output value. That is 17 arithmetic operations per output float -- far above the ~2 ops/element threshold where Ea's operation fusion matters.

For multi-channel images (RGB, RGBA), the arithmetic intensity is even higher because you process 3-4 channels per pixel position.

Compare this to simple brightness adjustment (pixel * 1.1), which is 1 op per element -- bandwidth-bound, and NumPy handles it just as fast. See Ea vs NumPy for more on this distinction.

Tips

  • Border handling: the examples above skip border pixels (starting at y=1, ending at height-1). For production kernels, handle borders separately with scalar code or clamped indexing.
  • Separable filters: Gaussian blur and similar filters can be split into horizontal and vertical passes, reducing a 3x3 stencil from 9 to 6 operations. Each pass is still compute-bound.
  • ARM portability: use f32x4 and i32x4 for kernels that need to run on both x86 and ARM. The 128-bit types work on both architectures.

Text Processing

Text processing benefits from SIMD when you can skip large regions of uninteresting bytes. The key pattern is chunk-skip: load 32 bytes at a time, check if any byte matches a target character, and skip the entire chunk if none match.

The chunk-skip pattern

Most bytes in a text file are not the character you are looking for. A newline scanner over a 1 MB file might find 10,000 newlines among 1,000,000 bytes -- 99% of chunks contain no match and can be skipped with a single comparison and branch.

export func count_newlines(data: *u8, n: i32) -> i32 {
    let newline: u8x32 = splat_u8x32(10)
    let mut count: i32 = 0
    let mut i: i32 = 0
    while i < n - 31 {
        let chunk: u8x32 = load_u8x32(data, i)
        let matches: u8x32 = chunk .== newline
        let mask: i32 = movemask(matches)
        if mask != 0 {
            let mut j: i32 = 0
            while j < 32 {
                if load_u8(data, i + j) == 10 {
                    count = count + 1
                }
                j = j + 1
            }
        }
        i = i + 32
    }
    let mut k: i32 = i
    while k < n {
        if load_u8(data, k) == 10 {
            count = count + 1
        }
        k = k + 1
    }
    count
}

The structure:

  1. Load 32 bytes as u8x32
  2. Compare with splat_u8x32(target) using .== to get a match vector
  3. movemask() collapses the vector comparison to a single i32 bitmask
  4. If mask is 0: no matches in this chunk, skip ahead 32 bytes (fast path)
  5. If mask is nonzero: scan the 32 bytes individually (slow path, but rare)
  6. Scalar tail: handle remaining bytes that do not fill a full 32-byte chunk

The fast path processes 32 bytes in about 3 instructions. On typical text, 90-99% of chunks take the fast path.

Why not extract individual bit positions?

Ea does not have shift operators (<<, >>), so you cannot extract individual bit positions from the movemask result. Instead, when a chunk contains matches, fall back to byte-by-byte scanning within that 32-byte window. This is still fast because:

  • Hot chunks are rare (most text is not your target character)
  • The 32-byte scan is a tight loop that fits in L1 cache
  • The overall speedup comes from skipping cold chunks, not from optimizing hot ones

ARM portability

movemask is an x86-only intrinsic (it maps to vpmovmskb). On ARM/NEON, there is no equivalent single instruction. For portable kernels, write separate x86 and ARM versions:

  • x86: use u8x32 + movemask as shown above
  • ARM: use u8x16 with scalar fallback, or structure the algorithm to avoid needing a bitmask

See ARM / NEON for details on architecture-specific intrinsics.

CSV parsing

CSV parsing combines the chunk-skip pattern with state tracking. You need to find commas and newlines while respecting quoted fields -- a comma inside quotes is not a delimiter.

The approach:

  1. Scan for quote characters (") using chunk-skip to track quote state
  2. Scan for delimiters (, and \n) using chunk-skip, filtering by quote state
  3. Parse numeric fields from the delimited regions

This is compute-bound because each byte potentially involves comparison against multiple target characters plus state logic -- exactly the kind of branching work that NumPy cannot vectorize.

For a complete CSV statistics package built on this pattern, see eastat.

Tips

  • Buffer alignment: SIMD loads from unaligned addresses work but may be slower on older hardware. If you control the buffer, align to 32 bytes.
  • Tail handling: always include a scalar loop for the last n % 32 bytes. Forgetting the tail is the most common bug in SIMD text processing.
  • Multi-character search: to find any of several characters (e.g., <, >, & for HTML), do multiple .== comparisons and combine with .| before the movemask.

ML Preprocessing

ML preprocessing pipelines often apply the same sequence of operations to every element in large arrays: normalize, scale, compute similarities. These are natural fits for Ea kernels because they fuse multiple operations into single-pass loops.

Normalize: zero-mean, unit-variance

The standard normalization (x - mean) / std requires two operations per element. NumPy computes them as separate passes:

normalized = (data - mean) / std  # 2 passes, 2 temporaries

Ea fuses this into one pass using FMA. Rewrite (x - mean) / std as x * (1/std) + (-mean/std):

export func normalize(
    src: *f32, dst: *mut f32,
    inv_std: f32, neg_mean_div_std: f32,
    n: i32
) {
    let s: f32x8 = splat(inv_std)
    let b: f32x8 = splat(neg_mean_div_std)
    let mut i: i32 = 0
    while i < n {
        let v: f32x8 = load(src, i)
        let result: f32x8 = fma(v, s, b)
        store(dst, i, result)
        i = i + 8
    }
}

The caller precomputes inv_std = 1.0 / std and neg_mean_div_std = -mean / std in Python. The kernel then does a single FMA per 8 elements -- one instruction that multiplies and adds simultaneously.

Dot product: dual-accumulator pattern

A naive dot product uses one accumulator:

let mut acc: f32x8 = splat(0.0)
while i < n {
    let a: f32x8 = load(x, i)
    let b: f32x8 = load(y, i)
    acc = fma(a, b, acc)
    i = i + 8
}

This leaves performance on the table. FMA has a latency of 4-5 cycles but a throughput of 1 per cycle. With one accumulator, each FMA waits for the previous one to finish.

Dual accumulators hide this latency by interleaving independent FMA chains:

export func dot_product(x: *f32, y: *f32, out: *mut f32, n: i32) {
    let mut acc0: f32x8 = splat(0.0)
    let mut acc1: f32x8 = splat(0.0)
    let mut i: i32 = 0
    while i < n - 15 {
        let a0: f32x8 = load(x, i)
        let b0: f32x8 = load(y, i)
        acc0 = fma(a0, b0, acc0)
        let a1: f32x8 = load(x, i + 8)
        let b1: f32x8 = load(y, i + 8)
        acc1 = fma(a1, b1, acc1)
        i = i + 16
    }
    while i < n {
        let a: f32x8 = load(x, i)
        let b: f32x8 = load(y, i)
        acc0 = fma(a, b, acc0)
        i = i + 8
    }
    let combined: f32x8 = acc0 .+ acc1
    let result: f32 = horizontal_sum(combined)
    store(out, 0, splat(result))
}

The CPU can execute fma(a0, b0, acc0) and fma(a1, b1, acc1) in parallel because they use different accumulators. This typically doubles throughput on modern CPUs.

Cosine similarity

Cosine similarity needs three reductions in one pass: dot(a, b), norm(a), and norm(b). Computing these separately means three passes over the data. Ea fuses all three:

export func cosine_similarity(a: *f32, b: *f32, out: *mut f32, n: i32) {
    let mut dot_acc: f32x8 = splat(0.0)
    let mut norm_a_acc: f32x8 = splat(0.0)
    let mut norm_b_acc: f32x8 = splat(0.0)
    let mut i: i32 = 0
    while i < n {
        let va: f32x8 = load(a, i)
        let vb: f32x8 = load(b, i)
        dot_acc = fma(va, vb, dot_acc)
        norm_a_acc = fma(va, va, norm_a_acc)
        norm_b_acc = fma(vb, vb, norm_b_acc)
        i = i + 8
    }
    let dot_val: f32 = horizontal_sum(dot_acc)
    let norm_a_val: f32 = horizontal_sum(norm_a_acc)
    let norm_b_val: f32 = horizontal_sum(norm_b_acc)
    let result: f32 = dot_val / sqrt(norm_a_val * norm_b_val)
    store(out, 0, splat(result))
}

Three FMAs per loop iteration, all operating on the same loaded vectors. The data passes through cache once. NumPy would need np.dot(a, b), np.linalg.norm(a), and np.linalg.norm(b) -- three separate passes.

Batch operations

For ML workloads, you often apply the same operation to many rows. The Python side handles the loop over rows, calling the Ea kernel for each:

import ea

lib = ea.load("similarity.ea")
for i in range(num_queries):
    lib.cosine_similarity(query[i], database[j], result, dim)

The kernel handles the inner loop (over vector dimensions) with SIMD. The outer loop (over queries or rows) stays in Python. This is the right split -- the inner loop is where SIMD matters.

Real-world package

eavec implements vector similarity search using the dual-accumulator FMA pattern. It computes cosine similarity across a database of vectors, returning the top-k most similar results.

Beating NumPy's BLAS at Constant-Q Transform with 80 Lines of Ea

A SIMD kernel compiler, a DFT nobody asked for, and an honest benchmark.


The Problem

Audio spectrum visualizers typically use FFT. FFT gives you linearly-spaced frequency bins — great for math, terrible for music. Human pitch perception is logarithmic: the distance from C2 to C3 (one octave) matters as much as C5 to C6. But FFT allocates equal resolution across the entire spectrum, wasting detail in the bass and over-resolving the treble.

The Constant-Q Transform (CQT) solves this. Each frequency bin gets its own window length — long windows for low frequencies (more cycles needed to resolve pitch), short windows for high frequencies. The result: 84 bins spanning 7 octaves, one per semitone, C2 through B8.

The catch: FFT can't do CQT. You need a DFT-per-bin, which is O(n*k) instead of O(n log n). The standard NumPy approach is to build a complex kernel matrix and let BLAS handle it with a single matrix-vector multiply.

We wanted to see if a hand-written SIMD kernel in Ea could beat that.

What is Ea

Ea is a compute kernel compiler. You write .ea files in a C-like language with explicit SIMD vector types, compile them to native shared libraries, and call them from Python with NumPy arrays. No C toolchain, no Cython, no JIT warmup.

export func scale(src: *f32, dst: *mut f32, factor: f32, n: i32) {
    let s: f32x8 = splat(factor)
    let mut i: i32 = 0
    while i < n {
        let v: f32x8 = load(src, i)
        store(dst, i, v .* s)
        i = i + 8
    }
}

Ea doesn't have sin or cos intrinsics. This is a deliberate design choice — trig is a policy decision (how many polynomial terms? what accuracy? what range?) that the language refuses to hide behind a simple-looking function call. If you need trig, you precompute tables in Python or write an explicit FMA polynomial chain.

This turns out to be the key insight for CQT.

The Design

A CQT with 84 bins (7 octaves, 12 semitones each) starting at C2 (65.4 Hz) at 44.1 kHz sample rate has these properties:

BinNoteFrequencyWindow Length
0C265.4 Hz11,339 samples
33A4440 Hz1,684 samples
83B87,903 Hz94 samples

The quality factor Q = 16.8 (constant across all bins — that's what "constant-Q" means). Total work per frame: 200,476 FMA operations across all bins.

The approach: precompute cos/sin twiddle factor tables in Python (one-time cost, 1.6 MB), then let Ea handle the per-frame FMA loop. We bake the Hann window directly into the twiddle factors during precomputation:

# Python: precompute once
for k in range(n_bins):
    n_k = int(lengths[k])
    i = np.arange(n_k, dtype=np.float64)
    window = 0.5 * (1 - np.cos(2 * np.pi * i / n_k))
    angle = 2 * np.pi * freqs[k] * i / SAMPLE_RATE
    cos_parts.append((window * np.cos(angle)).astype(np.float32))
    sin_parts.append((window * -np.sin(angle)).astype(np.float32))

This means the Ea kernel needs zero trig and zero windowing logic. The inner loop is pure FMA.

The Kernel

Version 1: Direct DFT with Dual Accumulators

The first kernel processes one frequency bin at a time. For each bin, it reads the audio segment, multiplies against precomputed cos and sin twiddle factors, accumulates, and computes the magnitude. Smoothing (exponential decay for the "falling bars" effect) is fused into the output — no separate kernel call, no intermediate array.

Two independent FMA accumulator chains (r0/r1, i0/i1) hide pipeline latency on superscalar CPUs. This is the same technique used in high-performance reduction kernels.

export func cqt_fused(
    audio: *f32,
    cos_table: *f32,
    sin_table: *f32,
    offsets: *i32,      // per-bin start offset into twiddle tables
    lengths: *i32,      // per-bin window length
    prev_mags: *f32,
    out_mags: *mut f32,
    alpha: f32,         // decay factor
    n_bins: i32,
    max_window_len: i32
) {
    let mut k: i32 = 0
    while k < n_bins {
        let off: i32 = offsets[k]
        let win_len: i32 = lengths[k]
        let audio_start: i32 = max_window_len - win_len

        let mut r0: f32x8 = splat(0.0)
        let mut r1: f32x8 = splat(0.0)
        let mut i0: f32x8 = splat(0.0)
        let mut i1: f32x8 = splat(0.0)

        let mut i: i32 = 0
        while i + 16 <= win_len {
            prefetch(cos_table, off + i + 64)
            prefetch(sin_table, off + i + 64)

            let s0: f32x8 = load(audio, audio_start + i)
            let s1: f32x8 = load(audio, audio_start + i + 8)
            let c0: f32x8 = load(cos_table, off + i)
            let c1: f32x8 = load(cos_table, off + i + 8)
            let sn0: f32x8 = load(sin_table, off + i)
            let sn1: f32x8 = load(sin_table, off + i + 8)

            r0 = fma(s0, c0, r0)
            r1 = fma(s1, c1, r1)
            i0 = fma(s0, sn0, i0)
            i1 = fma(s1, sn1, i1)

            i = i + 16
        }

        // 8-element and scalar tails omitted for brevity

        let real: f32 = reduce_add(r0 .+ r1)
        let imag: f32 = reduce_add(i0 .+ i1)
        let mag: f32 = sqrt(real * real + imag * imag)

        // Fused smooth decay
        let decayed: f32 = prev_mags[k] * alpha
        if mag > decayed {
            out_mags[k] = mag
        } else {
            out_mags[k] = decayed
        }

        k = k + 1
    }
}

The generated assembly for the hot loop is clean — the Ea compiler folds twiddle loads directly into vfmadd231ps memory operands, so the audio loads (vmovups) are the only explicit loads:

.LBB0_4:
    prefetcht0  (%rsi,%r12,4)      ; prefetch cos_table
    prefetcht0  (%rdx,%r12,4)      ; prefetch sin_table
    vmovups     (%rdi,%r12,4), %ymm5   ; load audio[i..i+7]
    vmovups     (%rdi,%r12,4), %ymm6   ; load audio[i+8..i+15]
    vfmadd231ps (%rsi,%r12,4), %ymm5, %ymm3  ; r0 += audio * cos
    vfmadd231ps (%rsi,%rbp,4), %ymm6, %ymm4  ; r1 += audio * cos
    vfmadd231ps (%rdx,%r12,4), %ymm5, %ymm2  ; i0 += audio * sin
    vfmadd231ps (%rdx,%rbp,4), %ymm6, %ymm1  ; i1 += audio * sin

Four FMAs per iteration, each operating on 8 floats. 64 floating-point multiply-adds per loop cycle.

First Benchmark: The Lie

Our first benchmark compared this against a Python for-loop CQT:

def numpy_cqt(audio, freqs, max_window_len):
    magnitudes = np.empty(N_BINS, dtype=np.float32)
    for k in range(N_BINS):
        n_k = int(np.ceil(Q * SAMPLE_RATE / freqs[k]))
        segment = audio[start:start + n_k]
        window = 0.5 * (1 - np.cos(2 * np.pi * i / n_k))
        kernel = window * np.exp(-2j * np.pi * freqs[k] * i / SAMPLE_RATE)
        magnitudes[k] = np.abs(np.sum(segment * kernel))
    return magnitudes

Result: 153x faster. We almost shipped this number.

The problem: this "NumPy CQT" is a Python for-loop with per-bin allocations. Nobody would write production CQT this way. The honest NumPy approach is to precompute a (84, 11339) complex kernel matrix and do a single np.dot:

# Precompute once: (n_bins, max_window_len) complex64 matrix
kernel_matrix = np.zeros((N_BINS, max_window_len), dtype=np.complex64)
for k in range(N_BINS):
    # ... fill in windowed complex exponentials, zero-pad shorter bins

# Per frame: single BLAS call
magnitudes = np.abs(kernel_matrix @ audio)

This calls directly into OpenBLAS/MKL — decades of hand-tuned assembly for matrix-vector multiply.

Honest Benchmark v1

Methodology:

  • All competitors precompute everything they can (tables, windows, indices)
  • Only per-frame work is timed: transform + magnitude + smoothing
  • Same smoothing (max(current, previous * alpha)) applied to all

With N=500 iterations and inadequate warmup:

Time
Ea CQT0.098 ms
NumPy matmul0.085 ms

NumPy was winning. The 153x headline collapsed to Ea being 1.1x slower.

Fixing the Benchmark

The N=500 result was noisy. WSL2 scheduling and insufficient warmup made the numbers unreliable. We moved to a min-of-trials methodology: 10 independent trials of 2,000 iterations each, reporting both min (best-case, no OS interference) and median (real-world scheduling).

With proper warmup (200+ iterations before timing):

MinMedian
Ea CQT50 us55 us
NumPy matmul95 us190 us

Ea was already 1.9x faster at best, 3.5x at median. The initial "1.1x slower" was just noise.

Optimization Attempts

Quad Accumulators

Theory: 4 independent FMA chains (8 accumulator registers) should hide more pipeline latency than 2 chains.

Result: slower (100 us vs 83 us). With 8 YMM accumulator registers + 12 data registers for loads, we exceed x86-64's 16 YMM register limit. The compiler spills to stack memory, killing the very latency hiding we wanted.

Interleaved Loads

Theory: load cos, immediately FMA, then load sin and FMA. Give the memory subsystem more time to fetch by interleaving loads with compute.

Result: slower (191 us vs 53 us). The original pattern — batch all loads, then batch all FMAs — is better for out-of-order execution. The CPU's reorder buffer can schedule loads far ahead of their consumers when they're grouped together.

Merged Twiddle Table

Theory: interleave cos and sin data for each bin into a single contiguous memory region [cos_bin0 | sin_bin0 | cos_bin1 | ...] to halve cache line traffic.

Result: no change (53.8 us vs 53.1 us). The hardware prefetcher already handles two sequential streams efficiently. The second stream (sin_table) gets prefetched in parallel with the first (cos_table) because the access pattern is identical — sequential scan with the same stride.

In-Place Smoothing

Instead of writing to out_mags and copying back to prev_mags in Python, the kernel reads and writes the same buffer:

export func cqt_inplace(
    audio: *f32,
    cos_table: *f32,
    sin_table: *f32,
    offsets: *i32,
    lengths: *i32,
    mags: *mut f32,    // read previous, write smoothed result
    alpha: f32,
    n_bins: i32,
    max_window_len: i32
)

Result: ~5% faster (47.8 us vs 50.2 us). Modest win from eliminating a 336-byte numpy copy per frame plus one Python function call.

Final Numbers

Methodology: 10 trials of 2,000 iterations each, 200-iteration warmup per trial, min-of-trials reported. All precomputation excluded. All competitors include smoothing. NumPy uses pre-allocated output buffers (out= parameter) to avoid allocation overhead.

MinMedian
Ea CQT (fused SIMD)~50 us~55 us
NumPy CQT (BLAS matmul)~95 us~190 us
NumPy FFT (wrong result)~106 us~118 us

Ea vs BLAS: 1.9x faster (min), 3.5x (median) Ea vs FFT: 2.1x faster — and FFT gives the wrong answer Throughput: 7.8 GFLOP/s

Why Ea Wins

1. Ea reads 4.8x less data. The BLAS matmul uses a dense (84, 11339) complex64 matrix — 7.4 MB per frame, including all the zero-padded regions where shorter bins have no data. Ea's variable-length inner loops skip zeros entirely, touching only 200,476 real elements (1.6 MB).

2. Real vs complex arithmetic. BLAS operates on complex64 (pairs of float32). Every BLAS FMA processes a complex multiply-add — 4 real multiplies and 2 real adds per element. Ea works directly in float32, doing 2 real FMAs per element (one for the cos component, one for sin).

3. Fused pipeline. Ea does window + transform + magnitude + smoothing in one function call. NumPy requires 4 separate calls: matmul, abs, maximum, copyto. Each call crosses the Python/C boundary, allocates or writes to buffers, and makes a separate pass over the output data.

4. Lower scheduling jitter. One ctypes call has less interrupt surface than four NumPy calls. This explains the median gap (3.5x) being larger than the min gap (1.9x) — Ea's single kernel call is less likely to be interrupted mid-computation.

Why Ea Doesn't Win More

The dual-accumulator inner loop is already close to optimal for AVX2. The generated assembly has 4 vfmadd231ps instructions per iteration with memory operands — the compiler is folding loads into FMAs. There's no instruction waste.

The bottleneck is memory bandwidth. At 50 us for 1.6 MB of twiddle table reads, we're pulling ~32 GB/s — respectable for L3 cache, but below the theoretical memory bandwidth. The per-bin overhead (loading offsets/lengths, setting up accumulators, horizontal reduction) fragments what could be one continuous stream into 84 separate sweeps.

BLAS doesn't have this problem. Its matmul is one uninterrupted sweep through a dense matrix, which modern CPUs and prefetchers are specifically optimized for. BLAS trades compute (touching zeros) for access pattern regularity — and on some runs, when the OS cooperates perfectly, that trade nearly pays off.

The Code

Three files. The full kernel (cqt.ea, 80 lines), the Python visualizer, and the benchmark. The kernel uses the language as intended: explicit types, explicit vector widths, explicit memory access. No hidden costs.

import ea
import numpy as np

kernel = ea.load("cqt.ea")

# Precompute tables once
freqs = F_MIN * 2 ** (np.arange(84) / 12)
lengths = np.ceil(Q * SAMPLE_RATE / freqs).astype(np.int32)

# ... build cos/sin twiddle tables with baked-in Hann window ...

# Per frame: one call, ~50 us
kernel.cqt_fused(
    audio_buffer, cos_table, sin_table,
    offsets, lengths, prev_mags, out_mags,
    decay_alpha, n_bins, max_window_len,
)

The full source is on GitHub.

What We Learned

The first benchmark said 153x. That was a lie — comparing optimized SIMD against a Python for-loop. The second benchmark said 1.1x slower. That was noise — insufficient warmup on a noisy WSL2 scheduler. The real number is 1.9x faster, earned by reading less data, using simpler arithmetic, and fusing the pipeline.

Three optimization attempts failed (quad accumulators, interleaved loads, merged tables) and one gave a modest 5% win (in-place smoothing). The lesson: when the compiler is already generating clean assembly, micro-optimizations rarely help. The wins came from the algorithm design — variable-length windows, baked-in windowing, fused smoothing — not from squeezing the inner loop.

Ea didn't win by being a faster compiler. It won by making it easy to write the right algorithm. A kernel that skips zeros, fuses four operations, and makes one function call will beat a kernel that touches every element of a zero-padded matrix and makes four function calls — even when the latter is backed by BLAS.

How 30 Lines of Eä Beat NumPy by 6×

And why your framework is probably slower than a for-loop.


The Pitch

Here's a fused multiply-add: out[i] = a[i] * b[i] + c[i]. Sixteen million times.

NumPy does it in 46 milliseconds. Eä does it in 7 milliseconds. That's 6.6× faster.

Here's the kicker: the Eä kernel is 30 lines. The NumPy version is one line. And the one-liner loses because it's actually two lines pretending to be one.

The NumPy Version

out = a * b + c

Simple. Elegant. Two full scans of 64 MB of data.

NumPy computes a * b first, writes 64 MB to a temporary array, then reads it back to add c. That's 256 MB of memory traffic for 192 MB of actual data. Every element gets loaded, stored, loaded again, and stored again.

On a modern CPU at ~35 GB/s memory bandwidth, that's a floor of about 7 milliseconds just for memory. NumPy hits 46 ms because it also has Python dispatch overhead, array allocation, and — crucially — it can't fuse the multiply and add into a single instruction.

The Eä Version

export func fma_f32x8(
    a: *restrict f32,
    b: *restrict f32,
    c: *restrict f32,
    out: *mut f32,
    n: i32
) {
    let mut acc0: f32x8 = splat(0.0)
    let mut acc1: f32x8 = splat(0.0)
    let mut i: i32 = 0
    while i + 16 <= n {
        let va0: f32x8 = load(a, i)
        let vb0: f32x8 = load(b, i)
        let vc0: f32x8 = load(c, i)
        store(out, i, fma(va0, vb0, vc0))

        let va1: f32x8 = load(a, i + 8)
        let vb1: f32x8 = load(b, i + 8)
        let vc1: f32x8 = load(c, i + 8)
        store(out, i + 8, fma(va1, vb1, vc1))

        i = i + 16
    }
    while i < n {
        out[i] = a[i] * b[i] + c[i]
        i = i + 1
    }
}

One pass. Load a, b, c, fuse multiply-add in a single vfmadd instruction, store result. Each element is touched once. Total memory traffic: 256 MB (same 4 arrays), but only one scan instead of two.

The *restrict tells LLVM the arrays don't alias, enabling aggressive optimization. The f32x8 processes 8 floats per instruction on AVX2. The 2× unroll (processing 16 elements per iteration) hides memory latency.

That's it. No magic. Just not doing twice the work.

"But I'll Just Use Ray"

We benchmarked that too. For science.

MethodTimeThroughputvs NumPy
NumPy (multiply + add)46,000 µs5.6 GB/sbaseline
Eä (1 thread)6,900 µs37.0 GB/s6.6×
Eä (2 threads)6,500 µs39.1 GB/s7.0×
Dask (2 chunks)56,000 µs4.6 GB/s0.8×
Ray (2 workers)89,000 µs2.9 GB/s0.5×

Ray is twice as slow as NumPy. For FMA. On two cores. On the same machine.

Why? Serialization. Ray pickles your 64 MB arrays, sends them to worker processes, unpickles them, runs NumPy (which does two passes), pickles the results, and sends them back. The actual compute takes 46 ms. The overhead takes another 43 ms.

Dask is better — it chunks lazily and uses NumPy under the hood — but it still can't fuse operations or control SIMD width. It's paying for an abstraction layer over code that's already fast enough.

The uncomfortable truth: for single-machine numerical work, a for-loop that touches memory once will beat a distributed framework that touches memory twice.

The Dot Product Story

We weren't always winning. Our first benchmark had Eä's dot product at 0.27× of BLAS. Embarrassing.

// The naive version — DO NOT ship this
export func dot_naive(a: *f32, b: *f32, n: i32) -> f32 {
    let mut acc: f32 = 0.0
    let mut i: i32 = 0
    while i < n {
        acc = acc + a[i] * b[i]
        i = i + 1
    }
    return acc
}

Single scalar accumulator. Each iteration depends on the previous one. The CPU stalls waiting for the addition to complete before it can start the next multiply. Classic dependency chain bottleneck.

BLAS uses the trick every HPC programmer knows: multiple independent accumulators.

// The optimized version — ships in autoresearch/kernels/dot_product/
export func dot_f32x8(a: *restrict f32, b: *restrict f32, len: i32) -> f32 {
    let mut acc0: f32x8 = splat(0.0)
    let mut acc1: f32x8 = splat(0.0)
    let mut i: i32 = 0
    while i + 32 <= len {
        acc0 = fma(load(a, i),      load(b, i),      acc0)
        acc1 = fma(load(a, i + 8),  load(b, i + 8),  acc1)
        acc0 = fma(load(a, i + 16), load(b, i + 16), acc0)
        acc1 = fma(load(a, i + 24), load(b, i + 24), acc1)
        i = i + 32
    }
    // ... tail handling ...
    return reduce_add_fast(acc0 .+ acc1)
}

Two f32x8 accumulators (acc0, acc1) with 4× unroll. The CPU has 16 FMA operations in flight that don't depend on each other. The result:

MethodTimeGB/svs BLAS
NumPy BLAS sdot3,535 µs35.9baseline
Eä naive (scalar)13,222 µs9.70.27×
Eä f32x4 (1 acc)4,474 µs28.60.79×
Eä f32x8 (dual acc, 4× unroll)3,500 µs36.61.01×

From 0.27× to 1.01× — by changing the loop structure, not the algorithm. Both versions compute the same dot product. The fast one just asks the CPU to think about 32 elements instead of 1.

What You're Actually Seeing

This isn't "Eä is fast." Eä is a thin wrapper around LLVM. The FMA kernel compiles to about 12 assembly instructions in the inner loop. Eä's job is to make writing those 12 instructions feel like writing 30 lines of readable code instead of 200 lines of intrinsics.

The real insight is about NumPy's cost model:

OperationNumPy
a * b + c2 passes over data1 pass (fused FMA)
Temporary arrays1 allocation (64 MB)0 allocations
SIMD width"whatever the compiler picks"explicit f32x8
Memory round-trips2 (multiply result → RAM → add)0

NumPy is fast per operation. But real workloads aren't one operation — they're pipelines. And every np. call is a full scan of your data.

When Eä Doesn't Win

Bandwidth-bound operations with a single arithmetic op per element. Scaling an array:

out = src * 2.5

NumPy: 5,631 µs. Eä: 6,138 µs. NumPy wins (by 8%).

Both saturate memory bandwidth at ~22 GB/s. There's nothing to fuse. One multiply per element loaded. No amount of SIMD cleverness makes DRAM faster.

This is the honest dividing line: if your inner loop does fewer than ~4 operations per element, NumPy is fast enough. If it does more, you're leaving performance on the table.

Getting Started

pip install ea-compiler
import ea

kernel = ea.load("fma.ea")

import numpy as np
a = np.random.rand(16_000_000).astype(np.float32)
b = np.random.rand(16_000_000).astype(np.float32)
c = np.random.rand(16_000_000).astype(np.float32)
out = np.zeros(16_000_000, dtype=np.float32)

kernel.fma_f32x8(a, b, c, out)  # 7 ms instead of 46 ms

No Cython. No Numba. No C compiler. No JIT warmup. Write a .ea file, call ea.load(), pass NumPy arrays.

The compiler handles SIMD, the binding handles types, and the hardware handles the rest.


All benchmarks measured on a 2-core machine. Your numbers will be different, probably better (more cores = more bandwidth). The ratios hold. Source code and methodology on GitHub.

Eä is open-source (Apache 2.0). The compiler is ~12,000 lines of Rust with 475+ tests. Documentation. GitHub.

I Wrote a SIMD Compiler in 12K Lines of Rust

And then an LLM optimized the kernels better than I could.


The Elevator Pitch

Eä is a compiler for SIMD kernels. You write a small .ea file, run one command, and call it from Python like a normal function. But it runs at native vectorized speed.

import ea
kernel = ea.load("fma.ea")
result = kernel.fma_f32x8(a, b, c, out)  # 6.6× faster than NumPy

No ctypes. No header files. No build system. The compiler generates the shared library and the Python wrapper. Also Rust, C++, PyTorch, and CMake bindings. It targets x86-64 (AVX2/AVX-512) and AArch64 (NEON).

The whole compiler is 12,000 lines of Rust. 475 tests. One person. I'm not a compiler engineer. But I had a very specific problem, and it turns out that's enough.

Why

I had a problem that kept repeating. I'd write something in Python, profile it, find a hot loop, and think: "this needs to be fast." And I knew that fast meant C. Fast meant SIMD. I don't have deep experience with either, but I knew that's where the performance lives.

So I'd fumble through some C code (with an LLM helping me write it, which honestly made it worse because now the code worked but I didn't fully understand why), fight with ctypes, spend an afternoon debugging pointer arithmetic that the AI got right the first time but I broke while "cleaning up," and eventually get a 5× speedup. Then next week, different project, same dance.

I didn't mind the hard part. Figuring out what the kernel should do, thinking about memory access patterns, deciding on vector widths. That's the interesting problem. What I minded was the plumbing. The header files. The build system. The ctypes declarations. The dtype validation. All of it boilerplate, all of it error-prone, none of it the actual work.

So I thought: what if a compiler could handle the plumbing? You write the kernel in a simple language, something that looks like the pseudocode you'd sketch on a whiteboard, and the compiler handles everything else. Compile to a shared library. Auto-generate the Python wrapper. One command. No Makefile.

The "no glue code" part turned out to be the product. Not the SIMD. Not the compiler. The fact that you go from .ea file to working Python function in one command, with types checked, lengths inferred, and output buffers allocated. That's what makes people actually use it instead of just reading about it.

I didn't know how to build a compiler. But I had the idea, and I wanted to see if it would work.

The First Attempt (10K Lines of Pain)

I should tell you about the compiler I wrote before this one.

It also targeted LLVM. It also generated SIMD code. The codegen was 10,000 lines in a single file. The parser was hand-written but handled way too many features. There were no hard limits on file size, no style rules, no test discipline. I kept adding things (generics, a module system, type inference) because why not?

The codebase became unmaintainable in about three months. I couldn't change anything without breaking something else. The codegen file had functions that called functions that called functions eight levels deep, and half of them were handling edge cases for features nobody used.

I threw it away.

The lesson: a SIMD kernel compiler doesn't need generics. It doesn't need modules. It doesn't need type inference. It needs to compile load, store, fma, and splat correctly, generate clean bindings, and stay small enough that one person can hold it in their head.

The Hard Rules

For the second attempt, I wrote the rules before I wrote the code:

  1. No file exceeds 500 lines. Split before you hit the limit.
  2. Every feature proven by end-to-end test. If it's not tested, it doesn't exist.
  3. No fake functions. If hardware doesn't support an operation, the compiler errors. No silent fallbacks.
  4. No premature features. Don't build what isn't needed yet.
  5. Delete, don't comment. Dead code gets removed.

The 500-line rule was the hardest to follow and the most valuable. It forced me to split the type checker into 7 files, the codegen into 10 files, and the parser into 4 files. Each file does one thing. When I need to change how store is type-checked, I open intrinsics_memory.rs (309 lines) and the answer is right there. No grepping through 10K lines of spaghetti.

Was it frustrating? Constantly. I'd be in the middle of adding a feature, hit 480 lines, and have to stop and refactor before I could finish. But every time I did, the code got better. The refactor always revealed something. A responsibility that should have been split earlier. A function that was doing two things.

The "no premature features" rule was the other hard one. I kept wanting to add generics. Or a module system. Or traits. And every time, I'd ask myself: does this serve the goal of compiling SIMD kernels to callable shared libraries? The answer was always no. Eä is monomorphic by design. You write f32x8, you get f32x8. No hidden specialization, no surprise codegen, no combinatorial explosion of type instances.

It's not a limitation. It's the point.

The Architecture

.ea → Lexer (logos) → Parser → Desugar → Type Check → Codegen (LLVM 18) → .o / .so
                                                                          → .ea.json → ea bind

The most important insight: the desugarer is the most important pass.

Eä has a kernel construct that looks like this:

export kernel vscale(data: *f32, out: *mut f32, factor: f32)
    over i in n step 8
    tail scalar { out[i] = data[i] * factor }
{
    store(out, i, load(data, i) .* splat(factor))
}

The desugarer turns this into a plain function with a while-loop and a tail loop. After desugaring, there are no kernels in the AST. Just functions with loops. The type checker and codegen never see kernel. They only see func.

This means every downstream pass is simpler. The type checker doesn't need special kernel logic. The codegen doesn't need to handle iteration. The desugarer handles all of it: injecting the n parameter, generating the loop variable, building the main loop with step, building the tail loop with the chosen strategy.

The desugar pass is 340 lines. It eliminates an entire class of complexity from the remaining 11,660 lines.

The Binding Generators

This is the part people don't expect. The compiler generates .ea.json metadata describing each exported function's signature. Then ea bind reads the JSON and generates idiomatic wrappers:

ea bind kernel.ea --python --rust --cpp --pytorch --cmake

The Python generator does something clever: length collapsing. If your kernel takes (data: *f32, n: i32), the generated Python function takes just data and fills n from data.size automatically. Output parameters marked with out get auto-allocated. The generated code checks dtypes, casts pointers, and handles all the ctypes plumbing that used to take me an afternoon.

Five binding generators, each 200-460 lines. No serde. The JSON parser is hand-written (65 lines). The generated code is clean enough that you can read it, modify it, and learn from it.

Error Messages (The Quiet Win)

I spent more time on error messages than on codegen.

The compiler has "did you mean?" suggestions with Levenshtein distance, dot-operator hints ("cannot use '+' on vectors, use '.+'"), let mut suggestions, type conversion hints, multi-character underlines showing the full expression span, and clear messages that never leak internal compiler state.

kernel.ea:5:12  error[type]: cannot use '+' on vectors. Use '.+' for element-wise vector operations
        return a + b
               ^^^^^

Nobody notices good error messages. But everyone notices bad ones. The difference between a user who gives up and a user who fixes their code is often one helpful error message.

The Autoresearch System (The Fun Part)

This is where it gets interesting.

Inspired by Andrej Karpathy's autoresearch concept, I built an automated optimization loop: an LLM reads the kernel source, the benchmark results, and the history of what's been tried, then proposes a modified kernel. The system compiles it, benchmarks it across multiple data sizes (to catch cache-fitting illusions), verifies correctness against a C reference, and accepts or rejects the change. Then it iterates.

This is where I had the most fun building Eä.

The first time I ran it on the FMA kernel, it found a 10% improvement in 30 iterations. I thought the kernel was already as good as it could get. The LLM found that 12× unrolling with stream stores beat 4× unrolling with regular stores at DRAM scale. I wouldn't have tried that. It sounds like overkill.

Then I let it run on the matrix multiplication kernel. 56% improvement. It switched from ijk to ikj loop order with 8× k-unrolling. I've heard of loop tiling. I couldn't have told you when to apply it. The LLM didn't need to "know." It just tried it and the benchmark said yes.

The thing that surprised me most: you think you have an optimal kernel. You let the LLM iterate 5 times and it finds 20% improvement. Okay, fine, maybe it wasn't optimal. So you let it iterate 50 times on the already-improved kernel. And it still finds improvements. The search space for kernel optimization is bigger than your intuition.

27 benchmark kernels, all scored on largest-size (real-world) data with GB/s bandwidth metrics. The system includes bottleneck classification that tells the LLM whether a kernel is DRAM-bound (don't bother with compute tricks), compute-bound (try wider SIMD, more accumulators), or mixed. The biggest wins:

KernelImprovementWhat Changed
Bitonic sort97%Replaced O(n²) Shellsort with sorting network
Matmul56%k×8 unroll, cache-friendly access
Conv2d 3×347%4× column unroll, prefetch, restrict
Edge detect41%f32x4 to f32x8 upgrade

The humbling part: I'm not an optimization expert. But it turns out you don't need to be one. You need a benchmark harness, a correctness check, and a system that's willing to try things you wouldn't think of.

The Numbers

Source:          12,000 lines of Rust
Tests:           475 end-to-end
Test method:     compile Eä → link with C → run binary → compare stdout
CI:              x86-64, AArch64 (native), Windows
LLVM backend:    18.1 via inkwell 0.8
Binding targets: Python, Rust, C++, PyTorch, CMake

Performance on a real workload (16M float32 elements):

FMA (fused multiply-add):  6.6× faster than NumPy    37.0 GB/s
Dot product:               matches BLAS               36.6 GB/s
SAXPY:                     2.1× faster than NumPy     35.2 GB/s

That's ~37 GB/s on a system with ~40 GB/s memory bandwidth. Near the hardware limit. There's not much room left, which means the code is doing roughly as little unnecessary work as possible.

What I'd Do Differently

Start with the binding generator. I built the compiler first and added bindings later. But the bindings are what make Eä useful. If I'd started by designing the ideal Python API and worked backward to the compiler, some early decisions would have been different.

Add ea inspect earlier. The instruction analysis tool that shows you vector/scalar instruction counts, FMA operations, load/store ratio, and performance hints. I added it late, but it would have caught optimization issues months earlier.

Write fewer features, sooner. Eä has kernel, foreach, for, while, structs, output annotations, conditional compilation, static assertions, and 30+ intrinsics. Most users need kernel, load, store, fma, and splat. I should have shipped a useful subset earlier and iterated based on real usage.

The Real Story

I'm not a compiler engineer. I don't have a CS degree. I'm the kind of person who has ideas and wants to see if they work.

What changed is the tooling. I built Eä with the help of AI models. Claude for the heavy lifting, my own judgment for the architecture and design decisions. The hard rules came from me (learned the painful way from the first attempt). The implementation speed came from having a capable coding assistant.

A year ago, this would have required a team. Now it's 12,000 lines and one person.

The interesting question isn't "can you build this?" anymore. It's what are you going to build next?

My advice: if you have an idea that feels too ambitious for one person, the calculus has changed. Try it. Set hard rules so the codebase stays manageable. Write tests for everything. And don't be afraid to throw away your first attempt.

I threw away 10K lines of bad compiler and started over. Best decision I made.


Eä is open-source under Apache 2.0. GitHub · Documentation · pip install ea-compiler