Your First Kernel
Let's write a kernel that scales an array by a constant factor, then upgrade it to use SIMD.
Step 1: Scalar version
Create scale.ea:
export func scale(src: *f32, dst: *mut f32, factor: f32, n: i32) {
    let mut i: i32 = 0
    while i < n {
        dst[i] = src[i] * factor
        i = i + 1
    }
}
Key things to notice:
- export makes the function callable from Python (C ABI)
- *f32 is an immutable pointer to float32: your input array
- *mut f32 is a mutable pointer: your output array
- n: i32 is the array length; the caller provides this
- All types are explicit. No inference, no ambiguity.
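Because the kernel receives raw float32 pointers and indexes them element by element, the arrays handed to it need to be float32 and laid out contiguously in memory. The sketch below shows one way to guarantee that on the NumPy side. The helper name as_f32_buffer is hypothetical and not part of Eä; it only wraps np.ascontiguousarray.

```python
import numpy as np

def as_f32_buffer(values):
    """Return a C-contiguous float32 array (hypothetical helper, not an Eä API).

    np.ascontiguousarray copies only when the input is not already
    contiguous float32, so well-formed inputs pass through unchanged.
    """
    return np.ascontiguousarray(values, dtype=np.float32)

# A strided float64 view: wrong dtype and not contiguous.
raw = np.arange(16, dtype=np.float64)[::2]

src = as_f32_buffer(raw)    # input buffer for a *f32 parameter
dst = np.empty_like(src)    # output buffer for a *mut f32 parameter
```

Whether the Python binding performs a conversion like this on its own is not stated here, so doing it explicitly is the conservative choice.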
Step 2: Call from Python
import ea
import numpy as np
kernel = ea.load("scale.ea")
src = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
dst = np.empty_like(src)
kernel.scale(src, dst, factor=2.0)
print(dst) # [2. 4. 6. 8.]
ea.load() compiles scale.ea the first time, then caches the result in __eacache__/. Subsequent calls load from cache instantly.
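Before moving on, it can help to check a kernel against a NumPy reference. The following is a minimal sketch, assuming the kernel is callable as fn(src, dst, factor=...) as in Step 2. The names check_scale and numpy_scale are hypothetical; numpy_scale is only a stand-in so the example runs without Eä installed.

```python
import numpy as np

def check_scale(scale_fn, n=1024, factor=2.0, seed=0):
    """Run scale_fn on random float32 data and compare with NumPy."""
    rng = np.random.default_rng(seed)
    src = rng.standard_normal(n).astype(np.float32)
    dst = np.zeros_like(src)

    scale_fn(src, dst, factor=factor)

    # Reference result computed in float32, like the kernel.
    expected = src * np.float32(factor)
    np.testing.assert_allclose(dst, expected, rtol=1e-6)
    return True

def numpy_scale(src, dst, factor):
    """Stand-in with the same call shape as the kernel function."""
    np.multiply(src, factor, out=dst)

check_scale(numpy_scale)
```

With the kernel loaded, check_scale(kernel.scale) would run the same comparison, provided the call signature matches the one shown in Step 2. The default n of 1024 is a multiple of 8, so the check also suits kernels that process 8 elements per iteration.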
Step 3: SIMD version
Now let's process 8 floats at a time:
export func scale_simd(src: *f32, dst: *mut f32, factor: f32, n: i32) {
    let s: f32x8 = splat(factor)
    let mut i: i32 = 0
    while i < n {
        let v: f32x8 = load(src, i)
        store(dst, i, v .* s)
        i = i + 8
    }
}
What changed:
- f32x8: a vector of 8 floats (256-bit, uses AVX2 on x86)
- splat(factor): broadcasts the scalar to all 8 lanes
- load(src, i): loads 8 consecutive floats starting at index i
- v .* s: element-wise multiply (the dot prefix . means "vector operation")
- store(dst, i, ...): writes 8 floats back
- We increment by 8, not 1
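To make the data flow concrete, here is a rough NumPy model of the loop in scale_simd. It is only an illustration of the semantics described in the list above, not how Eä executes the kernel: each 8-element slice plays the role of one f32x8 register. The name scale_simd_model and the explicit length check are additions for this sketch.

```python
import numpy as np

LANES = 8  # width of f32x8

def scale_simd_model(src, dst, factor):
    """Model of scale_simd: one 8-float chunk per loop iteration."""
    n = src.shape[0]
    if n % LANES != 0:
        raise ValueError("n must be a multiple of 8")

    s = np.full(LANES, factor, dtype=np.float32)  # splat(factor)
    i = 0
    while i < n:
        v = src[i:i + LANES]        # load(src, i)
        dst[i:i + LANES] = v * s    # store(dst, i, v .* s)
        i += LANES                  # advance by one full vector

src = np.arange(16, dtype=np.float32)
dst = np.empty_like(src)
scale_simd_model(src, dst, 2.0)
```

For 16 elements the loop body runs twice, where the scalar version would run 16 times.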
Important: This only works when n is a multiple of 8. Otherwise the final load and store would run past the end of the arrays. For arbitrary lengths, see Kernels for tail-handling strategies.
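One caller-side workaround, separate from the in-kernel strategies the Kernels page covers, is to send the largest multiple-of-8 prefix to the SIMD kernel and the remaining elements to the scalar kernel. The sketch below assumes both kernels are callable as fn(src, dst, factor=...), that the array length is taken from the array passed in, and that contiguous slices are accepted as arguments. None of this is confirmed here. scale_with_tail and the two stand-in functions are hypothetical names so the example runs without Eä.

```python
import numpy as np

LANES = 8

def scale_with_tail(simd_fn, scalar_fn, src, dst, factor):
    """Use simd_fn for the multiple-of-8 prefix, scalar_fn for the rest."""
    n = src.shape[0]
    main = n - (n % LANES)  # largest multiple of 8 that fits in n
    if main > 0:
        simd_fn(src[:main], dst[:main], factor=factor)
    if main < n:
        scalar_fn(src[main:], dst[main:], factor=factor)

# Stand-ins with the same call shape as the two kernel functions.
def simd_stand_in(src, dst, factor):
    assert src.shape[0] % LANES == 0  # mirrors the kernel's requirement
    np.multiply(src, factor, out=dst)

def scalar_stand_in(src, dst, factor):
    np.multiply(src, factor, out=dst)

src = np.arange(21, dtype=np.float32)  # 16 elements in the prefix, 5 in the tail
dst = np.zeros_like(src)
scale_with_tail(simd_stand_in, scalar_stand_in, src, dst, 2.0)
```

The slices are views into the original arrays, so no data is copied and both calls write directly into dst.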
Step 4: Compare performance
import ea
import numpy as np
import time
kernel = ea.load("scale.ea")
src = np.random.randn(10_000_000).astype(np.float32)
dst = np.empty_like(src)
# Eä scalar
start = time.perf_counter()
for _ in range(100):
    kernel.scale(src, dst, factor=2.0)
ea_scalar = (time.perf_counter() - start) / 100
# Eä SIMD
start = time.perf_counter()
for _ in range(100):
    kernel.scale_simd(src, dst, factor=2.0)
ea_simd = (time.perf_counter() - start) / 100
# NumPy
start = time.perf_counter()
for _ in range(100):
    np.multiply(src, 2.0, out=dst)
numpy_time = (time.perf_counter() - start) / 100
print(f"Eä scalar: {ea_scalar*1000:.2f} ms")
print(f"Eä SIMD: {ea_simd*1000:.2f} ms")
print(f"NumPy: {numpy_time*1000:.2f} ms")
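For this simple operation, all three should take about the same time, because the loop is memory-bandwidth-bound: each element loaded gets only one multiply, so the CPU spends most of its time waiting on memory rather than computing. Eä shines on compute-bound workloads where there are multiple operations per element. See the Cookbook for real-world comparisons.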
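The bandwidth-bound claim can be checked with a little arithmetic on the benchmark's own numbers. The sketch below counts only the obvious traffic, one 4-byte read of src and one 4-byte write of dst per element. Real hardware may move more, for example when a store first has to fetch the destination cache line. The function names are hypothetical helpers for this example.

```python
def bytes_moved(n, itemsize=4):
    """Bytes read from src plus bytes written to dst for n elements."""
    return 2 * n * itemsize

def effective_bandwidth_gbs(n, seconds, itemsize=4):
    """Memory throughput in GB/s implied by one call taking `seconds`."""
    return bytes_moved(n, itemsize) / seconds / 1e9

def arithmetic_intensity(flops_per_element=1, itemsize=4):
    """Floating-point operations per byte of memory traffic."""
    return flops_per_element / (2 * itemsize)

n = 10_000_000
print(bytes_moved(n) / 1e6)    # 80.0 (MB per call)
print(arithmetic_intensity())  # 0.125 (FLOP per byte)
```

Plugging a measured time into effective_bandwidth_gbs, for example effective_bandwidth_gbs(n, ea_simd), gives a figure to compare against the machine's memory bandwidth. If all three variants land near the same value, memory is the limit rather than arithmetic.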