ARM / NEON

Ea supports AArch64 with NEON vector instructions. This page documents the differences from x86 and how to write portable kernels.

Vector Width

ARM NEON provides 128-bit vector registers. Only 128-bit vector types are available:

Supported    Not Supported
f32x4        f32x8, f32x16
f64x2        f64x4, f64x8
i32x4        i32x8, i32x16
i16x8        i16x16
i8x16        i8x32
u8x16        u8x32
u16x8        u16x16

Using a 256-bit or 512-bit vector type on an ARM target produces a compile error. This is intentional -- Ea does not silently fall back to scalar code.
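The width rule can be checked by hand: lane width times lane count must equal 128. The Python sketch below illustrates that arithmetic for type names such as f32x4. The helper names vector_bits and fits_neon are hypothetical and not part of Ea; the compiler performs the real check.

```python
import re

NEON_REGISTER_BITS = 128  # NEON vector registers are 128 bits wide


def vector_bits(type_name: str) -> int:
    """Return the total width in bits of a type name such as 'f32x4'."""
    match = re.fullmatch(r"[fiu](\d+)x(\d+)", type_name)
    if match is None:
        raise ValueError(f"not a vector type name: {type_name!r}")
    lane_bits, lanes = int(match.group(1)), int(match.group(2))
    return lane_bits * lanes


def fits_neon(type_name: str) -> bool:
    """True when the type occupies exactly one 128-bit NEON register."""
    return vector_bits(type_name) == NEON_REGISTER_BITS
```

For example, fits_neon("i8x16") is true (8 x 16 = 128 bits), while fits_neon("f32x8") is false because the type needs 256 bits.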

Unavailable Intrinsics

The following intrinsics are x86-only and produce a compile error on ARM:

Intrinsic                        Reason
movemask(v)                      No ARM equivalent for extracting lane sign bits to a bitmask
gather(ptr, indices)             No hardware gather support in NEON
scatter(ptr, indices, values)    AVX-512 only
maddubs_i16(a, b)                x86 PMADDUBSW instruction, no NEON equivalent
maddubs_i32(a, b)                x86 specific

All other intrinsics (loads, stores, math, reductions, splat, select, shuffle, conversions) work on ARM.
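To make the movemask entry concrete, here is a scalar Python sketch of "extracting lane sign bits to a bitmask" for signed-integer lanes. It follows the x86 convention that lane 0 maps to bit 0; whether Ea's movemask matches this lane order and lane type exactly is an assumption, and movemask_reference is a hypothetical name used only for illustration.

```python
def movemask_reference(lanes):
    """Pack the sign bit of each signed-integer lane into one integer.

    Lane 0 maps to bit 0 (assumed, following the x86 MOVMSK convention).
    """
    mask = 0
    for index, value in enumerate(lanes):
        if value < 0:  # for signed integers, negative means the sign bit is set
            mask |= 1 << index
    return mask
```

With lanes [-1, 0, 5, -7], lanes 0 and 3 are negative, so the result is 0b1001. x86 produces this value with a single instruction, which is why the intrinsic is not offered on ARM.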

Cross-Compilation

Compile for ARM from an x86 host:

ea kernel.ea --lib --target-triple=aarch64-unknown-linux-gnu

When compiling natively on an ARM machine, no special flags are needed:

ea kernel.ea --lib

The --avx512 flag is rejected on ARM targets with a compile error.
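A build script that targets both architectures might assemble these command lines programmatically. The sketch below uses only the flags shown on this page (--lib, --target-triple, --avx512); the function name build_ea_command and its early ARM check are illustrative assumptions, not part of the Ea tooling.

```python
def build_ea_command(source, target_triple=None, avx512=False):
    """Build an argument list for the ea CLI from the flags shown above."""
    # Fail early instead of waiting for the compiler to reject the flag.
    # This only catches an explicit aarch64 triple, not a native ARM host.
    if avx512 and target_triple is not None and target_triple.startswith("aarch64"):
        raise ValueError("--avx512 is not valid for ARM targets")

    command = ["ea", source, "--lib"]
    if target_triple is not None:
        command.append(f"--target-triple={target_triple}")
    if avx512:
        command.append("--avx512")
    return command
```

The returned list can be passed to subprocess.run(command, check=True). Calling build_ea_command("kernel.ea", "aarch64-unknown-linux-gnu") reproduces the cross-compilation command above.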

Writing Portable Kernels

Strategy 1: Use 128-bit types everywhere

Use f32x4, i32x4, etc. These work on both x86 (SSE) and ARM (NEON).

kernel scale(data: *mut f32, factor: f32) range(n) step(4) {
    let v: f32x4 = load(data, i);
    let f: f32x4 = splat(factor);
    store(data, i, v .* f);
}

This sacrifices throughput on x86 (which could use f32x8 with AVX2) but runs everywhere.

Strategy 2: Separate kernel files

Write platform-specific kernels in separate files and load the right one at runtime:

# kernel_x86.ea  -- uses f32x8 for AVX2
# kernel_arm.ea  -- uses f32x4 for NEON
import platform

import ea

# 64-bit ARM is reported as "aarch64" on Linux and "arm64" on macOS.
if platform.machine().lower() in ("aarch64", "arm64"):
    k = ea.load("kernel_arm.ea")
else:
    k = ea.load("kernel_x86.ea")

Both files export the same function signatures, so the calling code does not change.

Strategy 3: Use the kernel construct

The kernel construct with step(N) lets you pick the vector width per file while keeping the loop logic identical. Write two .ea files with different step sizes and vector types, but the same exported function name and parameters.
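The idea that only the step changes can be modelled in plain Python. The sketch below mirrors the scale kernel from Strategy 1, with list slices standing in for vector loads and stores. It is not Ea code, scale_reference is a hypothetical name, and because this page does not describe how leftover elements are handled, the sketch assumes the length is a multiple of the step.

```python
def scale_reference(data, factor, step):
    """Scale data in place, processing `step` elements per iteration."""
    n = len(data)
    if n % step != 0:
        # Assumption of this sketch only: no remainder handling.
        raise ValueError("this sketch assumes len(data) is a multiple of step")
    for i in range(0, n, step):
        chunk = data[i:i + step]                        # stands in for load(data, i)
        data[i:i + step] = [x * factor for x in chunk]  # stands in for store(data, i, ...)
    return data
```

Running it with step=4 (the NEON width for f32) and step=8 (the AVX2 width) on the same input gives the same result; only the number of loop iterations differs.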