ARM / NEON

Eä supports AArch64 with NEON vector instructions. This page documents the differences from x86 and how to write portable kernels.

Vector Width

ARM NEON provides 128-bit vector registers, plus 64-bit D-registers for narrower operations. Available vector types on ARM:

128-bit (standard)	64-bit (NEON D-registers)	Not Supported
`f32x4`	`i8x8`, `u8x8`	`f32x8`, `f32x16`
`f64x2`	`i16x4`, `u16x4`	`f64x4`
`i32x4`, `u32x4`	`i32x2`	`i32x8`, `i32x16`
`i16x8`, `u16x8`		`i16x16`
`i8x16`, `u8x16`		`i8x32`, `u8x32`

64-bit vector types are ARM-only. Using them on x86 produces a compile error. 256-bit and 512-bit types are x86-only and produce a compile error on ARM. This is intentional -- Eä does not silently fall back to scalar code.

Unavailable Intrinsics

The following intrinsics are x86-only and produce a compile error on ARM:

Intrinsic	Reason
`movemask(v)`	No ARM equivalent for extracting lane sign bits to a bitmask
`gather(ptr, indices)`	No hardware gather support in NEON
`scatter(ptr, indices, values)`	AVX-512 only
`maddubs_i16(a, b)`	x86 PMADDUBSW instruction, no NEON equivalent
`maddubs_i32(a, b)`	x86-specific, no NEON equivalent
`round_f32x8_i32x8(a)`	AVX2 (256-bit); use `round_f32x4_i32x4` on ARM
`pack_sat_i32x8(a, b)`	AVX2 (256-bit); use `pack_sat_i32x4` on ARM
`pack_sat_i16x16(a, b)`	AVX2 (256-bit); use `pack_sat_i16x8` on ARM

All other intrinsics (loads, stores, math, reductions, splat, select, shuffle, conversions) work on ARM.

ARM-Specific Intrinsics

Dot Product (ARMv8.2-A, `--dotprod`)

Intrinsic	Signature	Description
`vdot_i32(a, b)`	`(i8x16, i8x16) -> i32x4`	Signed dot product, groups of 4 per lane. Maps to NEON `sdot`.
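To make "groups of 4 per lane" concrete, here is a pure-Python reference model of the lane arithmetic. The helper name vdot_i32_model is hypothetical and not part of Eä. It assumes the result holds only the four dot products (the signature takes no accumulator), with each 32-bit lane summing four adjacent i8 products.

```python
def vdot_i32_model(a, b):
    """Hypothetical model of vdot_i32: (i8x16, i8x16) -> i32x4, as Python lists."""
    assert len(a) == 16 and len(b) == 16
    assert all(-128 <= x <= 127 for x in a + b), "inputs must fit in i8"
    # Output lane j sums the products of input elements 4j .. 4j+3.
    return [sum(a[4 * j + k] * b[4 * j + k] for k in range(4)) for j in range(4)]


a = list(range(16))   # 0, 1, ..., 15
b = [1] * 16
print(vdot_i32_model(a, b))   # [6, 22, 38, 54]
```

A model like this can serve as the expected-value oracle when testing a compiled kernel from the host side.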

I8MM Matrix Multiply (ARMv8.6-A, `--i8mm`)

Intrinsic	Signature	Description
`smmla_i32(acc, a, b)`	`(i32x4, i8x16, i8x16) -> i32x4`	Signed x signed matrix multiply-accumulate: a 2x8 matrix times an 8x2 matrix, added to a 2x2 i32 accumulator
`ummla_i32(acc, a, b)`	`(i32x4, u8x16, u8x16) -> i32x4`	Unsigned x unsigned matrix multiply-accumulate
`usmmla_i32(acc, a, b)`	`(i32x4, u8x16, i8x16) -> i32x4`	Unsigned x signed matrix multiply-accumulate

The accumulator is explicit: pass splat(0) on the first call to start from zero, then feed each result back in as the next accumulator. Available on Cortex-A78+, Apple M1+.
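The explicit-accumulator pattern can be sketched with a pure-Python model. The function smmla_i32_model is hypothetical, and the data layout is an assumption taken from the NEON SMMLA instruction rather than from the Eä documentation: a is a 2x8 matrix stored row by row, b is an 8x2 matrix stored column by column, and the accumulator is a 2x2 matrix in row-major order. Overflow of the 32-bit lanes is not modelled.

```python
def smmla_i32_model(acc, a, b):
    """Hypothetical model of smmla_i32: (i32x4, i8x16, i8x16) -> i32x4."""
    assert len(acc) == 4 and len(a) == 16 and len(b) == 16
    out = list(acc)
    for i in range(2):        # row of a
        for j in range(2):    # column of b
            out[2 * i + j] += sum(a[8 * i + k] * b[8 * j + k] for k in range(8))
    return out


zero = [0, 0, 0, 0]                     # stands in for splat(0)
a = [1] * 8 + [2] * 8                   # rows: all ones, all twos
b = [1] * 8 + [3] * 8                   # columns: all ones, all threes
first = smmla_i32_model(zero, a, b)
second = smmla_i32_model(first, a, b)   # second call accumulates on top
print(first)    # [8, 24, 16, 48]
print(second)   # [16, 48, 32, 96]
```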

Absolute Difference (base NEON)

Intrinsic	Signature	Description
`abs_diff(a, b)`	`(T, T) -> T`	Element-wise absolute difference. Returns `|a - b|` per lane.

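Supported types: i8x16, u8x16, i16x8, u16x8, i32x4, u32x4. Maps to NEON sabd/uabd (one instruction). No x86 equivalent -- write max(a .- b, b .- a) explicitly on x86. That expression is only correct for signed lanes whose difference does not overflow. If subtraction wraps on unsigned lanes, one of the two differences wraps to a large value and wins the max, so for unsigned types subtract the smaller operand from the larger instead.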
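The following pure-Python sketch models abs_diff on u8 lanes and compares it with a subtraction-based fallback. Both function names are hypothetical. The comparison assumes that .- wraps modulo 256 on u8 lanes, which is an assumption about Eä and not something this page states.

```python
def abs_diff_u8_model(a, b):
    """Per-lane |a - b| for u8 lanes; the result always fits in u8."""
    return [max(x, y) - min(x, y) for x, y in zip(a, b)]


def wrapping_fallback_u8(a, b):
    """max(a - b, b - a) where each subtraction wraps modulo 256 (assumed)."""
    return [max((x - y) % 256, (y - x) % 256) for x, y in zip(a, b)]


a = [5, 200, 7]
b = [3, 10, 7]
print(abs_diff_u8_model(a, b))      # [2, 190, 0]
print(wrapping_fallback_u8(a, b))   # [254, 190, 0]
```

Under that assumption the first lane differs: 3 - 5 wraps to 254, which is larger than the true difference of 2.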

Widening Multiply (base NEON)

Intrinsic	Signature	Description
`wmul_i16(a, b)`	`(i8x8, i8x8) -> i16x8`	Signed widening multiply
`wmul_u16(a, b)`	`(u8x8, u8x8) -> u16x8`	Unsigned widening multiply
`wmul_i32(a, b)`	`(i16x4, i16x4) -> i32x4`	Signed widening multiply
`wmul_u32(a, b)`	`(u16x4, u16x4) -> u32x4`	Unsigned widening multiply

Input types are 64-bit NEON vectors (D-registers). Output is 128-bit. Maps to NEON smull/umull. No x86 equivalent.
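A short pure-Python model shows why the output lanes are twice as wide as the input lanes. The names wmul_i16_model and wrap_i8 are hypothetical helpers for illustration only.

```python
def wmul_i16_model(a, b):
    """Hypothetical model of wmul_i16: (i8x8, i8x8) -> i16x8."""
    assert len(a) == 8 and len(b) == 8
    assert all(-128 <= x <= 127 for x in a + b), "inputs must fit in i8"
    out = [x * y for x, y in zip(a, b)]
    # Every i8 product lies in [-16256, 16384], so it always fits in i16.
    assert all(-32768 <= p <= 32767 for p in out)
    return out


def wrap_i8(x):
    """What a same-width i8 multiply would keep: the low 8 bits, sign-extended."""
    return (x + 128) % 256 - 128


a = [-128, 127, 100, -100, 1, 0, -1, 16]
b = [-128, 127, 100, 100, 1, 5, -1, 16]
wide = wmul_i16_model(a, b)
narrow = [wrap_i8(p) for p in wide]
print(wide)     # [16384, 16129, 10000, -10000, 1, 0, 1, 256]
print(narrow)   # [0, 1, 16, -16, 1, 0, 1, 0]
```

The narrow list is what would survive if the products were truncated back to 8 bits, which is the information the widening form preserves.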

The --dotprod and --i8mm flags enable their respective extensions. Using an intrinsic without its flag produces a compile error with a hint to add it.

Cross-Platform Intrinsics

These intrinsics work on both x86 and ARM with identical semantics:

Intrinsic	Signature	x86 instruction	ARM instruction
`shuffle_bytes(table, idx)`	`(u8x16, u8x16) -> u8x16`	SSSE3 `pshufb`	NEON `tbl`
`sat_add(a, b)`	`(T, T) -> T`	SSE2 `padds`/`paddus`	NEON `sqadd`/`uqadd`
`sat_sub(a, b)`	`(T, T) -> T`	SSE2 `psubs`/`psubus`	NEON `sqsub`/`uqsub`

sat_add/sat_sub support i8x16, u8x16, i16x8, u16x8. Signed vs unsigned saturation is determined by the element type. No feature flags required (base SSE2 and NEON).
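As an illustration of how the element type selects the clamping range, here is a pure-Python model. The names RANGES, sat_add_model and sat_sub_model are hypothetical, and the element type is passed as a string only because Python lists carry no lane type.

```python
RANGES = {
    "i8": (-128, 127),
    "u8": (0, 255),
    "i16": (-32768, 32767),
    "u16": (0, 65535),
}


def sat_add_model(a, b, elem):
    """Per-lane a + b, clamped to the range of the element type."""
    lo, hi = RANGES[elem]
    return [min(hi, max(lo, x + y)) for x, y in zip(a, b)]


def sat_sub_model(a, b, elem):
    """Per-lane a - b, clamped to the range of the element type."""
    lo, hi = RANGES[elem]
    return [min(hi, max(lo, x - y)) for x, y in zip(a, b)]


print(sat_add_model([100, -100], [100, -100], "i8"))   # [127, -128]
print(sat_add_model([200, 10], [100, 20], "u8"))       # [255, 30]
print(sat_sub_model([10, 200], [20, 100], "u8"))       # [0, 100]
```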

Cross-Compilation

Compile for ARM from an x86 host:

ea kernel.ea --lib --target-triple=aarch64-unknown-linux-gnu

For intrinsics requiring hardware extensions, add the appropriate flag:

ea kernel.ea --lib --target-triple=aarch64-unknown-linux-gnu --dotprod

When compiling natively on an ARM machine, no target triple is needed. The extension flags (--dotprod, --i8mm) are still required for their respective intrinsics:

ea kernel.ea --lib
ea kernel.ea --lib --dotprod   # for vdot_i32
ea kernel.ea --lib --i8mm     # for smmla_i32, ummla_i32, usmmla_i32

The --avx512 flag is rejected on ARM targets with a compile error. The --i8mm and --dotprod flags are rejected on x86 targets.
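For build scripts, the flag rules above can be checked before invoking the compiler. The helper below is a minimal sketch: ea_command is a hypothetical name, it uses only the flags shown on this page, and it treats a triple starting with "aarch64" as ARM and one starting with "x86_64" as x86, which is an assumption about how triples are spelled. It builds the argument list without running anything.

```python
def ea_command(source, target_triple=None, dotprod=False, i8mm=False, avx512=False):
    """Build an argv list for the ea CLI (hypothetical helper, flags from this page)."""
    if target_triple is not None:
        # Mirror the documented rejections so mistakes surface before the build.
        if target_triple.startswith("aarch64") and avx512:
            raise ValueError("--avx512 is rejected on ARM targets")
        if target_triple.startswith("x86_64") and (dotprod or i8mm):
            raise ValueError("--dotprod and --i8mm are rejected on x86 targets")
    argv = ["ea", source, "--lib"]
    if target_triple is not None:
        argv.append("--target-triple=" + target_triple)
    if dotprod:
        argv.append("--dotprod")
    if i8mm:
        argv.append("--i8mm")
    if avx512:
        argv.append("--avx512")
    return argv


print(ea_command("kernel.ea", "aarch64-unknown-linux-gnu", dotprod=True))
```

The returned list can be passed to subprocess.run(argv, check=True). With no target triple the helper performs no check, because it does not try to detect the host architecture.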

Writing Portable Kernels

Strategy 1: Use 128-bit types everywhere

Use f32x4, i32x4, etc. These work on both x86 (SSE) and ARM (NEON).

kernel scale(data: *mut f32, factor: f32) range(n) step(4) {
    let v: f32x4 = load(data, i);
    let f: f32x4 = splat(factor);
    store(data, i, v .* f);
}

This sacrifices throughput on x86 (which could use f32x8 with AVX2) but runs everywhere.

Strategy 2: Separate kernel files

Write platform-specific kernels in separate files and load the right one at runtime:

# kernel_x86.ea  -- uses f32x8 for AVX2
# kernel_arm.ea  -- uses f32x4 for NEON

import platform
import ea

# Linux reports "aarch64"; macOS on Apple silicon reports "arm64".
if platform.machine().lower() in ("aarch64", "arm64"):
    k = ea.load("kernel_arm.ea")
else:
    k = ea.load("kernel_x86.ea")

Both files export the same function signatures, so the calling code does not change.

Strategy 3: Use the kernel construct

The kernel construct with step(N) lets you pick the vector width per file while keeping the loop logic identical. Write two .ea files with different step sizes and vector types, but the same exported function name and parameters.

128-bit Pack and Round on NEON

For 128-bit NEON equivalents of the x86 AVX2 pack/round intrinsics, use the cross-platform 128-bit variants:

round_f32x4_i32x4: round-to-nearest-even f32x4 to i32x4. x86: cvtps2dq. ARM: fcvtns.
pack_sat_i32x4: saturating narrow (i32x4, i32x4) -> i16x8. x86: packssdw. ARM: sqxtn.
pack_sat_i16x8: saturating narrow (i16x8, i16x8) -> i8x16. x86: packsswb. ARM: sqxtn.

The 256-bit variants (round_f32x8_i32x8, pack_sat_i32x8, pack_sat_i16x16) are x86-only and produce a compile error on ARM.
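The semantics of the 128-bit round and pack variants can be sketched in pure Python. Both function names are hypothetical. The model covers only float inputs that are within i32 range, and it assumes the packed result places the lanes of a first, followed by the lanes of b.

```python
def round_f32x4_i32x4_model(v):
    """Round to nearest, ties to even. Python's round() uses the same tie rule."""
    assert len(v) == 4
    return [round(x) for x in v]


def pack_sat_i32x4_model(a, b):
    """Saturating narrow (i32x4, i32x4) -> i16x8; lanes of a, then lanes of b (assumed)."""
    assert len(a) == 4 and len(b) == 4
    return [max(-32768, min(32767, x)) for x in a + b]


print(round_f32x4_i32x4_model([0.5, 1.5, 2.5, -3.5]))
# [0, 2, 2, -4]
print(pack_sat_i32x4_model([70000, -70000, 5, -5], [1, 2, 3, 4]))
# [32767, -32768, 5, -5, 1, 2, 3, 4]
```

Ties go to the even neighbour (0.5 becomes 0, 1.5 becomes 2), which is the case that differs from round-half-away-from-zero.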
