ARM / NEON

Eä supports AArch64 with NEON vector instructions. This page documents the differences from x86 and how to write portable kernels.

Vector Width

ARM NEON provides 128-bit vector registers, plus 64-bit D-registers for narrower operations. Available vector types on ARM:

128-bit (standard)64-bit (NEON D-registers)Not Supported
f32x4i8x8, u8x8f32x8, f32x16
f64x2i16x4, u16x4f64x4
i32x4, u32x4i32x2i32x8, i32x16
i16x8, u16x8i16x16
i8x16, u8x16i8x32, u8x32

64-bit vector types are ARM-only. Using them on x86 produces a compile error. 256-bit and 512-bit types are x86-only and produce a compile error on ARM. This is intentional -- Eä does not silently fall back to scalar code.

Unavailable Intrinsics

The following intrinsics are x86-only and produce a compile error on ARM:

IntrinsicReason
movemask(v)No ARM equivalent for extracting lane sign bits to a bitmask
gather(ptr, indices)No hardware gather support in NEON
scatter(ptr, indices, values)AVX-512 only
maddubs_i16(a, b)x86 PMADDUBSW instruction, no NEON equivalent
maddubs_i32(a, b)x86 specific
round_f32x8_i32x8(a)AVX2 (256-bit); use round_f32x4_i32x4 on ARM
pack_sat_i32x8(a, b)AVX2 (256-bit); use pack_sat_i32x4 on ARM
pack_sat_i16x16(a, b)AVX2 (256-bit); use pack_sat_i16x8 on ARM

All other intrinsics (loads, stores, math, reductions, splat, select, shuffle, conversions) work on ARM.

ARM-Specific Intrinsics

Dot Product (ARMv8.2-A, --dotprod)

IntrinsicSignatureDescription
vdot_i32(a, b)(i8x16, i8x16) -> i32x4Signed dot product, groups of 4 per lane. Maps to NEON sdot.

I8MM Matrix Multiply (ARMv8.6-A, --i8mm)

IntrinsicSignatureDescription
smmla_i32(acc, a, b)(i32x4, i8x16, i8x16) -> i32x4Signed x signed 2x8 x 8x2 matrix multiply-accumulate
ummla_i32(acc, a, b)(i32x4, u8x16, u8x16) -> i32x4Unsigned x unsigned matrix multiply-accumulate
usmmla_i32(acc, a, b)(i32x4, u8x16, i8x16) -> i32x4Unsigned x signed matrix multiply-accumulate

The accumulator is explicit. First call uses splat(0) for zero-init. Available on Cortex-A78+, Apple M1+.

Absolute Difference (base NEON)

IntrinsicSignatureDescription
abs_diff(a, b)(T, T) -> TElement-wise absolute difference. Returns `

Supported types: i8x16, u8x16, i16x8, u16x8, i32x4, u32x4. Maps to NEON sabd/uabd (one instruction). No x86 equivalent -- use max(a .- b, b .- a) explicitly on x86.

Widening Multiply (base NEON)

IntrinsicSignatureDescription
wmul_i16(a, b)(i8x8, i8x8) -> i16x8Signed widening multiply
wmul_u16(a, b)(u8x8, u8x8) -> u16x8Unsigned widening multiply
wmul_i32(a, b)(i16x4, i16x4) -> i32x4Signed widening multiply
wmul_u32(a, b)(u16x4, u16x4) -> u32x4Unsigned widening multiply

Input types are 64-bit NEON vectors (D-registers). Output is 128-bit. Maps to NEON smull/umull. No x86 equivalent.

The --dotprod and --i8mm flags enable their respective extensions. Using an intrinsic without its flag produces a compile error with a hint to add it.

Cross-Platform Intrinsics

These intrinsics work on both x86 and ARM with identical semantics:

IntrinsicSignaturex86 instructionARM instruction
shuffle_bytes(table, idx)(u8x16, u8x16) -> u8x16SSSE3 pshufbNEON tbl
sat_add(a, b)(T, T) -> TSSE2 padds/paddusNEON sqadd/uqadd
sat_sub(a, b)(T, T) -> TSSE2 psubs/psubusNEON sqsub/uqsub

sat_add/sat_sub support i8x16, u8x16, i16x8, u16x8. Signed vs unsigned saturation is determined by the element type. No feature flags required (base SSE2 and NEON).

Cross-Compilation

Compile for ARM from an x86 host:

ea kernel.ea --lib --target-triple=aarch64-unknown-linux-gnu

For intrinsics requiring hardware extensions, add the appropriate flag:

ea kernel.ea --lib --target-triple=aarch64-unknown-linux-gnu --dotprod

When compiling natively on an ARM machine, no special flags are needed (except --dotprod for dot product intrinsics):

ea kernel.ea --lib
ea kernel.ea --lib --dotprod   # for vdot_i32
ea kernel.ea --lib --i8mm     # for smmla_i32, ummla_i32, usmmla_i32

The --avx512 flag is rejected on ARM targets with a compile error. The --i8mm and --dotprod flags are rejected on x86 targets.

Writing Portable Kernels

Strategy 1: Use 128-bit types everywhere

Use f32x4, i32x4, etc. These work on both x86 (SSE) and ARM (NEON).

kernel scale(data: *mut f32, factor: f32) range(n) step(4) {
    let v: f32x4 = load(data, i);
    let f: f32x4 = splat(factor);
    store(data, i, v .* f);
}

This sacrifices throughput on x86 (which could use f32x8 with AVX2) but runs everywhere.

Strategy 2: Separate kernel files

Write platform-specific kernels in separate files and load the right one at runtime:

# kernel_x86.ea  -- uses f32x8 for AVX2
# kernel_arm.ea  -- uses f32x4 for NEON
import platform
if platform.machine() == "aarch64":
    k = ea.load("kernel_arm.ea")
else:
    k = ea.load("kernel_x86.ea")

Both files export the same function signatures, so the calling code does not change.

Strategy 3: Use the kernel construct

The kernel construct with step(N) lets you pick the vector width per file while keeping the loop logic identical. Write two .ea files with different step sizes and vector types, but the same exported function name and parameters.

128-bit Pack and Round on NEON

For 128-bit NEON equivalents of the x86 AVX2 pack/round intrinsics, use the cross-platform 128-bit variants:

  • round_f32x4_i32x4: round-to-nearest-even f32x4 to i32x4. x86: cvtps2dq. ARM: fcvtns.
  • pack_sat_i32x4: saturating narrow (i32x4, i32x4) -> i16x8. x86: packssdw. ARM: sqxtn.
  • pack_sat_i16x8: saturating narrow (i16x8, i16x8) -> i8x16. x86: packsswb. ARM: sqxtn.

The 256-bit variants (round_f32x8_i32x8, pack_sat_i32x8, pack_sat_i16x16) are x86-only and produce a compile error on ARM.