NEON Gather Workaround
gather(p: *f32, idx: i32xN) -> f32xN lowers to AVX2's vgatherdps
on x86 — one instruction, four (or eight) indexed loads, one result
vector. On AArch64 NEON there is no such instruction. Pi 5's Cortex-A76
has no SVE / SVE2 either, so there's no ld1w {z0.s}, p0/z, [x1, z1.s, sxtw] fallback. A real gather is hardware-impossible on this target.
Pre-v1.11.0, Eä errored on gather() for ARM with a message that said
"use a scalar loop on ARM" — true but unhelpful, and the de-facto
workaround everyone arrived at (a stack-buffer round-trip through
memory) was strictly worse than the canonical scalar-compose pattern
already used by llama.cpp's IQ kernels.
v1.11.0 ships two new vector-construction intrinsics —
f32x4_from_scalars and f32x8_from_scalars — and rewrites the ARM
gather() error to point at them by name plus
docs/idioms/neon-gather.md. The pattern
that used to be "figure it out yourself" is now a one-line compose with
a documented idiom.
The IQ3 motivation
Olorin's IQ3_S / IQ3_XXS dequant kernels need exactly this. Each block holds packed 3-bit indices into a 256-entry lookup table of dequantized f32 values. The hot loop reads four indices, fetches four LUT entries, and writes them out as an f32x4 vector.
On x86 the kernel is one line:
// AVX2 gather: works on x86 but errors on ARM with a pointer to f32x{4,8}_from_scalars.
export func lut_dequant_x86(lut: *f32, idx: i32x4, out: *mut f32) {
let v: f32x4 = gather(lut, idx)
store(out, 0, v)
}
That kernel hard-errors on AArch64 at codegen time with:
gather has no NEON equivalent on ARM. Use scalar load_u32 + f32x4_from_scalars (or f32x8_from_scalars) to compose the result explicitly. See docs/idioms/neon-gather.md for the canonical pattern.
The compose pattern
The workaround is to read the four (or eight) values one at a time and build the vector explicitly. The new intrinsic does the building:
// IQ3 LUT dequant pattern: gather four entries from a lookup table.
export func iq3_lut_dequant_lane(lut: *f32, indices: *i32, out: *mut f32) {
let i0: i32 = indices[0]
let i1: i32 = indices[1]
let i2: i32 = indices[2]
let i3: i32 = indices[3]
let v0: f32 = lut[i0]
let v1: f32 = lut[i1]
let v2: f32 = lut[i2]
let v3: f32 = lut[i3]
let v: f32x4 = f32x4_from_scalars(v0, v1, v2, v3)
store(out, 0, v)
}
The same shape works for an 8-wide gather over an i32x8 index vector
via f32x8_from_scalars(...) — same idea, eight scalar loads instead
of four.
For the real per-row loop, the four scalar reads sit visibly inside an ordinary while loop:
// IQ3 LUT dequant row: walk an index array, fetch 4 LUT entries at a time,
// store into the output stream. The four scalar loads sit visibly in the
// source — the programmer sees the cost.
export func iq3_dequant_row(lut: *f32, indices: *i32, out: *mut f32, n: i32) {
let mut i: i32 = 0
while i < n {
let v0: f32 = lut[indices[i]]
let v1: f32 = lut[indices[i + 1]]
let v2: f32 = lut[indices[i + 2]]
let v3: f32 = lut[indices[i + 3]]
let v: f32x4 = f32x4_from_scalars(v0, v1, v2, v3)
store(out, i, v)
i = i + 4
}
}
Build with --target-triple=aarch64-unknown-linux-gnu. LLVM's NEON
lowering folds the insertelement chain inside f32x4_from_scalars
into a sequence of ins v.s[i] instructions — the same code GCC and
Clang produce for vsetq_lane_f32 chains. No stack buffer, no
load-and-shuffle dance.
Why no silent fallback?
Eä's design rule is "the programmer sees the cost." A silent scalar
fallback for gather() on ARM would hide a 4-8× slowdown behind syntax
that looks like SIMD. The compose pattern keeps the cost explicit:
four scalar reads appear as four scalar reads in the source. A
profiler shows four ldr s0, [x1, x2, lsl #2] instructions, not a
deceptively single-line gather that hides them. Reviewer reading the
diff sees the loads. Future you reading the source in six months sees
the loads. There is no hidden performance cliff.
This is the same philosophy that gates --fp16 behind a flag and
errors on f16x8 arithmetic without it: explicit cost beats convenient
syntax that masks slowdown.
When SVE2 lands
If your ARM target is Apple M4, Graviton 3+, or Snapdragon X, SVE2 does
have a real gather (ld1w with a scatter-gather addressing mode). Eä's
SVE2 codegen is deferred to a future release. Until then, the compose
pattern is the universal AArch64 workaround: it works on Pi 5 today, and
it stays correct on M4 / Graviton even after a hardware-gather lowering
lands (it just gets superseded by the better path on that target).
See also
- Terse reference idiom:
docs/idioms/neon-gather.md - Test coverage:
tests/phase14_arm_neon.rs(test_f32x4_from_scalars,test_f32x8_from_scalars,test_gather_on_arm_points_to_compose)