All Intrinsics
Complete reference for every built-in function in Eä.
Memory
load
Load a vector from a pointer at a byte offset. The return type is inferred from context.
let v: f32x8 = load(ptr, i);
Typed Scalar Loads
Load a single scalar value from a pointer at a byte offset.
| Intrinsic | Return Type |
|---|---|
| load_f32(ptr, i) | f32 |
| load_f64(ptr, i) | f64 |
| load_i32(ptr, i) | i32 |
| load_i16(ptr, i) | i16 |
| load_i8(ptr, i) | i8 |
| load_u8(ptr, i) | u8 |
| load_u16(ptr, i) | u16 |
| load_u32(ptr, i) | u32 |
| load_u64(ptr, i) | u64 |
let x: f32 = load_f32(ptr, i);
Typed Vector Loads
Load a full vector from a pointer at a byte offset.
| Intrinsic | Return Type |
|---|---|
| load_f32x4(ptr, i) | f32x4 |
| load_f32x8(ptr, i) | f32x8 |
| load_f32x16(ptr, i) | f32x16 |
| load_i32x4(ptr, i) | i32x4 |
| load_i32x8(ptr, i) | i32x8 |
| load_i16x8(ptr, i) | i16x8 |
| load_i8x16(ptr, i) | i8x16 |
| load_u8x16(ptr, i) | u8x16 |
| load_u8x32(ptr, i) | u8x32 |
let v: f32x8 = load_f32x8(data, i * 32);
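The byte-offset addressing in the example above can be modelled on the host. The following is a minimal Python sketch, not Eä code: `load_f32x8_model` is a hypothetical name, and little-endian lane order is assumed.

```python
import struct

def load_f32x8_model(buf, byte_offset):
    """Read eight little-endian f32 lanes starting at byte_offset (hypothetical model)."""
    return list(struct.unpack_from("<8f", buf, byte_offset))

# 16 floats = two f32x8 vectors of 32 bytes each.
data = struct.pack("<16f", *[float(k) for k in range(16)])

# i = 1, so the byte offset is i * 32 and the load returns lanes 8..15.
second = load_f32x8_model(data, 1 * 32)
print(second)
```

Under this model, stepping `i` by one advances the load by one whole f32x8 vector because the offset is scaled by the vector size in bytes.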
store
Write a vector to a pointer at a byte offset.
store(out, i, result);
stream_store
Non-temporal store that bypasses the CPU cache. Use for write-only output
the kernel will not read back soon. Pairs with prefetch_nta (the read-side
non-temporal hint) and fence_nt (the ordering primitive).
stream_store(out, i, v) // vector form — v is f32xN, i32xN, etc.
stream_store(out, i, scalar_value) // scalar form (v1.15.0) — i16/u16/i32/u32/i64/u64
Target lowering (verified on LLVM 18.1.8, x86_64 Zen 4 and aarch64 Cortex-A76):
| Width / form | x86 | aarch64 |
|---|---|---|
| Vector 128-bit (f32x4, i32x4, ...) | movntps / movntdq | stnp d, d, [x] (LLVM splits the 128-bit q-register into a d-pair to use the only available aarch64 NT store) |
| Vector 256-bit (AVX2) | vmovntps / vmovntdq | n/a (256-bit not on NEON) |
| Vector 512-bit (AVX-512) | vmovntps / vmovntdq zmm | n/a |
| Scalar i64 / u64 | movnti (SSE2, 64-bit mode) | stnp w, w, [x] (LLVM splits the i64 into a w-pair to use stnp; emits an lsr for the high half) |
| Scalar i32 / u32 | movnti (SSE2) | plain str — NT hint silently dropped |
| Scalar i16 / u16 | regular mov — NT hint silently dropped | plain strh — NT hint silently dropped |
aarch64 has no scalar non-temporal store instruction. The only NT store on
aarch64 is stnp (Store Non-temporal Pair), which requires two operands. LLVM
18 honors !nontemporal only when it can synthesize an stnp:
- 64-bit scalars (i64/u64) self-pair to a w register pair: NT hint preserved.
- 128-bit vectors self-pair to a d register pair: NT hint preserved.
- 32-bit and 16-bit scalars have no pair-friendly form: LLVM emits plain str/strh and the NT hint is dropped silently.
For aarch64 NT semantics, prefer 64-bit or wider element widths. The i32/u32
and i16/u16 scalar overloads still type-check and run, but provide no cache-
bypass benefit on aarch64; the same is true of i16/u16 on x86 (no movnti16
exists). They ship for cross-platform shape symmetry — a single Eä kernel using
stream_store compiles and runs on both targets without per-width branching.
Note also that LLVM 18 does not fuse two adjacent stream_store(*mut i32, ...)
calls into a single stnp pair on aarch64 — they lower to a regular stp with
the NT hint dropped. If you need NT-paired stores on aarch64, write the data as
i64 (two 32-bit values packed) or as a vector type.
Alignment contract:
Vector stream_store requires the destination pointer plus byte offset to
be aligned to the vector's natural size (16 bytes for 128-bit, 32 bytes for
256-bit, 64 bytes for 512-bit). Misaligned NT vector stores raise a general
protection fault on x86. Scalar stream_store requires natural alignment to
the scalar size on x86 (4-byte for i32/u32, 8-byte for i64/u64). Callers
must provide aligned buffers; Eä does not insert runtime alignment checks.
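Because Eä inserts no runtime alignment checks, a caller may want to validate addresses on the host before launching a kernel. The sketch below is one way to express the contract in Python; `is_nt_store_aligned` is a hypothetical helper, not part of Eä.

```python
def is_nt_store_aligned(base_addr, byte_offset, store_bytes):
    """True if base_addr + byte_offset is a multiple of the store width in bytes."""
    return (base_addr + byte_offset) % store_bytes == 0

# A 256-bit (32-byte) vector store into a buffer that starts at 0x1000.
print(is_nt_store_aligned(0x1000, 64, 32))  # offset 64 is a multiple of 32
print(is_nt_store_aligned(0x1000, 16, 32))  # offset 16 is not
```

Note that the check applies to the sum of pointer and offset, so an aligned buffer combined with an unaligned offset still violates the contract.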
Ordering contract:
NT stores are weakly ordered on x86 (write-combining memory order). Other
cores or subsequent reads in the same thread may observe them out of
program order. For cross-thread visibility, the typical pathway is a
host-side synchronization primitive after the kernel returns
(pthread_join, rayon::scope, WaitGroup.Wait); these provide release
semantics that flush WC buffers. For intra-kernel ordering of NT stores
relative to each other and to later regular stores, use fence_nt()
explicitly. fence_nt() does not order stores relative to later loads, so
it does not make a write-then-read-back of the same memory safe within one
kernel call. Eä does not insert an implicit fence at kernel return.
When NOT to use:
Do not use stream_store for working buffers the same kernel reads back.
The non-temporal hint asks the cache to not keep the line; if the kernel
reads the data soon afterward, the read goes to DRAM and is slower than a
regular store followed by a cache hit. Working-buffer examples that
should use plain store:
- Softmax accumulators (e.g. scores_buf in attention kernels)
- FWHT scratch arrays (e.g. scratch in JL-projection kernels)
- Per-iteration partial sums or running statistics
stream_store is appropriate when the destination is a final output passed
to the next kernel call, a memory region the current kernel never re-reads,
or a buffer that will not be touched again until a downstream consumer
pulls it from DRAM later.
fence_nt
Store-store memory barrier providing intra-kernel ordering of preceding
stream_store operations. Zero arguments, returns void.
fence_nt()
Target lowering:
| Target | Instruction |
|---|---|
| x86 | sfence (via @llvm.x86.sse.sfence) |
| aarch64 | dmb ishst (via @llvm.aarch64.dmb with operand 10) |
These are the narrowest available barriers for store-only ordering —
explicit target intrinsics rather than the IR-level fence release, which
would lower to mfence on x86 and dmb ish on aarch64 (both heavier than
needed for NT-store ordering).
Semantics:
fence_nt() orders stream_store writes relative to each other and
relative to subsequent regular stores. It does not order stores relative
to subsequent loads — for a write-then-read-back pattern in the same kernel,
a full barrier (mfence on x86, dmb sy on aarch64) is needed instead.
Eä does not currently expose a full-barrier intrinsic.
When to use:
Use fence_nt() when the same kernel writes via stream_store to multiple
non-overlapping regions in a defined order and a later kernel (or downstream
reader) relies on observing those writes in the same order. This is
uncommon — most callers don't need it, because cross-thread visibility
comes from the host's sync primitive (pthread_join, rayon::scope,
WaitGroup.Wait) which already provides release semantics that flush
write-combining buffers between threads.
When NOT to use:
- Between successive stream_store calls to different addresses if no store-ordering requirement exists. NT stores to the same address complete in program order regardless of fences.
- At the end of a kernel as "insurance". The caller's sync primitive handles cross-thread fencing more efficiently after the kernel returns.
- For write-then-read-back patterns. fence_nt() does not provide store-to-load ordering. Use a regular store for the working data, or do not read NT-written data back in the same kernel.
load_masked
Masked vector load. Lanes where the mask is false are not loaded.
let v: f32x8 = load_masked(ptr, i, mask);
store_masked
Masked vector store. Only lanes where the mask is true are written.
store_masked(out, i, value, mask);
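A scalar model can make the masked-store rule concrete. This Python sketch is illustrative only: `store_masked_model` is a hypothetical name, and `start` is a list index here rather than a byte offset.

```python
def store_masked_model(out, start, value, mask):
    """Write value[lane] to out[start + lane] only where mask[lane] is true."""
    for lane, (v, m) in enumerate(zip(value, mask)):
        if m:
            out[start + lane] = v

# Tail handling: only 3 of 8 lanes are valid, so the other 5 stay untouched.
out = [0] * 8
store_masked_model(out, 0, [1, 2, 3, 4, 5, 6, 7, 8], [True] * 3 + [False] * 5)
print(out)
```

Lanes with a false mask leave the destination unchanged, which is what makes the operation usable for loop tails.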
gather
Load elements from scattered memory addresses using an index vector. x86 only -- not available on ARM.
let v: f32x8 = gather(ptr, indices);
scatter
Store elements to scattered memory addresses using an index vector. AVX-512 only (--avx512 flag required).
scatter(ptr, indices, values);
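The data movement of gather and scatter can be sketched with plain lists. This Python model assumes the index vector holds element indices; the text above does not say whether Eä indices are element-based or byte-based, so treat that as an assumption. Both function names are hypothetical.

```python
def gather_model(memory, indices):
    """result[k] = memory[indices[k]] (element indices assumed)."""
    return [memory[i] for i in indices]

def scatter_model(memory, indices, values):
    """memory[indices[k]] = values[k] (element indices assumed)."""
    for i, v in zip(indices, values):
        memory[i] = v

memory = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0]
picked = gather_model(memory, [7, 0, 3, 3])
scatter_model(memory, [1, 2], [-1.0, -2.0])
print(picked, memory)
```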
prefetch
Issue a read-intent prefetch hint to bring data into all cache levels (T0).
prefetch(ptr, i)
Lowers to prefetcht0 on x86 / prfm pldl1keep on aarch64.
prefetch_write
Issue a write-intent prefetch hint. Signals the cache coherence protocol to acquire the target line in modified state ahead of the store, avoiding a read-for-ownership stall when the store retires. Use on the upcoming write target of memory-bound store-heavy kernels (e.g. chacha20 ciphertext output, dequantize destinations).
prefetch_write(ptr, i)
Lowers to prefetchw on x86 (requires PRFCHW CPUID; falls back to
prefetcht0 on older CPUs) / prfm pstl1keep on aarch64.
prefetch_nta
Issue a non-temporal prefetch hint: bring the line into L1 only and mark it for early eviction. Use for streaming reads of data that the kernel touches exactly once and that should not pollute L1/L2 (e.g. Q4 dequantize input, large one-pass scans).
prefetch_nta(ptr, i)
Lowers to prefetchnta on x86 / prfm pldl1strm on aarch64.
Prefetch hint summary
| Intrinsic | (rw, locality) | x86 | aarch64 |
|---|---|---|---|
| prefetch | (0, 3) | prefetcht0 | prfm pldl1keep |
| prefetch_write | (1, 3) | prefetchw | prfm pstl1keep |
| prefetch_nta | (0, 0) | prefetchnta | prfm pldl1strm |
All three accept (ptr, integer-offset), return void, and are valid in
any expression-statement position inside a function body.
Math
sqrt
Square root. Works on scalar f32/f64 and all float vector types.
let y: f32 = sqrt(x);
let v: f32x8 = sqrt(vec);
rsqrt
Reciprocal square root (approximate). Scalar f32 and f32 vector types.
let y: f32 = rsqrt(x);
let v: f32x8 = rsqrt(vec);
exp
Exponential function. Float types.
let y: f32 = exp(x);
fma
Fused multiply-add: computes a * b + c in a single operation with one rounding step. Works on scalar f32 and all float vector types.
let y: f32 = fma(a, b, c);
let v: f32x8 = fma(va, vb, vc);
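The effect of a single rounding step can be shown numerically. The sketch below is a Python model, not Eä: it uses exact rational arithmetic to emulate a fused result, and it works in f64 (Python's float) rather than f32, though the principle is the same. `fma_model` is a hypothetical name.

```python
from fractions import Fraction

def fma_model(a, b, c):
    """Compute a * b + c exactly, then round once to the nearest float."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a = 1.0 + 2.0 ** -27
c = -(1.0 + 2.0 ** -26)

# Unfused: a * a is rounded first and loses its 2**-54 term, so the sum cancels to 0.
unfused = a * a + c
# Fused: the 2**-54 term survives because rounding happens only once, at the end.
fused = fma_model(a, a, c)
print(unfused, fused)
```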
min
Element-wise minimum. Works on scalar (i32, f32, f64) and vector types.
let m: i32 = min(a, b);
let v: f32x8 = min(va, vb);
max
Element-wise maximum. Works on scalar (i32, f32, f64) and vector types.
let m: i32 = max(a, b);
let v: f32x8 = max(va, vb);
Reduction
Reduce a vector to a single scalar value.
reduce_add
Sum all lanes.
let sum: f32 = reduce_add(v); // f32x8 -> f32
let sum: i32 = reduce_add(iv); // i32x8 -> i32
reduce_max
Maximum across all lanes.
let m: f32 = reduce_max(v);
reduce_min
Minimum across all lanes.
let m: f32 = reduce_min(v);
reduce_add_fast
Unordered float reduction. Faster than reduce_add but does not guarantee summation order, so results may differ slightly due to floating-point rounding. Float vectors only.
let sum: f32 = reduce_add_fast(v);
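Why summation order matters can be seen with a small Python experiment. The two orders below are illustrative; the text above does not specify which order reduce_add_fast actually uses, and both function names are hypothetical.

```python
def reduce_sequential(lanes):
    """Add lanes left to right."""
    total = 0.0
    for x in lanes:
        total += x
    return total

def reduce_pairwise(lanes):
    """Add lanes as a balanced tree: (l0 + l1) + (l2 + l3)."""
    if len(lanes) == 1:
        return lanes[0]
    mid = len(lanes) // 2
    return reduce_pairwise(lanes[:mid]) + reduce_pairwise(lanes[mid:])

lanes = [1e16, 1.0, -1e16, 1.0]
print(reduce_sequential(lanes), reduce_pairwise(lanes))
```

The same four inputs give different sums because each 1.0 is absorbed or preserved depending on which partial sum it meets first.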
Vector
splat
Broadcast a scalar value to all lanes of a vector. The vector type is inferred from context.
let v: f32x8 = splat(1.0);
shuffle
Compile-time index-driven vector shuffle. Two forms:
Single-source — permute lanes within one vector.
let reversed: f32x4 = shuffle(v, [3, 2, 1, 0])
Each index is in [0, width).
Two-source — pick lanes from two vectors of the same type.
let zipped: f32x8 = shuffle(a, b, [0, 8, 1, 9, 2, 10, 3, 11])
Indices in [0, width) select from a; indices in [width, 2 * width) select lane i - width from b. Common patterns: interleave (lower-half zip), blend (pick lanes by index), and concatenate-permute (lower-half from one source, upper from the other with permutation).
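The index rules for both forms can be written as a short reference model. This is a Python sketch with hypothetical function names; it mirrors the rule that indices below `width` select from `a` and the rest select lane `i - width` from `b`.

```python
def shuffle1_model(v, indices):
    """Single-source form: each index must be in [0, width)."""
    width = len(v)
    assert all(0 <= i < width for i in indices)
    return [v[i] for i in indices]

def shuffle2_model(a, b, indices):
    """Two-source form: [0, width) picks from a, [width, 2 * width) from b."""
    width = len(a)
    assert len(b) == width and all(0 <= i < 2 * width for i in indices)
    return [a[i] if i < width else b[i - width] for i in indices]

reversed_v = shuffle1_model([10, 20, 30, 40], [3, 2, 1, 0])
a = [0, 1, 2, 3, 4, 5, 6, 7]
b = [100, 101, 102, 103, 104, 105, 106, 107]
zipped = shuffle2_model(a, b, [0, 8, 1, 9, 2, 10, 3, 11])
print(reversed_v, zipped)
```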
permute_runtime
Runtime-indexed permute of an 8-lane vector. result[k] = table[indices[k] & 0x7]. Index lanes are 3-bit-masked by the hardware — the upper 29 bits of each i32 index are ignored. Out-of-range indices wrap; they do not trap.
- x86: single instruction (vpermps for f32, vpermd for i32). Requires AVX2.
- ARM (NEON): not supported. Compile-time error pointing at the NEON runtime-permute idiom.
let table: f32x8 = load(matrix_row, 0) // 6 active + 2 don't-care
let indices: i32x8 = load(types_v, 0) // values in [0..5]
let strengths: f32x8 = permute_runtime(table, indices)
| Signature | (f32x8, i32x8) -> f32x8, (i32x8, i32x8) -> i32x8 |
|---|---|
See also: shuffle (compile-time indices), gather (pointer-indexed), shuffle_bytes (byte-level, 16-byte table, cross-platform).
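The 3-bit index masking of permute_runtime can be checked against a scalar model. The Python sketch below follows the formula result[k] = table[indices[k] & 0x7]; `permute_runtime_model` is a hypothetical name.

```python
def permute_runtime_model(table, indices):
    """result[k] = table[indices[k] & 0x7]; only the low 3 bits of each index matter."""
    assert len(table) == 8 and len(indices) == 8
    return [table[i & 0x7] for i in indices]

table = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
# 8 wraps to lane 0, 13 wraps to lane 5, and -1 (all bits set) wraps to lane 7.
result = permute_runtime_model(table, [0, 5, 8, 13, -1, 2, 2, 7])
print(result)
```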
select
Per-lane conditional select. Where the mask is true, take from a; where false, take from b.
let result: f32x8 = select(mask, a, b);
movemask
Pack a comparison result (a boolean vector) into a scalar i32 bitmask. Each bit corresponds to the sign bit of one lane. x86 only -- not available on ARM.
let bits: i32 = movemask(cmp_result);
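select and movemask can both be modelled per lane. The Python sketch below uses hypothetical names and assumes lane 0 maps to bit 0 of the result, which is the convention of the x86 movemask instructions.

```python
def select_model(mask, a, b):
    """Take a[k] where mask[k] is true, otherwise b[k]."""
    return [x if m else y for m, x, y in zip(mask, a, b)]

def movemask_model(mask):
    """Pack one bit per lane into an integer; lane 0 is bit 0 (assumed)."""
    bits = 0
    for lane, m in enumerate(mask):
        if m:
            bits |= 1 << lane
    return bits

mask = [True, False, True, True]
chosen = select_model(mask, [1, 2, 3, 4], [9, 9, 9, 9])
bits = movemask_model(mask)
print(chosen, bin(bits))
```

A common use of the bitmask is an early exit: a result of 0 means no lane matched.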
Conversion
Scalar Casts
| Intrinsic | Description |
|---|---|
| to_f32(x) | Convert to f32 |
| to_f64(x) | Convert to f64 |
| to_i32(x) | Convert to i32 |
| to_i64(x) | Convert to i64 |
let f: f32 = to_f32(i);
let n: i32 = to_i32(x);
Widening Conversions
Widen narrow integer lanes to wider float or integer lanes. Only the first N lanes of the input are consumed, where N is the number of output lanes.
| Intrinsic | Input | Output |
|---|---|---|
| widen_i8_f32x4(v) | i8x16 | f32x4 |
| widen_u8_f32x4(v) | u8x16 | f32x4 |
| widen_i8_f32x8(v) | i8x16 | f32x8 |
| widen_u8_f32x8(v) | u8x16 | f32x8 |
| widen_i8_f32x16(v) | i8x16 | f32x16 |
| widen_u8_f32x16(v) | u8x16 | f32x16 |
| widen_u8_i32x4(v) | u8x16 | i32x4 |
| widen_u8_i32x8(v) | u8x16 | i32x8 |
| widen_u8_i32x16(v) | u8x16 | i32x16 |
let pixels: f32x8 = widen_u8_f32x8(raw_bytes);
Lane-offset variants
The _4, _8, _12 suffixes select which 4 bytes of the input to widen, eliminating the need for a shuffle before widening:
| Intrinsic | Input | Output | Bytes used |
|---|---|---|---|
| widen_u8_f32x4_4(v) | u8x16 | f32x4 | 4-7 |
| widen_u8_f32x4_8(v) | u8x16 | f32x4 | 8-11 |
| widen_u8_f32x4_12(v) | u8x16 | f32x4 | 12-15 |
| widen_i8_f32x4_4(v) | i8x16 | f32x4 | 4-7 |
| widen_i8_f32x4_8(v) | i8x16 | f32x4 | 8-11 |
| widen_i8_f32x4_12(v) | i8x16 | f32x4 | 12-15 |
| widen_u8_i32x4_4(v) | u8x16 | i32x4 | 4-7 |
| widen_u8_i32x4_8(v) | u8x16 | i32x4 | 8-11 |
| widen_u8_i32x4_12(v) | u8x16 | i32x4 | 12-15 |
Process all 16 bytes of a u8x16 as 4 groups of f32x4 without any shuffles:
let f0: f32x4 = widen_u8_f32x4(v) // bytes 0-3
let f1: f32x4 = widen_u8_f32x4_4(v) // bytes 4-7
let f2: f32x4 = widen_u8_f32x4_8(v) // bytes 8-11
let f3: f32x4 = widen_u8_f32x4_12(v) // bytes 12-15
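The four calls above partition the input by byte range. A Python sketch of that mapping follows; `widen_u8_f32x4_model` and its `byte_start` parameter are hypothetical and stand in for the base intrinsic and its _4, _8, _12 variants.

```python
def widen_u8_f32x4_model(v, byte_start=0):
    """Convert 4 consecutive u8 lanes, starting at byte_start, to floats."""
    assert len(v) == 16 and byte_start in (0, 4, 8, 12)
    return [float(x) for x in v[byte_start:byte_start + 4]]

v = list(range(16))
groups = [widen_u8_f32x4_model(v, start) for start in (0, 4, 8, 12)]
print(groups)
```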
Narrowing Conversions
Convert wider lanes to narrower lanes, with clamping and rounding.
| Intrinsic | Input | Output |
|---|---|---|
| narrow_f32x4_i8(v) | f32x4 | i8 (4 bytes) |
let packed = narrow_f32x4_i8(float_pixels);
Multiply-Add Byte Pairs
Multiply unsigned bytes by signed bytes and add adjacent pairs. x86 only.
| Intrinsic | Signature | Description |
|---|---|---|
| maddubs_i16(a, b) | (u8x16, i8x16) -> i16x8 | Multiply and add adjacent pairs to 16-bit |
| maddubs_i32(a, b) | (u8x16, i8x16) -> i32x4 | Multiply and add adjacent quads to 32-bit |
let products: i16x8 = maddubs_i16(unsigned_bytes, signed_weights);
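The pairing in maddubs_i16 can be sketched as a scalar model. This Python version assumes the intrinsic behaves like the x86 pmaddubsw instruction, including its signed saturation of each pair sum to the i16 range; the text above does not name the instruction, so the saturation step is an assumption. `maddubs_i16_model` is a hypothetical name.

```python
def maddubs_i16_model(a, b):
    """out[k] = sat_i16(a[2k] * b[2k] + a[2k+1] * b[2k+1]); a is u8, b is i8."""
    assert len(a) == 16 and len(b) == 16
    out = []
    for k in range(8):
        pair_sum = a[2 * k] * b[2 * k] + a[2 * k + 1] * b[2 * k + 1]
        out.append(max(-32768, min(32767, pair_sum)))  # assumed saturation
    return out

a = [1, 2, 255, 255] + [0] * 12
b = [3, -4, 127, 127] + [0] * 12
products = maddubs_i16_model(a, b)
print(products)
```

The second pair shows why the range matters: 255 * 127 * 2 exceeds the i16 maximum, so the model clamps it.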
vdot_i32
Signed integer dot product: multiplies groups of 4 i8 pairs and sums each group into one i32 lane. ARM only -- requires --dotprod flag (ARMv8.2-A dot product extension). Maps to NEON sdot.
let dot: i32x4 = vdot_i32(activations, weights);
acc = acc .+ vdot_i32(a, b); // accumulate explicitly
| Signature | (i8x16, i8x16) -> i32x4 |
|---|---|
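The grouping of 16 byte products into 4 output lanes can be written as a scalar model. This Python sketch follows the description above (4 i8 pairs per i32 lane); `vdot_i32_model` is a hypothetical name.

```python
def vdot_i32_model(a, b):
    """out[lane] = sum of a[4*lane + j] * b[4*lane + j] for j in 0..3."""
    assert len(a) == 16 and len(b) == 16
    return [sum(a[4 * lane + j] * b[4 * lane + j] for j in range(4))
            for lane in range(4)]

dot = vdot_i32_model([1] * 16, list(range(16)))
print(dot)
```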
I8MM Matrix Multiply
Matrix multiply-accumulate on int8 data. ARM only -- requires --i8mm flag (ARMv8.6-A I8MM extension). Available on Cortex-A78+, Apple M1+.
| Intrinsic | Signature | Description |
|---|---|---|
| smmla_i32(acc, a, b) | (i32x4, i8x16, i8x16) -> i32x4 | Signed x signed |
| ummla_i32(acc, a, b) | (i32x4, u8x16, u8x16) -> i32x4 | Unsigned x unsigned |
| usmmla_i32(acc, a, b) | (i32x4, u8x16, i8x16) -> i32x4 | Unsigned x signed |
The accumulator is the first argument. Each instruction performs a 2x8 x 8x2 matrix multiply and adds the result to the accumulator. Use splat(0) as accumulator for the first iteration.
let zero: i32x4 = splat(0);
let result: i32x4 = smmla_i32(zero, activations, weights);
// accumulate over multiple chunks:
acc = smmla_i32(acc, next_a, next_b);
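The 2x8 by 8x2 multiply can be spelled out lane by lane. The Python sketch below follows the layout of the Arm SMMLA instruction: `a` holds two rows of 8, `b` holds the two columns of the right-hand matrix stored as rows of 8, and the 2x2 result is row-major. It assumes Eä passes operands straight through to the instruction; `smmla_i32_model` is a hypothetical name.

```python
def smmla_i32_model(acc, a, b):
    """out[2*i + j] = acc[2*i + j] + dot(row i of a, row j of b)."""
    assert len(acc) == 4 and len(a) == 16 and len(b) == 16
    out = list(acc)
    for i in range(2):
        for j in range(2):
            out[2 * i + j] += sum(a[8 * i + k] * b[8 * j + k] for k in range(8))
    return out

a = [1] * 8 + [2] * 8
b = [1] * 8 + [3] * 8
first = smmla_i32_model([0, 0, 0, 0], a, b)
second = smmla_i32_model(first, a, b)  # accumulate a second chunk
print(first, second)
```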
Widening Multiply
Multiply narrow integer lanes and produce wider output. ARM only (base NEON). Input types are 64-bit NEON vectors.
| Intrinsic | Signature | Description |
|---|---|---|
| wmul_i16(a, b) | (i8x8, i8x8) -> i16x8 | Signed 8-bit to 16-bit |
| wmul_u16(a, b) | (u8x8, u8x8) -> u16x8 | Unsigned 8-bit to 16-bit |
| wmul_i32(a, b) | (i16x4, i16x4) -> i32x4 | Signed 16-bit to 32-bit |
| wmul_u32(a, b) | (u16x4, u16x4) -> u32x4 | Unsigned 16-bit to 32-bit |
let wide: i16x8 = wmul_i16(bytes_a, bytes_b);
Absolute Difference
Element-wise |a - b|. ARM only (base NEON). Maps to a single instruction (sabd/uabd).
| Intrinsic | Supported Types |
|---|---|
| abs_diff(a, b) | i8x16, u8x16, i16x8, u16x8, i32x4, u32x4 |
let diff: u8x16 = abs_diff(frame_a, frame_b);
Saturating Arithmetic
Addition and subtraction that clamp to the type's min/max instead of wrapping on overflow. Cross-platform (ARM NEON + x86 SSE2).
| Intrinsic | Supported Types |
|---|---|
| sat_add(a, b) | i8x16, u8x16, i16x8, u16x8 |
| sat_sub(a, b) | i8x16, u8x16, i16x8, u16x8 |
Signed vs unsigned saturation is determined by the element type. Both arguments must have the same type.
let bright: u8x16 = sat_add(pixels, boost); // clamps at 255, never wraps
let dark: u8x16 = sat_sub(pixels, reduce); // clamps at 0, never wraps
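Saturation versus wrapping is easy to compare in a scalar model. The Python sketch below takes the element range as explicit bounds; the function names are hypothetical.

```python
def sat_add_model(a, b, lo, hi):
    """Per-lane a + b, clamped to [lo, hi] instead of wrapping."""
    return [max(lo, min(hi, x + y)) for x, y in zip(a, b)]

def sat_sub_model(a, b, lo, hi):
    """Per-lane a - b, clamped to [lo, hi] instead of wrapping."""
    return [max(lo, min(hi, x - y)) for x, y in zip(a, b)]

# u8 lanes use the range [0, 255]; i8 lanes use [-128, 127].
bright = sat_add_model([250, 10], [10, 10], 0, 255)
dark = sat_sub_model([5, 200], [10, 100], 0, 255)
signed = sat_add_model([120, -120], [10, -10], -128, 127)
print(bright, dark, signed)
```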
shuffle_bytes
Byte-level table lookup: each byte in indices selects a byte from table. Cross-platform. x86: SSSE3 pshufb. ARM: NEON tbl. Out-of-range indices (>15) zero the lane on both platforms.
let result: u8x16 = shuffle_bytes(table, indices);
| Signature | (u8x16, u8x16) -> u8x16 |
|---|---|
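The lookup rule, including the zeroing of out-of-range indices stated above, can be modelled in a few lines of Python. `shuffle_bytes_model` is a hypothetical name.

```python
def shuffle_bytes_model(table, indices):
    """result[k] = table[indices[k]] if indices[k] <= 15, else 0."""
    assert len(table) == 16 and len(indices) == 16
    return [table[i] if i <= 15 else 0 for i in indices]

# A nibble-to-hex-digit lookup table; index 200 is out of range and yields 0.
table = [ord(ch) for ch in "0123456789abcdef"]
result = shuffle_bytes_model(table, [15, 0, 10, 200] + [1] * 12)
print(result[:4])
```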
Rounding & Packing
| Intrinsic | Signature | Description | Platform |
|---|---|---|---|
| round_f32x4_i32x4 | (f32x4) -> i32x4 | Round-to-nearest-even. x86: cvtps2dq. ARM: fcvtns. | cross-platform |
| pack_sat_i32x4 | (i32x4, i32x4) -> i16x8 | Saturating narrow. x86: packssdw. ARM: sqxtn. | cross-platform |
| pack_sat_i16x8 | (i16x8, i16x8) -> i8x16 | Saturating narrow. x86: packsswb. ARM: sqxtn. | cross-platform |
| round_f32x8_i32x8 | (f32x8) -> i32x8 | Round-to-nearest-even float to integer. x86: vcvtps2dq (AVX2). | x86-only |
| pack_sat_i32x8 | (i32x8, i32x8) -> i16x16 | Saturating narrow two i32x8 into i16x16. x86: vpackssdw (AVX2). | x86-only |
| pack_sat_i16x16 | (i16x16, i16x16) -> i8x32 | Saturating narrow two i16x16 into i8x32. x86: vpacksswb (AVX2). | x86-only |
pack_sat_i32x8 and pack_sat_i16x16 emit a vpermq fixup after the AVX2 pack to produce sequential output [a0..a7, b0..b7], matching what the 128-bit variants produce.
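Round-to-nearest-even and saturating pack can be combined into a small quantization model. This Python sketch uses hypothetical names; Python's built-in round() also rounds ties to even, so it stands in for the rounding intrinsic, and the pack output lists the lanes of a followed by the lanes of b.

```python
def round_f32x4_i32x4_model(v):
    """Round each lane to the nearest integer, ties to even."""
    return [round(x) for x in v]

def pack_sat_i32x4_model(a, b):
    """Clamp each i32 lane to the i16 range; output is a's lanes, then b's."""
    return [max(-32768, min(32767, x)) for x in a + b]

rounded = round_f32x4_i32x4_model([0.5, 1.5, 2.5, -3.5])
packed = pack_sat_i32x4_model([70000, -70000, 1, 2], [3, 4, 5, 6])
print(rounded, packed)
```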
Debug
println
Print a value to stdout. Accepts scalars (i32, i64, u8, u16, u32, u64, f32, f64, bool), string literals, and vector types. Lowers to C printf. No format strings.
println(42);
println(3.14);
println("hello");
println(my_vector);