All Intrinsics

Complete reference for every built-in function in Eä.

Memory

load

Load a vector from a pointer at a byte offset. The return type is inferred from context.

let v: f32x8 = load(ptr, i);

Typed Scalar Loads

Load a single scalar value from a pointer at a byte offset.

| Intrinsic | Return Type |
|---|---|
| load_f32(ptr, i) | f32 |
| load_f64(ptr, i) | f64 |
| load_i32(ptr, i) | i32 |
| load_i16(ptr, i) | i16 |
| load_i8(ptr, i) | i8 |
| load_u8(ptr, i) | u8 |
| load_u16(ptr, i) | u16 |
| load_u32(ptr, i) | u32 |
| load_u64(ptr, i) | u64 |

let x: f32 = load_f32(ptr, i);

Typed Vector Loads

Load a full vector from a pointer at a byte offset.

| Intrinsic | Return Type |
|---|---|
| load_f32x4(ptr, i) | f32x4 |
| load_f32x8(ptr, i) | f32x8 |
| load_f32x16(ptr, i) | f32x16 |
| load_i32x4(ptr, i) | i32x4 |
| load_i32x8(ptr, i) | i32x8 |
| load_i16x8(ptr, i) | i16x8 |
| load_i8x16(ptr, i) | i8x16 |
| load_u8x16(ptr, i) | u8x16 |
| load_u8x32(ptr, i) | u8x32 |

let v: f32x8 = load_f32x8(data, i * 32);

store

Write a vector to a pointer at a byte offset.

store(out, i, result);

stream_store

Non-temporal store that bypasses the CPU cache. Use for write-only output the kernel will not read back soon. Pairs with prefetch_nta (the read-side non-temporal hint) and fence_nt (the ordering primitive).

stream_store(out, i, v)              // vector form — v is f32xN, i32xN, etc.
stream_store(out, i, scalar_value)   // scalar form (v1.15.0) — i16/u16/i32/u32/i64/u64

Target lowering (verified on LLVM 18.1.8, x86_64 Zen 4 and aarch64 Cortex-A76):

| Width / form | x86 | aarch64 |
|---|---|---|
| Vector 128-bit (f32x4, i32x4, ...) | movntps / movntdq | stnp d, d, [x] (LLVM splits the 128-bit q-register into a d-pair to use the only available aarch64 NT store) |
| Vector 256-bit (AVX2) | vmovntps / vmovntdq | n/a (256-bit not on NEON) |
| Vector 512-bit (AVX-512) | vmovntps / vmovntdq zmm | n/a |
| Scalar i64 / u64 | movnti (SSE2, 64-bit mode) | stnp w, w, [x] (LLVM splits the i64 into a w-pair to use stnp; emits an lsr for the high half) |
| Scalar i32 / u32 | movnti (SSE2) | plain str; NT hint silently dropped |
| Scalar i16 / u16 | regular mov; NT hint silently dropped | plain strh; NT hint silently dropped |

aarch64 has no scalar non-temporal store instruction. The only NT store on aarch64 is stnp (Store Non-temporal Pair), which requires two operands. LLVM 18 honors !nontemporal only when it can synthesize an stnp:

  • 64-bit scalars (i64/u64) self-pair to a w register pair — NT hint preserved.
  • 128-bit vectors self-pair to a d register pair — NT hint preserved.
  • 32-bit and 16-bit scalars have no pair-friendly form — LLVM emits plain str / strh and the NT hint is dropped silently.

For aarch64 NT semantics, prefer 64-bit or wider element widths. The i32/u32 and i16/u16 scalar overloads still type-check and run, but provide no cache-bypass benefit on aarch64; the same is true of i16/u16 on x86 (there is no 16-bit form of movnti). They ship for cross-platform shape symmetry: a single Eä kernel using stream_store compiles and runs on both targets without per-width branching.

Note also that LLVM 18 does not fuse two adjacent stream_store(*mut i32, ...) calls into a single stnp pair on aarch64 — they lower to a regular stp with the NT hint dropped. If you need NT-paired stores on aarch64, write the data as i64 (two 32-bit values packed) or as a vector type.

Alignment contract:

Vector stream_store requires the destination pointer plus byte offset to be aligned to the vector's natural size (16 bytes for 128-bit, 32 bytes for 256-bit, 64 bytes for 512-bit). Misaligned NT vector stores raise a general protection fault on x86. Scalar stream_store requires natural alignment to the scalar size on x86 (4-byte for i32/u32, 8-byte for i64/u64). Callers must provide aligned buffers; Eä does not insert runtime alignment checks.
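
The alignment rule above amounts to a divisibility check on the effective address. The following Python sketch only models that contract; it is not Eä code, and the helper name nt_store_is_aligned and the example buffer address are hypothetical.

```python
def nt_store_is_aligned(base_addr: int, byte_offset: int, width_bytes: int) -> bool:
    """Return True when base_addr + byte_offset is a multiple of width_bytes.

    width_bytes is the natural size of the stored value: 16, 32 or 64 for
    128-, 256- and 512-bit vectors, 4 or 8 for i32/u32 and i64/u64 scalars.
    """
    return (base_addr + byte_offset) % width_bytes == 0


# Hypothetical buffer address (4096-byte aligned), 256-bit (32-byte) vector stores.
buf = 0x7F00_0000_1000
aligned_steps = [nt_store_is_aligned(buf, i * 32, 32) for i in range(4)]

# Offsetting by a single f32 (4 bytes) breaks the 32-byte alignment.
misaligned = nt_store_is_aligned(buf, 4, 32)
```

Because Eä inserts no runtime check, a test like this belongs on the host side, before the kernel is called.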

Ordering contract:

NT stores are weakly ordered on x86 (write-combining memory order). Other cores or subsequent reads in the same thread may observe them out of program order. For cross-thread visibility, the typical pathway is through a host-side synchronization primitive after the kernel returns (pthread_join, rayon::scope, WaitGroup.Wait) — these provide release semantics that flush WC buffers. For intra-kernel ordering (writing then reading the same memory in the same kernel call), use fence_nt() explicitly. Eä does not insert an implicit fence at kernel return.

When NOT to use:

Do not use stream_store for working buffers the same kernel reads back. The non-temporal hint asks the cache to not keep the line; if the kernel reads the data soon afterward, the read goes to DRAM and is slower than a regular store followed by a cache hit. Working-buffer examples that should use plain store:

  • Softmax accumulators (e.g. scores_buf in attention kernels)
  • FWHT scratch arrays (e.g. scratch in JL-projection kernels)
  • Per-iteration partial sums or running statistics

stream_store is appropriate when the destination is a final output passed to the next kernel call, a memory region the current kernel never re-reads, or a buffer that will not be touched again until a downstream consumer pulls it from DRAM later.

fence_nt

Store-store memory barrier providing intra-kernel ordering of preceding stream_store operations. Zero arguments, returns void.

fence_nt()

Target lowering:

| Target | Instruction |
|---|---|
| x86 | sfence (via @llvm.x86.sse.sfence) |
| aarch64 | dmb ishst (via @llvm.aarch64.dmb with operand 10) |

These are the narrowest available barriers for store-only ordering — explicit target intrinsics rather than the IR-level fence release, which would lower to mfence on x86 and dmb ish on aarch64 (both heavier than needed for NT-store ordering).

Semantics:

fence_nt() orders stream_store writes relative to each other and relative to subsequent regular stores. It does not order stores relative to subsequent loads — for a write-then-read-back pattern in the same kernel, a full barrier (mfence on x86, dmb sy on aarch64) is needed instead. Eä does not currently expose a full-barrier intrinsic.

When to use:

Use fence_nt() when the same kernel writes via stream_store to multiple non-overlapping regions in a defined order and a later kernel (or downstream reader) relies on observing those writes in the same order. This is uncommon — most callers don't need it, because cross-thread visibility comes from the host's sync primitive (pthread_join, rayon::scope, WaitGroup.Wait) which already provides release semantics that flush write-combining buffers between threads.

When NOT to use:

  • Between successive stream_store calls to different addresses if no store-ordering requirement exists — NT stores to the same address complete in program order regardless of fences.
  • At the end of a kernel as "insurance" — the caller's sync primitive handles cross-thread fencing more efficiently after the kernel returns.
  • For write-then-read-back patterns: fence_nt() does not provide store-to-load ordering. Use a regular store for the working data, or do not read NT-written data back in the same kernel.

load_masked

Masked vector load. Lanes where the mask is false are not loaded.

let v: f32x8 = load_masked(ptr, i, mask);

store_masked

Masked vector store. Only lanes where the mask is true are written.

store_masked(out, i, value, mask);
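
A common use of a masked store is the tail of an array whose length is not a multiple of the vector width. The Python sketch below is a scalar model of the stated semantics, not Eä code; store_masked_model is a hypothetical name, and for simplicity it indexes a Python list by element rather than by byte offset.

```python
def store_masked_model(mem, i, value, mask):
    """Write value[k] to mem[i + k] only for lanes where mask[k] is true."""
    for k, (lane, keep) in enumerate(zip(value, mask)):
        if keep:
            mem[i + k] = lane


out = [0.0] * 8

# A 5-element tail handled with an 8-lane vector: lanes 5..7 are masked off,
# so the memory behind them is left untouched.
tail_mask = [k < 5 for k in range(8)]
store_masked_model(out, 0, [1.0] * 8, tail_mask)
```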

gather

Load elements from scattered memory addresses using an index vector. x86 only -- not available on ARM.

let v: f32x8 = gather(ptr, indices);

scatter

Store elements to scattered memory addresses using an index vector. AVX-512 only (--avx512 flag required).

scatter(ptr, indices, values);
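
Gather and scatter use the same addressing pattern in opposite directions. The sketch below is a scalar Python model, not Eä code; the function names are hypothetical, indices are assumed to count elements, and the behavior of scatter with duplicate indices is deliberately not modeled.

```python
def gather_model(mem, indices):
    """result[k] = mem[indices[k]]"""
    return [mem[j] for j in indices]


def scatter_model(mem, indices, values):
    """mem[indices[k]] = values[k]"""
    for j, v in zip(indices, values):
        mem[j] = v


table = [10, 11, 12, 13, 14, 15, 16, 17]
picked = gather_model(table, [7, 0, 3, 3])   # a gather may read one index twice

dest = [0] * 8
scatter_model(dest, [1, 4, 6, 2], [100, 200, 300, 400])
```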

prefetch

Issue a read-intent prefetch hint to bring data into all cache levels (T0).

prefetch(ptr, i)

Lowers to prefetcht0 on x86 / prfm pldl1keep on aarch64.

prefetch_write

Issue a write-intent prefetch hint. Signals the cache coherence protocol to acquire the target line in modified state ahead of the store, avoiding a read-for-ownership stall when the store retires. Use on the upcoming write target of memory-bound store-heavy kernels (e.g. chacha20 ciphertext output, dequantize destinations).

prefetch_write(ptr, i)

Lowers to prefetchw on x86 (requires PRFCHW CPUID; falls back to prefetcht0 on older CPUs) / prfm pstl1keep on aarch64.

prefetch_nta

Issue a non-temporal prefetch hint: bring the line into L1 only and mark it for early eviction. Use it for streaming reads that the kernel touches exactly once and that should not pollute L1/L2 (e.g. Q4 dequantize input, large one-pass scans).

prefetch_nta(ptr, i)

Lowers to prefetchnta on x86 / prfm pldl1strm on aarch64.

Prefetch hint summary

| Intrinsic | (rw, locality) | x86 | aarch64 |
|---|---|---|---|
| prefetch | (0, 3) | prefetcht0 | prfm pldl1keep |
| prefetch_write | (1, 3) | prefetchw | prfm pstl1keep |
| prefetch_nta | (0, 0) | prefetchnta | prfm pldl1strm |

All three accept (ptr, integer-offset), return void, and are valid in any expression-statement position inside a function body.

Math

sqrt

Square root. Works on scalar f32/f64 and all float vector types.

let y: f32 = sqrt(x);
let v: f32x8 = sqrt(vec);

rsqrt

Reciprocal square root (approximate). Scalar f32 and f32 vector types.

let y: f32 = rsqrt(x);
let v: f32x8 = rsqrt(vec);

exp

Exponential function. Float types.

let y: f32 = exp(x);

fma

Fused multiply-add: computes a * b + c in a single operation with one rounding step. Works on scalar f32 and all float vector types.

let y: f32 = fma(a, b, c);
let v: f32x8 = fma(va, vb, vc);
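
The single rounding step is what distinguishes fma from a separate multiply and add. The Python sketch below illustrates the difference with double-precision floats (Python has no f32 type); fma_model is a hypothetical helper that computes the exact result with rationals and rounds once, and the inputs are chosen so that the two approaches disagree.

```python
from fractions import Fraction


def fma_model(a: float, b: float, c: float) -> float:
    """Compute a * b + c exactly, then round once to the nearest double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# Two roundings: a * b is exactly 1 - 2**-60, which rounds to 1.0, so the sum is 0.0.
unfused = a * b + c

# One rounding: the exact value -2**-60 is representable and survives.
fused = fma_model(a, b, c)
```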

min

Element-wise minimum. Works on scalar (i32, f32, f64) and vector types.

let m: i32 = min(a, b);
let v: f32x8 = min(va, vb);

max

Element-wise maximum. Works on scalar (i32, f32, f64) and vector types.

let m: i32 = max(a, b);
let v: f32x8 = max(va, vb);

Reduction

Reduce a vector to a single scalar value.

reduce_add

Sum all lanes.

let sum: f32 = reduce_add(v);   // f32x8 -> f32
let sum: i32 = reduce_add(iv);  // i32x8 -> i32

reduce_max

Maximum across all lanes.

let m: f32 = reduce_max(v);

reduce_min

Minimum across all lanes.

let m: f32 = reduce_min(v);

reduce_add_fast

Unordered float reduction. Faster than reduce_add but does not guarantee summation order, so results may differ slightly due to floating-point rounding. Float vectors only.

let sum: f32 = reduce_add_fast(v);
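
Floating-point addition is not associative, which is why an unordered reduction can give a different answer. The Python sketch below compares a left-to-right sum with one possible reassociation (pairwise); it does not claim to reproduce the order the hardware uses, the function names are hypothetical, and the inputs are deliberately extreme so that the effect is visible.

```python
def reduce_add_ordered(lanes):
    """Sum lanes strictly left to right."""
    total = 0.0
    for x in lanes:
        total += x
    return total


def reduce_add_pairwise(lanes):
    """One possible reassociation: sum the two halves recursively."""
    if len(lanes) == 1:
        return lanes[0]
    mid = len(lanes) // 2
    return reduce_add_pairwise(lanes[:mid]) + reduce_add_pairwise(lanes[mid:])


lanes = [2.0**60, 1.0, -(2.0**60), 1.0]   # the mathematical sum is 2.0

ordered = reduce_add_ordered(lanes)     # ((2**60 + 1) - 2**60) + 1
pairwise = reduce_add_pairwise(lanes)   # (2**60 + 1) + (-2**60 + 1)
```

With well-scaled data the difference is typically confined to the last few bits.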

Vector

splat

Broadcast a scalar value to all lanes of a vector. The vector type is inferred from context.

let v: f32x8 = splat(1.0);

shuffle

Compile-time index-driven vector shuffle. Two forms:

Single-source — permute lanes within one vector.

let reversed: f32x4 = shuffle(v, [3, 2, 1, 0])

Each index is in [0, width).

Two-source — pick lanes from two vectors of the same type.

let zipped: f32x8 = shuffle(a, b, [0, 8, 1, 9, 2, 10, 3, 11])

Indices in [0, width) select from a; indices in [width, 2 * width) select lane i - width from b. Common patterns: interleave (lower-half zip), blend (pick lanes by index), and concatenate-permute (lower-half from one source, upper from the other with permutation).
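
The index rules for both forms can be written as a scalar model. This is a Python sketch of the semantics described above, not Eä code; shuffle1_model and shuffle2_model are hypothetical names.

```python
def shuffle1_model(v, indices):
    """Single-source form: every index must be in [0, width)."""
    width = len(v)
    if any(not 0 <= i < width for i in indices):
        raise ValueError("index out of range")
    return [v[i] for i in indices]


def shuffle2_model(a, b, indices):
    """Two-source form: [0, width) selects from a, [width, 2 * width) from b."""
    width = len(a)
    if any(not 0 <= i < 2 * width for i in indices):
        raise ValueError("index out of range")
    return [a[i] if i < width else b[i - width] for i in indices]


reversed4 = shuffle1_model([1.0, 2.0, 3.0, 4.0], [3, 2, 1, 0])

a = [0, 1, 2, 3, 4, 5, 6, 7]
b = [10, 11, 12, 13, 14, 15, 16, 17]
zipped = shuffle2_model(a, b, [0, 8, 1, 9, 2, 10, 3, 11])   # lower-half interleave
```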

permute_runtime

Runtime-indexed permute of an 8-lane vector. result[k] = table[indices[k] & 0x7]. Index lanes are 3-bit-masked by the hardware — the upper 29 bits of each i32 index are ignored. Out-of-range indices wrap; they do not trap.

  • x86: single instruction (vpermps for f32, vpermd for i32). Requires AVX2.
  • ARM (NEON): not supported. Compile-time error pointing at the NEON runtime-permute idiom.

let table: f32x8 = load(matrix_row, 0)   // 6 active + 2 don't-care
let indices: i32x8 = load(types_v, 0)    // values in [0..5]
let strengths: f32x8 = permute_runtime(table, indices)

Signatures: (f32x8, i32x8) -> f32x8 and (i32x8, i32x8) -> i32x8

See also: shuffle (compile-time indices), gather (pointer-indexed), shuffle_bytes (byte-level, 16-byte table, cross-platform).
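
The formula result[k] = table[indices[k] & 0x7] translates directly into a scalar model. The Python sketch below is illustrative only; permute_runtime_model is a hypothetical name and the table values are made up.

```python
def permute_runtime_model(table, indices):
    """result[k] = table[indices[k] & 0x7] for an 8-lane table."""
    assert len(table) == 8 and len(indices) == 8
    return [table[i & 0x7] for i in indices]


table = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 0.0, 0.0]   # 6 active + 2 don't-care

# 9 wraps to lane 1 and -1 wraps to lane 7: only the low 3 bits are used.
indices = [5, 0, 0, 3, 9, -1, 2, 4]
strengths = permute_runtime_model(table, indices)
```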

select

Per-lane conditional select. Where the mask is true, take from a; where false, take from b.

let result: f32x8 = select(mask, a, b);

movemask

Collapse a boolean vector (a comparison result) into a scalar i32 bitmask. Each bit is the sign bit of one lane. x86 only -- not available on ARM.

let bits: i32 = movemask(cmp_result);
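
A bitmask makes it cheap to ask "did any lane match?" or "which lane matched first?". The Python sketch below models the packing under the assumption that lane k maps to bit k (the convention of the x86 movmskps family); movemask_model is a hypothetical name.

```python
def movemask_model(mask):
    """Pack one boolean per lane into an integer, lane k -> bit k (assumed)."""
    bits = 0
    for lane, is_set in enumerate(mask):
        if is_set:
            bits |= 1 << lane
    return bits


cmp_result = [False, False, True, True, False, False, False, True]
bits = movemask_model(cmp_result)                 # lanes 2, 3 and 7 are set

any_match = bits != 0
first_match = (bits & -bits).bit_length() - 1     # index of the lowest set lane
```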

Conversion

Scalar Casts

| Intrinsic | Description |
|---|---|
| to_f32(x) | Convert to f32 |
| to_f64(x) | Convert to f64 |
| to_i32(x) | Convert to i32 |
| to_i64(x) | Convert to i64 |

let f: f32 = to_f32(i);
let n: i32 = to_i32(x);

Widening Conversions

Widen narrow integer lanes to wider float or integer lanes. Only the first N lanes of the input are consumed, where N is the lane count of the output type.

| Intrinsic | Input | Output |
|---|---|---|
| widen_i8_f32x4(v) | i8x16 | f32x4 |
| widen_u8_f32x4(v) | u8x16 | f32x4 |
| widen_i8_f32x8(v) | i8x16 | f32x8 |
| widen_u8_f32x8(v) | u8x16 | f32x8 |
| widen_i8_f32x16(v) | i8x16 | f32x16 |
| widen_u8_f32x16(v) | u8x16 | f32x16 |
| widen_u8_i32x4(v) | u8x16 | i32x4 |
| widen_u8_i32x8(v) | u8x16 | i32x8 |
| widen_u8_i32x16(v) | u8x16 | i32x16 |

let pixels: f32x8 = widen_u8_f32x8(raw_bytes);

Lane-offset variants

The _4, _8, _12 suffixes select which 4 bytes of the input to widen, eliminating the need for a shuffle before widening:

| Intrinsic | Input | Output | Bytes used |
|---|---|---|---|
| widen_u8_f32x4_4(v) | u8x16 | f32x4 | 4-7 |
| widen_u8_f32x4_8(v) | u8x16 | f32x4 | 8-11 |
| widen_u8_f32x4_12(v) | u8x16 | f32x4 | 12-15 |
| widen_i8_f32x4_4(v) | i8x16 | f32x4 | 4-7 |
| widen_i8_f32x4_8(v) | i8x16 | f32x4 | 8-11 |
| widen_i8_f32x4_12(v) | i8x16 | f32x4 | 12-15 |
| widen_u8_i32x4_4(v) | u8x16 | i32x4 | 4-7 |
| widen_u8_i32x4_8(v) | u8x16 | i32x4 | 8-11 |
| widen_u8_i32x4_12(v) | u8x16 | i32x4 | 12-15 |

Process all 16 bytes of a u8x16 as 4 groups of f32x4 without any shuffles:

let f0: f32x4 = widen_u8_f32x4(v)      // bytes 0-3
let f1: f32x4 = widen_u8_f32x4_4(v)    // bytes 4-7
let f2: f32x4 = widen_u8_f32x4_8(v)    // bytes 8-11
let f3: f32x4 = widen_u8_f32x4_12(v)   // bytes 12-15
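
The byte ranges can be checked against a scalar model. In the Python sketch below, widen_u8_f32x4_model is a hypothetical helper whose byte_offset argument stands in for the _4, _8 and _12 suffixes (0 for the unsuffixed form).

```python
def widen_u8_f32x4_model(v, byte_offset=0):
    """Convert bytes [byte_offset, byte_offset + 4) of a 16-byte vector to floats."""
    assert len(v) == 16 and byte_offset in (0, 4, 8, 12)
    return [float(b) for b in v[byte_offset:byte_offset + 4]]


v = list(range(240, 256))              # 16 unsigned bytes: 240..255

f0 = widen_u8_f32x4_model(v)           # bytes 0-3
f1 = widen_u8_f32x4_model(v, 4)        # bytes 4-7
f2 = widen_u8_f32x4_model(v, 8)        # bytes 8-11
f3 = widen_u8_f32x4_model(v, 12)       # bytes 12-15
```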

Narrowing Conversions

Convert wider lanes to narrower lanes, with clamping and rounding.

| Intrinsic | Input | Output |
|---|---|---|
| narrow_f32x4_i8(v) | f32x4 | i8 (4 bytes) |

let packed = narrow_f32x4_i8(float_pixels);

Multiply-Add Byte Pairs

Multiply unsigned bytes by signed bytes and add adjacent pairs. x86 only.

| Intrinsic | Signature | Description |
|---|---|---|
| maddubs_i16(a, b) | (u8x16, i8x16) -> i16x8 | Multiply and add adjacent pairs to 16-bit |
| maddubs_i32(a, b) | (u8x16, i8x16) -> i32x4 | Multiply and add adjacent quads to 32-bit |

let products: i16x8 = maddubs_i16(unsigned_bytes, signed_weights);
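
A scalar model makes the pairing explicit. The Python sketch below assumes maddubs_i16 follows the x86 pmaddubsw instruction, whose pair sum saturates to the i16 range; that mapping is an assumption, not something stated above, and maddubs_i16_model is a hypothetical name.

```python
I16_MIN, I16_MAX = -32768, 32767


def maddubs_i16_model(a, b):
    """out[k] = saturate_i16(a[2k] * b[2k] + a[2k+1] * b[2k+1]).

    a holds unsigned bytes (0..255), b holds signed bytes (-128..127).
    Saturation is assumed from pmaddubsw.
    """
    out = []
    for k in range(0, 16, 2):
        pair_sum = a[k] * b[k] + a[k + 1] * b[k + 1]
        out.append(max(I16_MIN, min(I16_MAX, pair_sum)))
    return out


unsigned_bytes = [255, 255] + [1, 2] * 7
signed_weights = [127, 127] + [3, -4] * 7

# First pair: 255*127 + 255*127 = 64770, which exceeds i16 and clamps to 32767.
# Remaining pairs: 1*3 + 2*(-4) = -5.
products = maddubs_i16_model(unsigned_bytes, signed_weights)
```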

vdot_i32

Signed integer dot product: multiplies groups of 4 i8 pairs and sums each group into one i32 lane. ARM only -- requires --dotprod flag (ARMv8.2-A dot product extension). Maps to NEON sdot.

let dot: i32x4 = vdot_i32(activations, weights);
acc = acc .+ vdot_i32(a, b);  // accumulate explicitly

Signature: (i8x16, i8x16) -> i32x4
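
The grouping into four lanes of four products each can be written as a scalar model. This Python sketch is illustrative only; vdot_i32_model is a hypothetical name and the input values are made up.

```python
def vdot_i32_model(a, b):
    """Lane l of the result is the sum of a[4l + k] * b[4l + k] for k = 0..3."""
    assert len(a) == 16 and len(b) == 16
    return [sum(a[4 * lane + k] * b[4 * lane + k] for k in range(4))
            for lane in range(4)]


activations = [1, 2, 3, 4,  -1, -2, -3, -4,  127, 127, 127, 127,  0, 0, 0, 0]
weights     = [1, 1, 1, 1,   2,  2,  2,  2,  127, 127, 127, 127,  5, 5, 5, 5]

dot = vdot_i32_model(activations, weights)

# Explicit accumulation across chunks, as in the snippet above.
acc = [0, 0, 0, 0]
acc = [x + y for x, y in zip(acc, dot)]
```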

I8MM Matrix Multiply

Matrix multiply-accumulate on int8 data. ARM only -- requires --i8mm flag (ARMv8.6-A I8MM extension). Available on Cortex-A78+, Apple M1+.

| Intrinsic | Signature | Description |
|---|---|---|
| smmla_i32(acc, a, b) | (i32x4, i8x16, i8x16) -> i32x4 | Signed x signed |
| ummla_i32(acc, a, b) | (i32x4, u8x16, u8x16) -> i32x4 | Unsigned x unsigned |
| usmmla_i32(acc, a, b) | (i32x4, u8x16, i8x16) -> i32x4 | Unsigned x signed |

The accumulator is the first argument. Each instruction performs a 2x8 x 8x2 matrix multiply and adds the result to the accumulator. Use splat(0) as accumulator for the first iteration.

let zero: i32x4 = splat(0);
let result: i32x4 = smmla_i32(zero, activations, weights);
// accumulate over multiple chunks:
acc = smmla_i32(acc, next_a, next_b);
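
The 2x8 by 8x2 product can be spelled out as a scalar model. The Python sketch below assumes the operand layout of the Arm SMMLA instruction: a holds two rows of 8 bytes, b holds the two columns of the 8x2 matrix as two runs of 8 bytes, and the i32x4 result is the 2x2 product in row-major order. That layout is an assumption about how the intrinsic maps to the instruction, and smmla_i32_model is a hypothetical name.

```python
def smmla_i32_model(acc, a, b):
    """out[2i + j] = acc[2i + j] + sum_k a[8i + k] * b[8j + k], for i, j in {0, 1}."""
    assert len(acc) == 4 and len(a) == 16 and len(b) == 16
    out = list(acc)
    for i in range(2):          # row of a
        for j in range(2):      # column of the 8x2 matrix, stored as a run of 8 in b
            out[2 * i + j] += sum(a[8 * i + k] * b[8 * j + k] for k in range(8))
    return out


a = [1] * 8 + [2] * 8
b = [1, 2, 3, 4, 5, 6, 7, 8] + [-1] * 8

result = smmla_i32_model([0, 0, 0, 0], a, b)   # first chunk, zero accumulator
acc = smmla_i32_model(result, a, b)            # second chunk accumulates on top
```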

Widening Multiply

Multiply narrow integer lanes and produce wider output. ARM only (base NEON). Input types are 64-bit NEON vectors.

| Intrinsic | Signature | Description |
|---|---|---|
| wmul_i16(a, b) | (i8x8, i8x8) -> i16x8 | Signed 8-bit to 16-bit |
| wmul_u16(a, b) | (u8x8, u8x8) -> u16x8 | Unsigned 8-bit to 16-bit |
| wmul_i32(a, b) | (i16x4, i16x4) -> i32x4 | Signed 16-bit to 32-bit |
| wmul_u32(a, b) | (u16x4, u16x4) -> u32x4 | Unsigned 16-bit to 32-bit |

let wide: i16x8 = wmul_i16(bytes_a, bytes_b);

Absolute Difference

Element-wise |a - b|. ARM only (base NEON). Maps to a single instruction (sabd/uabd).

| Intrinsic | Supported Types |
|---|---|
| abs_diff(a, b) | i8x16, u8x16, i16x8, u16x8, i32x4, u32x4 |

let diff: u8x16 = abs_diff(frame_a, frame_b);

Saturating Arithmetic

Addition and subtraction that clamp to the type's min/max instead of wrapping on overflow. Cross-platform (ARM NEON + x86 SSE2).

| Intrinsic | Supported Types |
|---|---|
| sat_add(a, b) | i8x16, u8x16, i16x8, u16x8 |
| sat_sub(a, b) | i8x16, u8x16, i16x8, u16x8 |

Signed vs unsigned saturation is determined by the element type. Both arguments must have the same type.

let bright: u8x16 = sat_add(pixels, boost);    // clamps at 255, never wraps
let dark: u8x16 = sat_sub(pixels, reduce);     // clamps at 0, never wraps
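
The difference between clamping and wrapping is easiest to see next to each other. The Python sketch below models the u8 and i8 cases; the function names are hypothetical, and the wrapping version is included only for contrast.

```python
def sat_add_model(a, b, lo, hi):
    """Element-wise a + b, clamped to [lo, hi]."""
    return [max(lo, min(hi, x + y)) for x, y in zip(a, b)]


def sat_sub_model(a, b, lo, hi):
    """Element-wise a - b, clamped to [lo, hi]."""
    return [max(lo, min(hi, x - y)) for x, y in zip(a, b)]


def wrapping_add_u8(a, b):
    """Ordinary modular u8 addition, shown for contrast."""
    return [(x + y) & 0xFF for x, y in zip(a, b)]


pixels = [0, 100, 200, 250]
boost = [10, 10, 100, 10]

bright = sat_add_model(pixels, boost, 0, 255)    # unsigned: clamps at 255
dark = sat_sub_model(pixels, boost, 0, 255)      # unsigned: clamps at 0
wrapped = wrapping_add_u8(pixels, boost)         # 200 + 100 wraps to 44

signed = sat_add_model([100, -100], [100, -100], -128, 127)   # i8 range
```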

shuffle_bytes

Byte-level table lookup: each byte in indices selects a byte from table. Cross-platform. x86: SSSE3 pshufb. ARM: NEON tbl. Out-of-range indices (>15) zero the lane on both platforms.

let result: u8x16 = shuffle_bytes(table, indices);

Signature: (u8x16, u8x16) -> u8x16
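
A typical use of a 16-entry byte table is mapping nibbles to ASCII hex digits. The Python sketch below models the lookup as described above, including zeroing for indices greater than 15; shuffle_bytes_model is a hypothetical name.

```python
def shuffle_bytes_model(table, indices):
    """result[k] = table[indices[k]] if indices[k] <= 15, else 0."""
    assert len(table) == 16 and len(indices) == 16
    return [table[i] if i <= 15 else 0 for i in indices]


hex_digits = list(b"0123456789abcdef")   # 16-byte lookup table

# Eight nibbles to translate, followed by eight out-of-range indices.
nibbles = [0xD, 0xE, 0xA, 0xD, 0xB, 0xE, 0xE, 0xF] + [0x80] * 8
result = shuffle_bytes_model(hex_digits, nibbles)
text = bytes(result[:8])
```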

Rounding & Packing

| Intrinsic | Signature | Description | Platform |
|---|---|---|---|
| round_f32x4_i32x4 | (f32x4) -> i32x4 | Round-to-nearest-even. x86: cvtps2dq. ARM: fcvtns. | cross-platform |
| pack_sat_i32x4 | (i32x4, i32x4) -> i16x8 | Saturating narrow. x86: packssdw. ARM: sqxtn. | cross-platform |
| pack_sat_i16x8 | (i16x8, i16x8) -> i8x16 | Saturating narrow. x86: packsswb. ARM: sqxtn. | cross-platform |
| round_f32x8_i32x8 | (f32x8) -> i32x8 | Round-to-nearest-even float to integer. x86: vcvtps2dq (AVX2). | x86-only |
| pack_sat_i32x8 | (i32x8, i32x8) -> i16x16 | Saturating narrow two i32x8 into i16x16. x86: vpackssdw (AVX2). | x86-only |
| pack_sat_i16x16 | (i16x16, i16x16) -> i8x32 | Saturating narrow two i16x16 into i8x32. x86: vpacksswb (AVX2). | x86-only |

pack_sat_i32x8 and pack_sat_i16x16 emit a vpermq fixup after the AVX2 pack to produce sequential output [a0..a7, b0..b7], matching what the 128-bit variants produce.
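
Rounding and packing are usually chained when quantizing floats to narrow integers. The Python sketch below models round-to-nearest-even and the sequential [a..., b...] output order of a saturating i32-to-i16 pack; the function names are hypothetical, and conversion of floats outside the i32 range is not modeled.

```python
I16_MIN, I16_MAX = -32768, 32767


def round_f32_to_i32_model(v):
    """Round each lane to the nearest integer, ties to even (Python's round does this)."""
    return [round(x) for x in v]


def pack_sat_i32_to_i16_model(a, b):
    """Clamp every lane to the i16 range; output is all of a followed by all of b."""
    return [max(I16_MIN, min(I16_MAX, x)) for x in list(a) + list(b)]


rounded = round_f32_to_i32_model([0.5, 1.5, 2.5, 3.5])    # ties go to the even neighbor

a = [70000, -70000, 1, 2]
b = [3, 4, 5, 6]
packed = pack_sat_i32_to_i16_model(a, b)
```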

Debug

println

Print a value to stdout. Accepts scalars (i32, i64, u8, u16, u32, u64, f32, f64, bool), string literals, and vector types. Lowers to C printf. No format strings.

println(42);
println(3.14);
println("hello");
println(my_vector);