Common Intrinsics

This page covers the most frequently used intrinsics. For the complete list, see the Intrinsics Reference.

splat

Broadcast a scalar to all lanes of a vector:

let factor: f32 = 2.5
let vf: f32x8 = splat(factor)
// vf = [2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5]

Works with all vector types. The return type is inferred from the variable's type annotation.

load / store

Load a vector from a pointer at an element offset. Store writes a vector back:

let v: f32x8 = load(data, i)       // read 8 floats starting at data[i]
store(out, i, v)                    // write 8 floats starting at out[i]

Offsets are in elements, not bytes.

Typed loads

When you want to be explicit about the element type:

let v: f32x8 = load_f32(data, i)
let v: f32x4 = load_f32x4(data, i)

These are equivalent to plain load but make the element type visible in the source.

Non-temporal store — bypasses cache, used for write-only output. Vector or scalar value (v1.15.0 added scalar i16/u16/i32/u32/i64/u64). See reference for the full alignment contract, ordering contract, and anti-patterns.

stream_store(out, i, result)         // vector
stream_store(out, i, scalar_value)   // scalar (v1.15.0+)

fma

Fused multiply-add: computes a * b + c in a single instruction with a single rounding (more accurate than separate multiply and add):

// Scalar
let result: f32 = fma(a, b, c)

// Vector
let va: f32x8 = load(a, i)
let vb: f32x8 = load(b, i)
let vc: f32x8 = load(c, i)
let result: f32x8 = fma(va, vb, vc)

Works on f32, f64, and all float vector types. Maps to the hardware FMA instruction.

reduce_add

Sum all lanes of a vector down to a scalar:

let v: f32x8 = load(data, i)
let sum: f32 = reduce_add(v)

Works on all integer and float vector types. Useful for dot products, reductions, and histogram accumulation.

reduce_max / reduce_min

Find the maximum or minimum value across all lanes:

let v: f32x4 = load(data, i)
let biggest: f32 = reduce_max(v)
let smallest: f32 = reduce_min(v)

Works on integer and float vector types.

select

Per-lane conditional: where the mask is true, take from a; where false, take from b:

let mask: f32x8 = a .> b
let result: f32x8 = select(mask, a, b)
// element-wise max(a, b)

Compiles to a blend instruction. No branching. This is how you write branchless SIMD code.

sqrt / rsqrt

Square root and reciprocal square root (1/sqrt):

let x: f32 = 16.0
let root: f32 = sqrt(x)       // 4.0

let v: f32x8 = load(data, i)
let roots: f32x8 = sqrt(v)
let inv_roots: f32x8 = rsqrt(v)

rsqrt uses the fast hardware approximation. Works on f32, f64, and float vector types.

min / max

Element-wise minimum and maximum. Works on both scalars and vectors:

// Scalar
let smaller: f32 = min(a, b)
let larger: f32 = max(a, b)

// Vector
let va: f32x8 = load(data, i)
let vb: f32x8 = load(data, j)
let mins: f32x8 = min(va, vb)
let maxs: f32x8 = max(va, vb)

movemask

Extract comparison results to an integer bitmask. x86 only:

let mask: f32x8 = a .> b
let bits: i32 = movemask(mask)
// bit k is 1 if lane k of a > lane k of b

Useful for branching on SIMD comparison results or counting matching elements. Each bit in the result corresponds to one vector lane.

sat_add / sat_sub

Saturating addition and subtraction. Values clamp to the type's min/max instead of wrapping on overflow. Cross-platform (ARM NEON + x86 SSE2):

let bright: u8x16 = sat_add(pixels, boost)    // clamps at 255
let dark: u8x16 = sat_sub(pixels, reduce)     // clamps at 0

Works with i8x16, u8x16, i16x8, u16x8. Signed vs unsigned saturation is determined by the element type. Both arguments must have the same type.

Masked memory operations

For tail handling, masked loads and stores read/write only the valid lanes:

let rem: i32 = n - i
let v: f32x4 = load_masked(data, i, rem)    // load only 'rem' elements
store_masked(out, i, result, rem)            // store only 'rem' elements

The rem parameter specifies how many elements (starting from lane 0) are valid. Lanes beyond rem are zero-filled on load and not written on store.

Summary table

Intrinsic	Input	Output	Description
`splat(s)`	scalar	vector	Broadcast to all lanes
`load(ptr, i)`	pointer, offset	vector	Load vector from memory
`store(ptr, i, v)`	pointer, offset, vector	void	Write vector to memory
`stream_store(ptr, i, v)`	pointer, offset, vector or scalar	void	Non-temporal write
`fence_nt()`	none	void	Store-store barrier for stream_store ordering
`fma(a, b, c)`	3 values	same type	`a * b + c` fused
`reduce_add(v)`	vector	scalar	Sum all lanes
`reduce_max(v)`	vector	scalar	Max across lanes
`reduce_min(v)`	vector	scalar	Min across lanes
`select(m, a, b)`	mask, 2 vectors	vector	Per-lane conditional
`sqrt(x)`	scalar/vector	same type	Square root
`rsqrt(x)`	scalar/vector	same type	Reciprocal square root
`min(a, b)`	2 values	same type	Element-wise minimum
`max(a, b)`	2 values	same type	Element-wise maximum
`sat_add(a, b)`	2 int vectors	same type	Saturating addition
`sat_sub(a, b)`	2 int vectors	same type	Saturating subtraction
`movemask(m)`	bool vector	i32	Extract lane mask to bits

Eä — Compute Kernel Compiler