Fast Transcendentals: exp_poly_f32
Eä's exp() intrinsic calls libm via llvm.exp.v*f32. LLVM has no
hardware vector exp on any current ISA — not on AVX2, not on AVX-512,
not on NEON, not on SVE. So exp(f32x8) always lowers to a loop of
eight sequential expf calls. The SIMD pattern in the source is
cosmetic: throughput stays at one element per scalar libm call.
In a softmax loop or a tanh-GELU activation, that scalarization is the
entire kernel. exp_poly_f32(f32xN) -> f32xN (new in v1.11.0) replaces
it with a polynomial that stays in SIMD registers — seven to eight FMAs
per lane, no libm call, no scalarization.
The throughput win depends on the baseline:
- Modern x86 with a fast
expfin glibc (AMD Zen 4 / glibc 2.42 on our reference benchmark): 2.93× isolated, 2.60× inside softmax. - Pi 5 (ARM Cortex-A76, glibc
expf, nolibmvec) measured inside a real GELU kernel in Olorin: 2.23× end-to-end ongemma4_gelu, consistent across 64-12288-lane shapes. - Older libm or scalar-only environments without a vectorized
expf: the gap widens — the spec's original "~10×" headline holds against the slowest baselines but does not match every modern glibc.
The win is real on every baseline measured; the magnitude is
environment-sensitive (glibc 2.42's expf is faster than the spec
assumed, which compresses the headline).
The contract
exp_poly_f32 is not a drop-in replacement for exp(). It trades
two things for the throughput:
- Bounded input range. It is defined on
[-50, 50]per lane. Outside that range the polynomial diverges — no NaN or Inf guarantees, no clamping. The caller clamps if their inputs may exceed. - Bounded accuracy. Relative error is ≤ 2⁻¹⁸ (~3.8e-6) inside the
safe range. That's enough for softmax (normalize-by-sum absorbs the
error) and for tanh-GELU activations. It is not enough for
anything that needs full f32 precision — keep
exp()for those.
The name encodes the trade. Reading exp_poly_f32 at a call site, you
know it's a polynomial on f32 lanes — same explicitness style as
pack_sat_i32x8, widen_u8_i32x4, cvt_f16_f32.
Softmax
The canonical use case is numerically-stable softmax over a small fixed window. The maximum subtract keeps inputs in a tight range, exp is the dominant cost, and the final normalize absorbs the polynomial's relative error.
// Stable softmax over 8 lanes using exp_poly_f32.
// reduce_max keeps inputs in a tight range so we never approach the
// [-50, 50] contract edge — the test fixture uses values 1..8 and the
// shifted range is [-7, 0].
export func softmax(x: *f32, out: *mut f32) {
let v: f32x8 = load(x, 0)
let mx: f32 = reduce_max(v)
let mxv: f32x8 = splat(mx)
let shifted: f32x8 = v .- mxv
let ev: f32x8 = exp_poly_f32(shifted)
let s: f32 = reduce_add(ev)
let inv: f32 = 1.0 / s
let invv: f32x8 = splat(inv)
let r: f32x8 = ev .* invv
store(out, 0, r)
}
The full integration test for this shape (test_exp_poly_f32_softmax_integration
in tests/phase14_exp_poly.rs) compares against expf-based reference
softmax with relative tolerance 1e-3 (~2⁻¹⁰) — the polynomial's 2⁻¹⁸
error compounds through reduce_add, but the normalize-by-sum still
collapses it to a tolerance that's well below softmax's typical
numerical envelope.
tanh-GELU via tanh_approx_f32
The GELU activation gelu(x) ≈ 0.5 · x · (1 + tanh(c · (x + 0.044715 · x³)))
with c = sqrt(2/π) is the canonical use of a fast vector tanh.
tanh_approx_f32(f32xN) -> f32xN (new in v1.14.0) lowers to a rational
P(x²) · x / Q(x²) minimax-tuned for f32, branchless and bounded to
~3e-7 absolute error across the body, with internal clamping at ±9
handling saturation:
// GELU(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
export func gelu_tanh(input: *f32, output: *mut f32) {
let x: f32x8 = load(input, 0)
// inner = sqrt(2/pi) * (x + 0.044715 * x^3)
let c0: f32x8 = splat(0.7978845608)
let c1: f32x8 = splat(0.044715)
let xsq: f32x8 = x .* x
let xcube: f32x8 = xsq .* x
let inner: f32x8 = c0 .* fma(c1, xcube, x)
// tanh_approx_f32 clamps internally; no defensive bound needed.
let t: f32x8 = tanh_approx_f32(inner)
// gelu = 0.5 * x * (1 + t)
let half: f32x8 = splat(0.5)
let one: f32x8 = splat(1.0)
let result: f32x8 = half .* x .* (one .+ t)
store(output, 0, result)
}
This replaces the earlier (exp_poly_f32(2x) - 1) / (exp_poly_f32(2x) + 1)
algebraic-identity workaround. The workaround suffers catastrophic
cancellation in the numerator for small |x| (e^{2·1e-3} - 1 loses
~3 digits of precision in f32), degrading to ~10⁻³ relative error near
zero where GELU is most sensitive. The dedicated intrinsic skips both
that pitfall and the second arithmetic chain, while costing one fdiv
instead of one fdiv plus the exp_poly_f32 polynomial.
Safely clamping in-range
The contract is bounded on [-50, 50] per lane, so the safest defensive
pattern is the min-then-max chain shown above:
let hi: f32x8 = splat(50.0)
let lo: f32x8 = splat(-50.0)
let safe: f32x8 = min(max(unbounded, lo), hi)
let e: f32x8 = exp_poly_f32(safe)
Both min and max lower to single NEON fminnm / fmaxnm (or x86
vminps / vmaxps) instructions, so the clamp is two FMAs of overhead
on a polynomial that's already 7-8 FMAs per lane.
If you can prove statically that your input range is bounded — softmax
after a reduce_max subtract, for instance — skip the clamp. The
guarantee is mathematical, not enforced; the polynomial does not check.
When to keep exp()
exp_poly_f32 is the right tool for SIMD hot paths where 2⁻¹⁸ relative
error is acceptable. Examples: softmax, GELU, attention scoring,
exponential decay in beam search.
Keep exp() (which calls libm) for:
- Scalar code, or vector code where the SIMD scalarization isn't visible (rare loops, one-shot computations).
- Code that needs full f32 precision — physics, statistics, log-sum-exp inside an accuracy-critical reduction.
- Inputs that may exceed
[-50, 50]and a libm-correct NaN/Inf is required.
See also
- Test fixtures (accuracy, range, softmax integration):
tests/phase14_exp_poly.rs