Text Processing
Text processing benefits from SIMD when you can skip large regions of uninteresting bytes. The key pattern is chunk-skip: load 32 bytes at a time, check if any byte matches a target character, and skip the entire chunk if none match.
The chunk-skip pattern
Most bytes in a text file are not the character you are looking for. A newline scanner over a 1 MB file might find 10,000 newlines among 1,000,000 bytes -- 99% of chunks contain no match and can be skipped with a single comparison and branch.
export func count_newlines(data: *u8, n: i32) -> i32 {
let newline: u8x32 = splat_u8x32(10)
let mut count: i32 = 0
let mut i: i32 = 0
while i < n - 31 {
let chunk: u8x32 = load_u8x32(data, i)
let matches: u8x32 = chunk .== newline
let mask: i32 = movemask(matches)
if mask != 0 {
let mut j: i32 = 0
while j < 32 {
if load_u8(data, i + j) == 10 {
count = count + 1
}
j = j + 1
}
}
i = i + 32
}
let mut k: i32 = i
while k < n {
if load_u8(data, k) == 10 {
count = count + 1
}
k = k + 1
}
count
}
The structure:
- Load 32 bytes as
u8x32 - Compare with
splat_u8x32(target)using.==to get a match vector movemask()collapses the vector comparison to a singlei32bitmask- If mask is 0: no matches in this chunk, skip ahead 32 bytes (fast path)
- If mask is nonzero: scan the 32 bytes individually (slow path, but rare)
- Scalar tail: handle remaining bytes that do not fill a full 32-byte chunk
The fast path processes 32 bytes in about 3 instructions. On typical text, 90-99% of chunks take the fast path.
Why not extract individual bit positions?
Ea does not have shift operators (<<, >>), so you cannot extract individual bit positions from the movemask result. Instead, when a chunk contains matches, fall back to byte-by-byte scanning within that 32-byte window. This is still fast because:
- Hot chunks are rare (most text is not your target character)
- The 32-byte scan is a tight loop that fits in L1 cache
- The overall speedup comes from skipping cold chunks, not from optimizing hot ones
ARM portability
movemask is an x86-only intrinsic (it maps to vpmovmskb). On ARM/NEON, there is no equivalent single instruction. For portable kernels, write separate x86 and ARM versions:
- x86: use
u8x32+movemaskas shown above - ARM: use
u8x16with scalar fallback, or structure the algorithm to avoid needing a bitmask
See ARM / NEON for details on architecture-specific intrinsics.
CSV parsing
CSV parsing combines the chunk-skip pattern with state tracking. You need to find commas and newlines while respecting quoted fields -- a comma inside quotes is not a delimiter.
The approach:
- Scan for quote characters (
") using chunk-skip to track quote state - Scan for delimiters (
,and\n) using chunk-skip, filtering by quote state - Parse numeric fields from the delimited regions
This is compute-bound because each byte potentially involves comparison against multiple target characters plus state logic -- exactly the kind of branching work that NumPy cannot vectorize.
For a complete CSV statistics package built on this pattern, see eastat.
Tips
- Buffer alignment: SIMD loads from unaligned addresses work but may be slower on older hardware. If you control the buffer, align to 32 bytes.
- Tail handling: always include a scalar loop for the last
n % 32bytes. Forgetting the tail is the most common bug in SIMD text processing. - Multi-character search: to find any of several characters (e.g.,
<,>,&for HTML), do multiple.==comparisons and combine with.|before the movemask.