Text Processing

Text processing benefits from SIMD when you can skip large regions of uninteresting bytes. The key pattern is chunk-skip: load 32 bytes at a time, check if any byte matches a target character, and skip the entire chunk if none match.

The chunk-skip pattern

Most bytes in a text file are not the character you are looking for. A newline scanner over a 1 MB file might find 10,000 newlines among 1,000,000 bytes -- 99% of chunks contain no match and can be skipped with a single comparison and branch.

export func count_newlines(data: *u8, n: i32) -> i32 {
    let newline: u8x32 = splat_u8x32(10)
    let mut count: i32 = 0
    let mut i: i32 = 0
    while i < n - 31 {
        let chunk: u8x32 = load_u8x32(data, i)
        let matches: u8x32 = chunk .== newline
        let mask: i32 = movemask(matches)
        if mask != 0 {
            let mut j: i32 = 0
            while j < 32 {
                if load_u8(data, i + j) == 10 {
                    count = count + 1
                }
                j = j + 1
            }
        }
        i = i + 32
    }
    let mut k: i32 = i
    while k < n {
        if load_u8(data, k) == 10 {
            count = count + 1
        }
        k = k + 1
    }
    count
}

The structure:

Load 32 bytes as u8x32
Compare with splat_u8x32(target) using .== to get a match vector
movemask() collapses the vector comparison to a single i32 bitmask
If mask is 0: no matches in this chunk, skip ahead 32 bytes (fast path)
If mask is nonzero: scan the 32 bytes individually (slow path, but rare)
Scalar tail: handle remaining bytes that do not fill a full 32-byte chunk

The fast path processes 32 bytes in about 3 instructions. On typical text, 90-99% of chunks take the fast path.

Why not extract individual bit positions?

Eä has element-wise vector shift operators (.<<, .>>), but movemask returns a scalar i32 bitmask. Extracting individual bit positions from a scalar requires scalar bitwise operations (&, >>) which work on i32 values. When a chunk contains matches, fall back to byte-by-byte scanning within that 32-byte window. This is still fast because:

Hot chunks are rare (most text is not your target character)
The 32-byte scan is a tight loop that fits in L1 cache
The overall speedup comes from skipping cold chunks, not from optimizing hot ones

ARM portability

movemask is an x86-only intrinsic (it maps to vpmovmskb). On ARM/NEON, there is no equivalent single instruction. For portable kernels, write separate x86 and ARM versions:

x86: use u8x32 + movemask as shown above
ARM: use u8x16 with scalar fallback, or structure the algorithm to avoid needing a bitmask

See ARM / NEON for details on architecture-specific intrinsics.

CSV parsing

CSV parsing combines the chunk-skip pattern with state tracking. You need to find commas and newlines while respecting quoted fields -- a comma inside quotes is not a delimiter.

The approach:

Scan for quote characters (") using chunk-skip to track quote state
Scan for delimiters (, and \n) using chunk-skip, filtering by quote state
Parse numeric fields from the delimited regions

This is compute-bound because each byte potentially involves comparison against multiple target characters plus state logic -- exactly the kind of branching work that NumPy cannot vectorize.

For a complete CSV statistics package built on this pattern, see eastat.

Tips

Buffer alignment: SIMD loads from unaligned addresses work but may be slower on older hardware. If you control the buffer, align to 32 bytes.
Tail handling: always include a scalar loop for the last n % 32 bytes. Forgetting the tail is the most common bug in SIMD text processing.
Multi-character search: to find any of several characters (e.g., <, >, & for HTML), do multiple .== comparisons and combine with .| before the movemask.