Searching for text patterns

Pattern matching in text is a solved category. Reach for the library; the library is faster than anything your team will write.

Searching for patterns in text — substrings, regular expressions, dictionaries of keywords — is one of the most engineered problems in computing. Industrial implementations scan gigabytes of text per second on a single CPU core. The libraries handle every edge case — Unicode, multi-pattern, anchored, case-insensitive, with SIMD acceleration. Rust's aho-corasick library handles multi-matching" data-term="pattern matching">pattern matching (finding any of thousands of keywords in a body of text). Rust's regex library handles regular expressions and is the fastest in widespread use. Rust's memchr library handles single-byte and short-pattern scanning with SIMD. Your team should never write their own pattern matcher. The only flavor that is genuinely hard is regular expressions with backreferences (PCRE-style) — those can be made to run exponentially slow. Rust's regex library deliberately excludes them.

aho-corasick compiles a list of keywords into a single automaton, then scans a body of text in one pass. Ideal for content moderation, spam filtering, intrusion detection — anywhere you need many needles in one haystack.

use aho_corasick::AhoCorasick;

fn flag(text: &str, banned: &[&str]) -> Vec<(usize, String)> {
    let ac = AhoCorasick::new(banned).expect("valid patterns");
    ac.find_iter(text)
        .map(|m| (m.start(), banned[m.pattern()].to_string()))
        .collect()
}

&[&str] is a slice of string slices, taken by reference. Three layers nested together: &str is one string slice (a view into bytes, page 4), [&str] is a sequence of those, and the outer & borrows the sequence. Reading it inside-out: "a borrowed view of a sequence of borrowed string views." This is the most permissive shape for "a list of read-only strings I want to scan." Callers can pass an array literal directly (&["foo", "bar"]), or borrow a Vec<&str> (&patterns). Taking &[&str] instead of Vec<String> means the function does not demand ownership of the patterns and the caller does not have to allocate. It is the page-14 rule — parameters borrow, not own — applied to a list.

// &[&str] decoded:
//
//      & [ &str ]
//      ─ ─ ────
//      │ │  │
//      │ │  └── &str — one string slice (borrowed view of bytes)
//      │ └──── [&str] — a sequence of string slices
//      └────── & — a borrow of that sequence
//
// "A borrowed view of a sequence of borrowed string views."

fn flag(text: &str, banned: &[&str]) -> Vec<(usize, String)> { /* ... */ }

// Every one of these calls works:
flag(input, &["secret", "password"]);              // array literal — most ergonomic
flag(input, &vec!["secret", "password"]);          // Vec<&str> coerces to &[&str]

let patterns: Vec<&str> = lines.iter().map(|l| l.trim()).collect();
flag(input, &patterns);                            // borrow an existing Vec<&str>

// Contrast with a shape that forces every caller to allocate:
fn flag_owned(text: &str, banned: Vec<String>) { /* ... */ }
flag_owned(input, vec!["secret".to_string(),
                       "password".to_string()]);   // .to_string() per literal — wasteful

// One subtle catch — &[String] does NOT auto-coerce to &[&str]:
let owned: Vec<String> = vec!["secret".into(), "password".into()];
flag(input, &owned);                                // ✗ type mismatch
//
// Fix: convert with .iter().map(String::as_str).collect::<Vec<&str>>().
// Or accept the most flexible shape: &[impl AsRef<str>].
// For most flagging APIs, &[&str] is the right default — literals are
// the common case.

The return type is Vec<(usize, String)> — each tuple holds the byte offset where the match started and a fresh String copy of the banned pattern. Why String instead of &str? Returning &str would work but only with an explicit lifetime tying the return value to the banned slice: the result would point into the input, and the caller would have to keep banned alive for as long as they held the result. Returning String allocates a small fresh copy per match (a handful of bytes each, usually cheap), and the result is self-contained — callers can store it, drop banned, send the result across threads, serialize it to JSON. This is the page-14 rule in action: return new text the caller should own (String); return a view tied to inputs (&str with a lifetime). When the lifetime would make the API noticeably worse for downstream consumers, String is the right tradeoff.

// What we have — self-contained result:
fn flag(text: &str, banned: &[&str]) -> Vec<(usize, String)> {
    let ac = AhoCorasick::new(banned).expect("valid patterns");
    ac.find_iter(text)
        .map(|m| (m.start(), banned[m.pattern()].to_string()))
        //                                       └─ allocates a fresh String per match
        .collect()
}

// Zero-allocation alternative — &str views tied to banned's lifetime:
fn flag_borrowed<'a>(text: &str, banned: &'a [&'a str])
    -> Vec<(usize, &'a str)>
{
    let ac = AhoCorasick::new(banned).expect("valid patterns");
    ac.find_iter(text)
        .map(|m| (m.start(), banned[m.pattern()]))
        //                                  └─ no allocation; lifetime tied to banned
        .collect()
}

// Tradeoffs:
//
//   Vec<(usize, String)>     self-contained.  caller can drop banned.
//                            JSON-serialisable, thread-shippable, store anywhere.
//                            one small allocation per match.
//
//   Vec<(usize, &'a str)>    zero allocation per match.
//                            caller must keep banned alive while result is in use.
//                            every consumer downstream has to think about lifetimes.
//
// For a flagging API where results get logged, stored, or sent to a worker,
// the String variant is the friendlier shape.  The cost is a few bytes
// copied per hit — usually invisible.

Worth pausing on the word borrow, since it has been doing a lot of work for the last ten pages. It is a deliberate metaphor in Rust, not just casual jargon. Every value in a Rust program has exactly one owner; when the owner goes out of scope, the value is dropped — memory freed, file handles closed, locks released. A borrow is what the English word suggests: temporary access to a value without transferring ownership. You hand someone your book; they read it; they give it back; you still own it. Two flavors: a shared borrow &T lets any number of readers see the value at once but lets nobody change it (many people reading the same poster). An exclusive borrow &mut T gives one writer permission to mutate the value, and while it exists nobody else — not even the original owner — can read or write that value (one person editing a document). The compiler enforces this "shared XOR mutable" rule at compile time, which is what the borrow checker does. "Fighting the borrow checker" is the universal Rust-learner experience of running headlong into the rule. The rules are stricter than other languages. The payoff is that at runtime your program has no data races, no use-after-free, no iterator-invalidation crashes — every one of those bugs is turned into a compile error.

// ── Ownership: every value has exactly one owner ──────────────
let s = String::from("hello");
let t = s;                       // ownership MOVES from s to t
// println!("{}", s);            // error: borrow of moved value `s`


// ── Borrow: temporary access, no ownership transfer ───────────
let s = String::from("hello");
let r = &s;                      // r BORROWS s — s still owns the bytes
println!("{} {}", s, r);         // both readable; the data has not moved


// ── Two flavors ───────────────────────────────────────────────

// 1. Shared borrows (&T) — many readers, no writers.
let s = String::from("hello");
let a = &s;
let b = &s;                      // fine — many shared borrows can coexist
println!("{} {} {}", s, a, b);

// 2. Exclusive borrows (&mut T) — one writer, no other readers.
let mut s = String::from("hello");
let m = &mut s;
m.push_str(", world");           // can mutate through &mut
// println!("{}", s);            // error: cannot use s while m exists
println!("{}", m);               // ok — m is the one allowed reference


// ── The rule the borrow checker enforces ─────────────────────
//
// At any point in the program, for any value, you may have EITHER:
//
//      • any number of shared borrows (&T)        OR
//      • exactly one exclusive borrow (&mut T)
//
// Never both.  Never two &mut at once.
//
let mut s = String::from("hello");
let a = &s;                      // shared borrow
let b = &mut s;                  // error: cannot also borrow as mutable
println!("{}", a);


// ── "Fighting the borrow checker" — and how not to ───────────
//
// The fix is never to disable the checker.  The fix is to restructure
// so the lifetimes are clearer:
//
//   • finish using the shared borrow, THEN take the mutable borrow
//   • clone the data if two independent owners are actually what you want
//   • split the data so two halves can be borrowed independently
//   • thread the data through one owner instead of sharing
//
// What you buy by accepting the rules — at runtime, Rust programs do not have:
//   • data races            (two threads writing the same value)
//   • use-after-free        (reading freed memory)
//   • iterator invalidation (modifying a collection while iterating it)
//
// All three are turned into compile errors.  The borrow checker is
// the part of the compiler that does that work.

In plain terms

Pattern matching in text is one of those problems where the libraries are so far ahead of what an in-house team will produce that there is no business case to write your own. Modern implementations of substring search, regex matching, and multi-pattern keyword scanning run at gigabytes per second on a single core, with hand-tuned SIMD assembly for the inner loop.

The standard library or the regex library handles the basics. When you need multi-pattern matching — looking for any of thousands of keywords in a body of text, as in spam filtering, content moderation, or intrusion detection — the Aho-Corasick algorithm is the right tool, and Rust's aho-corasick library is industrial-grade. When you need single-pattern search at maximum speed, memchr exploits SIMD instructions to scan a hundred bytes per cycle.

The one trap to know about: regular expressions with backreferences. PCRE-style regular expressions (the kind in Perl, PHP, JavaScript) allow patterns to refer back to previously matched substrings, which lets you express things like "the same word twice in a row." Patterns with backreferences are NP-hard in the worst case, and a malicious input can cause them to run exponentially slow. This is the source of regex-based denial-of-service attacks.

Rust's regex library deliberately omits backreferences. This is a feature, not a bug. Every pattern is guaranteed to match in time proportional to the input length. If your team's existing regex usage requires backreferences, the right path is to model the requirement differently — usually parse the text first, then check the structural property — rather than to use a different regex engine.

For business planning, pattern matching is one of the very few "you should never build your own" categories. The libraries are mature, fast, well-maintained, and free. Tickets that involve searching text should be sized in hours, not days.

When the requirement looks bigger than that, the work is somewhere else — in defining what to search for, in scaling to many documents, in serving results — not in the pattern matcher itself.