N or P
Contents Glossary
18

How different are these two strings

Spell check, diff tools, DNA matching, fuzzy search — all the same algorithm, all polynomial.

The edit distance between two strings is the number of character insertions, deletions, or substitutions to transform one into the other. It is the standard measure of string similarity and the foundation of spell checkers, diff tools, fuzzy search, plagiarism detection, and DNA sequence alignment. The algorithm is a textbook dynamic programming routine that runs in time proportional to the product of the two string lengths. Rust's strsim library implements Levenshtein distance, Damerau-Levenshtein (adds transposition), Jaro-Winkler (weights early matches more), and several other common variants as one-line calls. For very long sequences (DNA), the work scales as a square of the length and becomes expensive — specialized libraries with diagonal-band optimization handle it. For short strings (names, addresses, search queries), the work is microseconds and your team should reach for the library.

  1. 01
    strsim exposes every common edit-distance variant as a one-liner. Levenshtein for spell check, Damerau-Levenshtein for typos that include swaps, Jaro-Winkler for name matching where the first few characters matter most.
    use strsim::{damerau_levenshtein, jaro_winkler, levenshtein};
    
    fn best_match<'a>(query: &str, candidates: &'a [&'a str]) -> Option<&'a str> {
        candidates
            .iter()
            .min_by_key(|c| levenshtein(query, c))
            .copied()
    }
    
    fn name_score(a: &str, b: &str) -> f64 {
        jaro_winkler(a, b) // 1.0 = identical, 0.0 = no shared prefix
    }
    
    fn typo_distance(a: &str, b: &str) -> usize {
        damerau_levenshtein(a, b) // counts adjacent swaps as one edit
    }
  2. 02
    Page 4 introduced 'static — the special "lives forever" lifetime. Page 18 introduces a named lifetime parameter: <'a>. The angle brackets are the same place where type generics live (<T>, <K, V>), and lifetime parameters are exactly that — a kind of generic parameter, on the lifetime axis instead of the type axis. Where a type generic says "this function works for any type T," a lifetime generic says "this function works for any lifetime 'a, and here is how the lifetimes of my parameters relate to each other and to my return value." Convention puts lifetimes first inside the angle brackets: fn foo<'a, 'b, T, U>(...). The compiler infers concrete lifetimes at each call site from the references the caller passes in, exactly the way it infers concrete types for type generics.
    // Type generic — caller picks (or compiler infers) the type:
    fn first<T>(v: &[T]) -> Option<&T> {
        v.first()
    }
    let n = first(&[1, 2, 3]);                  // T = i32
    let s = first(&["a", "b"]);                 // T = &str
    
    // Lifetime generic — caller picks (or compiler infers) the lifetime:
    fn longest<'a>(a: &'a str, b: &'a str) -> &'a str {
        if a.len() > b.len() { a } else { b }
    }
    // The returned reference lives at most as long as the shorter of a, b.
    
    // Both kinds in one signature — lifetimes first, types second:
    fn pick<'a, T>(haystack: &'a [T], i: usize) -> &'a T {
        &haystack[i]
    }
    
    // Same machinery, different axis:
    //
    //                      type generic        lifetime generic
    //   declaration        <T>                 <'a>
    //   used in            parameter / return  parameter / return
    //   chosen by          caller / inference  caller / inference
    //   meaning            "any type T"        "any lifetime 'a"
    //
    // Lifetimes do not exist at runtime — the compiler erases them.
    // They are entirely about compile-time relationships between borrows.
  3. 03
    candidates: &'a [&'a str] reuses the same lifetime label 'a in two positions, which ties them together. The outer slice and the inner &str views must both live at least as long as 'a. Why unify them? Because the function returns Option<&'a str> — a reference whose lifetime is 'a. The compiler needs to know the returned string is valid for at least that long, so the caller knows how long they can hang on to it. If the slice itself or any of the strings inside it could disappear sooner than 'a, the returned reference would dangle. Tying both to one lifetime says "everything the result might point into shares the same lifetime budget." An alternative — &'a [&'b str] — would use two separate lifetimes; more flexible but more to manage. Unifying is the right default; revisit only if a real caller is blocked by the constraint.
    fn best_match<'a>(query: &str, candidates: &'a [&'a str]) -> Option<&'a str>
    //             ──                ───   ────                          ───
    //             declare 'a        slice  inner                        return
    //                               lives  &strs                        reference
    //                               'a     live 'a                      lives 'a
    
    // What the annotations promise the compiler:
    //   • the slice `candidates` is valid for at least 'a
    //   • each &str inside the slice is valid for at least 'a
    //   • the returned &str is valid for exactly 'a — caller can hold it that long
    
    // Two-lifetime alternative — more flexible, more to manage:
    fn best_match_2<'slice, 'item>(
        query: &str,
        candidates: &'slice [&'item str],
    ) -> Option<&'item str>            // result tied to the items, not the slice
    where 'slice: 'item {              // and slice must outlive items
        candidates.iter().min_by_key(|c| levenshtein(query, c)).copied()
    }
    
    // One-lifetime version (the page 18 code) is the usual default — pick the
    // stricter shape that's still ergonomic.  Reach for two only when a real
    // caller needs the slice and items to come from different scopes.
    
    // Note that `query: &str` has no annotation.  Lifetime elision (page 4)
    // fills in an anonymous one — the function does not return anything tied
    // to query, so its lifetime is independent and need not be named.
  4. 04
    The last line of best_match calls .copied(). The name resembles the Copy trait introduced on page 15, and the two are related but not the same thing. Copy is a property a type can implement, saying "bitwise duplication is safe for me." .copied() is an iterator method that converts an iterator of &T into an iterator of T by copying each item — and it only compiles when the items being copied implement the Copy trait. So the method uses the trait to do its job. For types that need a deeper duplication (String, Vec<T>), the sibling method is .cloned(), which requires only the Clone trait and may allocate. The other unfamiliar type in this step is &&str. Read right-to-left: str is the unsized string-of-bytes type, &str is a reference to one (the type we usually call a "string slice"), and &&str is a reference to a &str — two reference layers stacked. This is not a reference to a slice (those look like &[T]); the two ampersands here are just & applied to &str. The double layer arises because .iter() always yields references to the elements of a collection, and here the elements are themselves references — candidates is &[&str], so iterating gives &&str. .copied() strips one layer back, turning the iterator's output from &&str to &str.
    // Copy is a TRAIT — a property of a type.
    //   i32, bool, char, f64, &T    →  implements Copy
    //   String, Vec<T>, Box<T>      →  does NOT implement Copy
    
    // .copied() is a METHOD on Iterator<Item = &T> where T: Copy.
    // It dereferences and bit-copies each item, yielding an iterator of T.
    
    let v: Vec<i32> = vec![1, 2, 3];
    
    let refs:   Vec<&i32>  = v.iter().collect();              // iter of &i32
    let copies: Vec<i32>   = v.iter().copied().collect();     // iter of i32
    
    // Why we use it on page 18 — one fewer layer of reference for the caller.
    // candidates has type &[&str], so each element is already a &str.
    // .iter() yields references TO each element, stacking another & on top:
    //
    //   &&str  =  &(&str)         a reference to a string-slice reference.
    //                              Read right-to-left.  Not a slice reference
    //                              (those look like &[T]) — two & layers stacked.
    //
    //   candidates                                      &[&str]
    //   .iter()                                         iter of &&str
    //   .min_by_key(|c| levenshtein(query, c))          Option<&&str>   ← two & layers
    //   .copied()                                       Option<&str>    ← one stripped
    
    // The siblings:
    //   .copied()   →  requires Copy.   Pure bit-copy; never allocates.
    //   .cloned()   →  requires Clone.  May allocate (e.g. cloning a String).
    
    let strings: Vec<String> = vec!["a".into(), "b".into()];
    let owned: Vec<String> = strings.iter().cloned().collect();   // allocates each clone
    // let owned: Vec<String> = strings.iter().copied().collect(); // error: String: !Copy
    
    // Short version:
    //   Copy        →  a trait;  a property of a type.
    //   .copied()   →  a method; uses the Copy trait to turn &T into T.
    // They live next to each other and they are not the same thing.
Wagner, R., Fischer, M. (1974) The String-to-String Correction Problem. Levenshtein, V. (1965). Source →
In plain terms

Edit distance is the most useful string-similarity metric in business software. Spell checkers use it to find the most likely correction for a typo. Diff tools use it to find the minimal change set between two file versions. Fuzzy search uses it to match queries to records when the user does not type the exact spelling. Address normalization uses it to detect that "123 Main Street" and "123 Main St" are the same address. Bioinformatics uses it (with custom scoring) to align DNA sequences.

The algorithm is one of the cleanest examples of dynamic programming. You fill in a table where each cell represents the edit distance between prefixes of the two strings, building up from the empty prefix. The table fills in linear time per cell, so the total work is the product of the two string lengths. For two strings of a hundred characters each, this is ten thousand operations — microseconds. For two strings of a million characters each, it is a trillion operations and becomes a research problem.

For short strings — names, addresses, query terms, ticket titles — the work is so cheap that you should compute distances on demand. Rust's strsim library handles the standard variants in a one-line call. Levenshtein for basic edit distance. Damerau-Levenshtein for typos that involve swapping adjacent characters (a common kind of error). Jaro-Winkler for cases where matching the beginning of a string matters more (good for name matching).

The variants matter because the right metric depends on the kind of error you are catching. Jaro-Winkler is the standard in record linkage and customer-database deduplication. Levenshtein is the right choice for spell checking. Damerau-Levenshtein for keyboard typos. Hamming distance (no insertions or deletions) for fixed-length codes.

The teaching for planning: any feature that involves "how similar are these two strings" is a library call away. When the requirement is more elaborate — matching across millions of records, with fuzzy joins and blocking — the work is in the data plumbing, not the similarity computation.

The similarity computation is solved. Reach for it and move on.

← 17 Searching for text patterns
18/39
19 Optimization with continuous numbers →