Reducibility
Glossary Contents
← Ch XIV
Chapter XV Hitting Set
Ch XVI →
  1. 43 — Hit them all
  2. 44 — Greedy in Perl
  3. 45 — Test suite minimization
44

Greedy in Perl

Perl is at home in the world of "match this pattern in that file." Hitting set on a corpus of regex requirements is a Perl-shaped problem.

  1. 01
    Read the family from stdin: each line a set, members space-separated. Perl's hash-of-hashes makes the count-by-element pattern a one-liner.
    #!/usr/bin/perl
    use strict;
    use warnings;
    
    my @family;
    while (my $line = <STDIN>) {
        chomp $line;
        my %set = map { $_ => 1 } split /\s+/, $line;
        push @family, \%set;
    }
  2. 02
    Greedy step: count, across all uncovered sets, how many each element hits. Pick the element with the highest count. This is the classic ln(n)-approximation, the same one as for set cover.
    sub pick_best {
        my @open = @_;
        my %count;
        for my $set (@open) {
            $count{$_}++ for keys %$set;
        }
        my ($best) = sort { $count{$b} <=> $count{$a} } keys %count;
        return $best;
    }
  3. 03
    Loop: while uncovered sets remain, pick the best element, add it to the hitting set, drop every set it hits. Print the running cover.
    my @open = @family;
    my @hit;
    while (@open) {
        my $e = pick_best(@open);
        last unless defined $e;
        push @hit, $e;
        @open = grep { !exists $_->{$e} } @open;
    }
    print "hitting set: @hit\n";
  4. 04
    Drive it on a small example. The output is a set of elements that touches every input set — within an ln(n) factor of the minimum.
    $ echo 'a b c
    b d
    c e
    a e' | perl hitting.pl
    hitting set: a b c
Grammar — Perl

Perl is the duct tape of the Unix world. Its sigils — $ for scalars, @ for arrays, % for hashes — are noisy but readable, and its regex/text-handling primitives are unmatched for the kind of log-and-corpus work hitting set lives in.

use strict; use warnings;
always include these. They turn typos and unscoped variables into errors instead of subtle bugs.
my $x = 1;
lexically scoped scalar. The sigil $ marks single values.
my @arr = (1, 2, 3);
an array. The @ sigil marks "many values."
my %h = (a => 1);
a hash (associative array). The % sigil marks key/value pairs.
\%h
a hash reference — a scalar pointing at the hash. Written with a leading backslash.
$ref->{key}
arrow dereference — read the `key` slot of a hash referenced by $ref.
split /pat/, $str
regex split — break a string on every match of pattern, return a list.
sort { $a <=> $b } @list
sorted list. The block defines the comparison; <=> is numeric three-way compare.
grep { COND } @list
filter — keep elements where COND is true. Companion: map { EXPR } @list for transformation.
Read each line as a set of items, encode as a hash whose keys are the items. The hash-as-set idiom: only the keys matter, values are just truthy markers.
my @family;
while (my $line = <STDIN>) {
    chomp $line;
    my %set = map { $_ => 1 } split /\s+/, $line;
    push @family, \%set;
}
# now $family[0]->{"alpha"} is 1 if "alpha" appeared on line 1.
Perl — getting started →

Scan to come back to page 44

← 43 Hit them all
2/3 · 44/63
45 Test suite minimization →