sflitman has asked for the wisdom of the Perl Monks concerning the following question:

I have a modest database of people's names and documents which I OCR with the wonderful PDF::OCR2 and Tesseract. I have written some very convoluted code which builds a gigantic regexp to match against the OCR'd text. It tries to handle a number of cases of distorted letters and misspellings, and remarkably it actually works fairly well on incoming faxes from Hylafax, but it is SLOW and it is not scaling at all well with 8000 names in the database.
# (part of much larger codebase which uses strict)

my %TPL; # hash for storing objects which persist between invocations of ocr_name

sub trim_word { # trim words
    my $ref = shift;
    $$ref =~ s/\([^)]+\)//g;
    $$ref =~ s/\s+$//;
    $$ref =~ s/^\s+//;
    $$ref =~ s/[-\s]/.*/g;   # internal hyphens or spaces
}

sub fuzzify_word { # try to relax matching for common OCR mistakes
    my $ref = shift;
    $$ref =~ s/O/[OD]/ig;    # O <-> O or D
    $$ref =~ s/G/[gc]/ig;    # uc G <-> C
    $$ref =~ s/[ij]/[ij]/g;  # lc i <-> j
    $$ref =~ s/[HN]/[hn]/ig; # uc H <-> N
    $$ref =~ s/W/[wn]/ig;    # uc W <-> N
    $$ref =~ s/T/[it]/ig;    # T or t <-> i or t
    $$ref =~ s/C/[CL]/ig;    # C or c <-> C or L
}

sub ocr_sort { # sorting function, put longer names first
    length($b) <=> length($a);
}

sub ocr_name { # examine a file with OCR to get a name
    my ($file, $hExclude, $hIgnore) = @_; # file, hash of excluded words, hash of ignored words
    my ($first, $last, $acct, $origfirst, $origlast, $text, $ignored, %match);
    unless ($TPL{ocr_match}) { # regenerate big hash & regexps
        my ($sth, $re, $re2, %re, %re2);
        $sth = sql("select lastname,firstname,acct from names order by acct desc");
            # calls DBI (code not shown), desc so more recent accounts are matched earlier (maybe?)
        while (($last, $first, $acct) = $sth->fetchrow) {
            trim_word(\$first);
            trim_word(\$last);
            $origfirst = $first;
            $origlast  = $last;
            fuzzify_word(\$first);
            fuzzify_word(\$last);
            # use (?{...}) to store the account if this branch matches
            $re{"$first$last"} ||= "(?{$acct})$first\\s*$last|(?{$acct})$last,?\\s+$first";
            # catch unusual last names
            $first = substr($origfirst, 0, 1);
            $re2{"$first$last"} ||= "(?{$acct})\\b$last,?\\s+$first"
                unless exists $hExclude->{lc $origlast} or length($last) < 4;
        }
        $sth->finish;
        $re  = join('|', sort ocr_sort values %re);
        $re2 = join('|', sort ocr_sort values %re2);
        $TPL{ocr_match} = sub {
            my $text = shift;
            my @matches;
            die "Bad re $re" unless index($re, '|') > 0;
            use re 'eval';
            while ($text =~ /$re/gis) { push @matches, $^R; }
            @matches;
        };
        $TPL{ocr_match2} = sub {
            my $text = shift;
            my @matches;
            die "Bad re $re2" unless index($re2, '|') > 0;
            use re 'eval';
            while ($text =~ /$re2/gis) { push @matches, $^R; }
            @matches;
        };
    }
    $text = run_ocr($file); # convert file and invoke OCR, returning raw OCR'd text output
    $ignored = join('|', keys %$hIgnore);
    $text =~ s/($ignored)//ig;  # delete words which cause false hits
    map { $match{$_}++ } $TPL{ocr_match}->($text);  # call precompiled proc
    map { $match{$_}++ } $TPL{ocr_match2}->($text);
    return keys %match; # array of accounts of named individuals in document
}
Any and all comments appreciated, I know I'm doing something dumb here. The computed $re is now 377,000 bytes long and $re2 is 141,000 bytes long. These are huge REs which look like this:
...(?{474})[CL]r[ij]s[it][ij]a[hn][ij],?\s+S[it]eve|(?{487})... # that's matching fictitious account 00474, Mr. Steve Christian

Humbly,
SSF

Re: Regexp and OCR
by ikegami (Patriarch) on Jun 20, 2009 at 06:41 UTC

    You've managed to avoid compiling the same regexps over and over again. That's good, although it's a shame in a way, since that common problem is easy to fix and fixing it produces great results.

    Move (?{$acct}) to the *end* of each alternative. As written, you're calling up to 8000 Perl code blocks when you only need to call one. (And all 8000 when you need to call none, if there's no match.)
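    The difference is easy to see with a toy alternation (a minimal sketch, not the poster's pattern): a code block at the head of a branch runs every time the engine merely *tries* that branch, while one at the tail runs only after the branch has actually matched.

```perl
use strict;
use warnings;

my $runs;

# (?{...}) at the START of each branch: executed for every branch the
# engine attempts, including the ones that fail.
$runs = 0;
"bar" =~ /(?{ $runs++ })foo|(?{ $runs++ })bar/;
print "leading blocks ran: $runs\n";   # 2 -- the failed 'foo' branch ran its block too

# (?{...}) at the END of each branch: executed only when that branch matches.
$runs = 0;
"bar" =~ /foo(?{ $runs++ })|bar(?{ $runs++ })/;
print "trailing blocks ran: $runs\n";  # 1
```

    With 8000 branches and the blocks moved to the tail, a scan that finds nothing executes no Perl code at all, instead of thousands of blocks per starting position.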

    Are you using Perl 5.10? If not, you should use Regexp::Assemble instead of join '|'. Both (5.10's trie optimization and Regexp::Assemble) factor out common prefixes in big alternations for faster matching. Either way, the change in the previous paragraph is required for this to kick in.
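    For illustration, here is roughly how Regexp::Assemble is used (a sketch assuming the CPAN module is installed; the exact shape of the assembled pattern may vary between versions):

```perl
use strict;
use warnings;
use Regexp::Assemble;

my $ra = Regexp::Assemble->new;
$ra->add($_) for qw(christian christiansen christie);

# ->re returns a single compiled qr// in which the shared prefix
# "christi" appears only once, instead of three '|' branches.
my $re = $ra->re;

print "christiansen" =~ /^$re$/ ? "match\n" : "no match\n";
```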


    So
    $re{"$first$last"} ||= "(?{$acct})$first\\s*$last|(?{$acct})$last,?\\s+$first";
    $first = substr($origfirst, 0, 1);
    $re2{"$first$last"} ||= "(?{$acct})\\b$last,?\\s+$first"
        unless exists $hExclude->{lc $origlast} or length($last) < 4;
    becomes
    push @re,
        "$first\\s*$last(?{$acct})",
        "$last,?\\s+$first(?{$acct})"
        if !$re{"$first$last"}++;
    $first = substr($origfirst, 0, 1);
    push @re2, "\\b$last,?\\s+$first(?{$acct})"
        if !exists($hExclude->{lc $origlast})
        && length($last) >= 4
        && !$re2{"$first$last"}++;
    and
    $re  = join('|', sort ocr_sort values %re);
    $re2 = join('|', sort ocr_sort values %re2);
    becomes
    # 5.10.0 and higher
    $re  = join('|', sort ocr_sort @re);
    $re2 = join('|', sort ocr_sort @re2);
    # Any version of Perl
    $re = do {
        my $ra = Regexp::Assemble->new();
        $ra->add($_) for sort ocr_sort @re;
        $ra->re;
    };
    $re2 = do {
        my $ra = Regexp::Assemble->new();
        $ra->add($_) for sort ocr_sort @re2;
        $ra->re;
    };
      Very much appreciated! I never realized that the ?{} was a sub, but of course it is. I am very impressed with Regexp::Assemble. I am using 5.8.8 on the server where my app is located, but 5.10 on my laptops.

      SSF

        Regexp::Assemble is absolutely fantastic. I've been using it since perl 5.8.6 and it has never let me down.
      A quick note: under Perl 5.8.8, Regexp::Assemble barfs on patterns containing ?{} with the error "Eval-group not allowed at runtime, use re 'eval' in regex m/.../", which comes from the else clause of the _build_re routine inside Assemble.pm. I didn't report this as a bug because under 5.8.8 you really should use the module's tracking feature, which contains the $^R workaround. Interestingly, modifying the module (I know, I know ;-) to add a use re 'eval'; line in that else clause makes the error disappear, but the resulting regexp doesn't work. My production code now anticipates Perl 5.10 and simply joins the individual expressions with |, with the (?{$acct}) at the end as ikegami said, and the speedup was phenomenal! It can only get better with 5.10's trie building, but I'm not quite ready to upgrade my server yet.

      SSF

Re: Regexp and OCR
by ambrus (Abbot) on Jun 28, 2009 at 11:06 UTC

    Apparently all your transformations either remove insignificant characters (hyphens and spaces) or treat some classes of characters as equivalent (like "i" and "j"). In that case I think you don't even need regexen. Just choose a representative character for each class, then preprocess your list of names by removing the insignificant characters and normalizing the easy-to-confuse characters to their representatives. For example, you'd add a column to your database where this normalized name is stored, fill it with the names with spaces removed, all "j" replaced with "i", etc., and index on this column. Then, when you OCR a name, you normalize it the same way and search for the normalized string in that column.
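    A minimal sketch of that idea, using representatives derived from the fuzzify_word classes in the original post (the merged classes are my reading of them: transitively, O/D, C/G/L, I/J/T and H/N/W each collapse to one letter):

```perl
use strict;
use warnings;

# Normalize a name to a canonical form: lowercase, strip insignificant
# characters, and map every OCR confusion class to one representative.
# Mapping follows fuzzify_word: O/D -> o, C/G/L -> c, I/J/T -> i, H/N/W -> h.
sub normalize_name {
    my ($name) = @_;
    $name = lc $name;
    $name =~ s/\([^)]*\)//g;      # drop parenthesized parts, as trim_word does
    $name =~ s/[-\s,]+//g;        # hyphens, spaces and commas are insignificant
    $name =~ tr/dgljtnw/occiihh/; # collapse each confusion class to its representative
    return $name;
}

# Applying the same normalizer to both the stored column and the OCR'd
# text turns fuzzy matching into an exact (and indexable) comparison:
print normalize_name("Cristian"), "\n";   # crisiiah
print normalize_name("Crjstjan"), "\n";   # crisiiah
```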

Re: Regexp and OCR
by dk (Chaplain) on Jun 22, 2009 at 12:43 UTC
    I'm thinking aloud, apologies if this is uninteresting, but here are my thoughts. In your example, the input can be, f.ex., "Crjstjan" and the regex matching it "[CL]r[ij]s[it][ij]a[hn]", which I assume burdens the regex engine somewhat with all those alternations. I wonder if it would be faster to make a pre-run over $text so that it is first transformed into a text containing the alternations verbatim, e.g. make all instances of "Cristian" and "Crjstjan" look like "<CL>r<ij>s<it><ij>a<hn>" (<> instead of [] just for the sake of visual difference) in the first place.

    Possibly that helps with the speed, and if it does, things may become interesting. First, the regex matching the names can then be a simple concatenation of lexemes like the above; second, the second regex run may not be needed at all, since a trivial hash replacement would be enough, something along the lines of the following:

    my %replace = (
        "<CL>r<ij>s<it><ij>a<hn>" => "Christian",
        ...
    );
    $text =~ s/\b(\w+)\b/exists($replace{$1}) ? $replace{$1} : $1/ge;
    ( I know it is naive, I've seen that you match sentences, not individual words, but still ).

    Again, if the alternations consist of at most 4 characters, I'm thinking that instead of composing them into an "<ab>" structure, one could pack them into a single Unicode character, f.ex. pack("U", (ord("a") << 8) + ord("b")), possibly gaining some extra milliseconds.
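    A sketch of what I take dk's packing idea to mean, with hypothetical helper names: fold a two-character confusion class like <ij> into one synthetic code point, so later passes can work character by character.

```perl
use strict;
use warnings;

# Encode a two-character confusion class as a single code point by
# packing both ordinals into one number (8 bits per ASCII character).
sub pack_pair {
    my ($p, $q) = @_;
    return pack("U", (ord($p) << 8) | ord($q));
}

# Recover the original pair from the synthetic character.
sub unpack_pair {
    my ($c) = @_;
    my $n = ord($c);
    return (chr($n >> 8), chr($n & 0xFF));
}

my $one_char = pack_pair("i", "j");   # a single character, code point 0x696A
my ($p, $q)  = unpack_pair($one_char);
print length($one_char), " $p $q\n";  # 1 i j
```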

      I'd suggest coding up the two approaches, running them with Benchmark, and posting the results. For my problem this wouldn't work, because the OCR text is already variable: replacing Cristian in that text is essentially the same problem as identifying Cristian as account number 14222 in the first place.

      SSF