in reply to Regexp and OCR

You've managed to avoid compiling the same regexps over and over again. That's good, but it also rules out the usual easy win: that common problem is simple to fix, and fixing it produces great results.

Move (?{$num}) to the *end* of the match. As written, you're calling up to 8000 Perl subs when you only need to call one. (And all 8000 when there's no match, where you need to call none.)
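To make the difference concrete, here is a minimal sketch that counts how many code blocks run in each layout. The %surname data, the $calls counter and the count_calls helper are made up for illustration and are not from the original code; the patterns are also much simpler than the real ones.

```perl
use strict;
use warnings;
use re 'eval';    # required: the patterns below are interpolated at runtime

# Hypothetical data for illustration: account number => surname.
my %surname = (101 => 'smith', 102 => 'smythe', 103 => 'jones');
my @accts   = sort keys %surname;

our $calls = 0;   # incremented by every (?{...}) block that runs

# Code block first: it runs before the engine knows whether the name matches.
my $front = join '|', map { "(?{ \$main::calls++; $_ })$surname{$_}" } @accts;
# Code block last: it runs only after the name itself has matched.
my $back  = join '|', map { "$surname{$_}(?{ \$main::calls++; $_ })" } @accts;

# Returns (number of code blocks run, account number or undef).
sub count_calls {
    my ($alternation, $text) = @_;
    $calls = 0;
    my $matched = $text =~ /\b(?:$alternation)\b/;
    return ($calls, $matched ? $^R : undef);
}

for my $text ('call jones today', 'nobody we know') {
    my ($f_calls, $f_acct) = count_calls($front, $text);
    my ($b_calls, $b_acct) = count_calls($back,  $text);
    printf "%-18s front: %d call(s), back: %d call(s), account: %s\n",
        $text, $f_calls, $b_calls, defined $b_acct ? $b_acct : 'none';
}
```

Both layouts should report the same account through $^R, but the trailing layout only runs a block once a name has actually matched. The exact counts for the leading layout depend on how many start positions the engine tries.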

Are you using Perl 5.10? If not, you should use Regexp::Assemble instead of join '|'. Both Perl 5.10 and Regexp::Assemble factor out common prefixes in patterns, which makes matching big alternations faster. That optimization only kicks in once you've made the change in the second paragraph.


So
$re{"$first$last"} ||= "(?{$acct})$first\\s*$last|(?{$acct})$last,?\\s+$first";
$first = substr($origfirst, 0, 1);
$re2{"$first$last"} ||= "(?{$acct})\\b$last,?\\s+$first"
    unless exists $hExclude->{lc $origlast} or length($last) < 4;
becomes
push @re, "$first\\s*$last(?{$acct})", "$last,?\\s+$first(?{$acct})"
    if !$re{"$first$last"}++;
$first = substr($origfirst, 0, 1);
push @re2, "\\b$last,?\\s+$first(?{$acct})"
    if !exists($hExclude->{lc $origlast})
    && length($last) >= 4
    && !$re2{"$first$last"}++;
and
$re  = join('|', sort ocr_sort values %re);
$re2 = join('|', sort ocr_sort values %re2);
becomes
# 5.10.0 and higher
$re  = join('|', sort ocr_sort @re);
$re2 = join('|', sort ocr_sort @re2);

# Any version of Perl (needs "use Regexp::Assemble;")
$re = do {
    my $ra = Regexp::Assemble->new();
    $ra->add($_) for sort ocr_sort @re;
    $ra->re
};
$re2 = do {
    my $ra = Regexp::Assemble->new();
    $ra->add($_) for sort ocr_sort @re2;
    $ra->re
};

Re^2: Regexp and OCR
by sflitman (Hermit) on Jun 20, 2009 at 09:15 UTC
    Very much appreciated! I never realized that the (?{}) was a sub, but of course it is. I am very impressed with Regexp::Assemble. I am using 5.8.8 on the server where my app is located, but 5.10 on my laptops.

    SSF

      Regexp::Assemble is absolutely fantastic. I've been using it since Perl 5.8.6 and it has never let me down.
Re^2: Regexp and OCR
by sflitman (Hermit) on Jun 27, 2009 at 22:21 UTC
    A quick note. Regexp::Assemble barfs on patterns containing (?{}) in Perl 5.8.8 with the error "Eval-group not allowed at runtime, use re 'eval' in regex m/.../", which is raised inside Assemble.pm in the else clause of the _build_re routine. I didn't report this as a bug because under 5.8.8 you really should use the module's tracking feature, which contains the $^R workaround. Interestingly, modifying the module (I know, I know ;-) to add a use re 'eval'; line in that else clause makes the error disappear, but the resulting regexp doesn't work. My production code now anticipates Perl 5.10 and just joins the individual expressions with |, with the (?{$acct}) at the ends as ikegami said, and the speedup was phenomenal! It can only get better with 5.10's trie building, but I'm not quite ready to upgrade my server yet.

    SSF
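
For anyone who hits the same "Eval-group not allowed at runtime" error when joining patterns by hand, the following sketch reproduces it with core Perl only. The $pattern string is a made-up example, and this shows only the general pragma behaviour, not what happens inside Regexp::Assemble.

```perl
use strict;
use warnings;

# Hypothetical pattern with an embedded code block, held in a plain string.
my $pattern = 'jones(?{ 103 })';

# Without "use re 'eval'", interpolating the string into a regexp dies.
my $ok_plain = eval { my $re = qr/$pattern/; 1 };
my $error    = $@;

# With the pragma in lexical scope, the same interpolation compiles.
my $ok_pragma = eval { use re 'eval'; my $re = qr/$pattern/; 1 };

print "without pragma: ", ($ok_plain  ? "compiled\n" : "died: $error");
print "with pragma:    ", ($ok_pragma ? "compiled\n" : "died: $@");
```

The pragma is lexically scoped, so it has to be in effect where the pattern string is actually interpolated into a regexp.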