You've managed to avoid compiling the same regexps over and over again. That's good, but it's unfortunate because this common problem is easy to fix and fixing it produces great results.
Move (?{$num}) to the *end* the match. You're calling up to 8000 Perl subs when you only need to call one. (All 8000 when you need to call none if there's no match.)
Are you using Perl 5.10? If not, you should use Regexp::Assemble instead of join '|'. Both factor out common prefixes in patterns for faster matches in big alternations. This will require the change in paragraph two to kick in.
becomes$re{"$first$last"}||="(?{$acct})$first\\s*$last|(?{$acct})$la +st,?\\s+$first"; $first=substr($origfirst,0,1); $re2{"$first$last"}||="(?{$acct})\\b$last,?\\s+$first" unless exists $hExclude->{lc $origlast} or length($last)<4 +;
andpush @re, "$first\\s*$last(?{$acct})", "$last,?\\s+$first(?{$acct})" if !$re{"$first$last"}++; $first=substr($origfirst,0,1); push @re2, "\\b$last,?\\s+$first(?{$acct})" if !exists($hExclude->{lc $origlast}) && length($last)>=4; && !$re2{"$first$last"}++
becomes$re=join('|',sort ocr_sort values %re); $re2=join('|',sort ocr_sort values %re2);
# 5.10.0 and higher $re = join('|', sort ocr_sort @re); $re2 = join('|', sort ocr_sort @re2);
# Any version of Perl $re = do { my $ra = Regexp::Assemble->new(); $ra->add($_) for sort ocr_sort @re; $ra->re }; $re2 = { my $ra = Regexp::Assemble->new(); $ra->add($_) for sort ocr_sort @re2; $ra->re };
In reply to Re: Regexp and OCR
by ikegami
in thread Regexp and OCR
by sflitman
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |