in reply to Regexp and OCR
You've managed to avoid compiling the same regexps over and over again. That's good, but it's unfortunate because this common problem is easy to fix and fixing it produces great results.
Move (?{$num}) to the *end* the match. You're calling up to 8000 Perl subs when you only need to call one. (All 8000 when you need to call none if there's no match.)
Are you using Perl 5.10? If not, you should use Regexp::Assemble instead of join '|'. Both factor out common prefixes in patterns for faster matches in big alternations. This will require the change in paragraph two to kick in.
becomes$re{"$first$last"}||="(?{$acct})$first\\s*$last|(?{$acct})$la +st,?\\s+$first"; $first=substr($origfirst,0,1); $re2{"$first$last"}||="(?{$acct})\\b$last,?\\s+$first" unless exists $hExclude->{lc $origlast} or length($last)<4 +;
andpush @re, "$first\\s*$last(?{$acct})", "$last,?\\s+$first(?{$acct})" if !$re{"$first$last"}++; $first=substr($origfirst,0,1); push @re2, "\\b$last,?\\s+$first(?{$acct})" if !exists($hExclude->{lc $origlast}) && length($last)>=4; && !$re2{"$first$last"}++
becomes$re=join('|',sort ocr_sort values %re); $re2=join('|',sort ocr_sort values %re2);
# 5.10.0 and higher $re = join('|', sort ocr_sort @re); $re2 = join('|', sort ocr_sort @re2);
# Any version of Perl $re = do { my $ra = Regexp::Assemble->new(); $ra->add($_) for sort ocr_sort @re; $ra->re }; $re2 = { my $ra = Regexp::Assemble->new(); $ra->add($_) for sort ocr_sort @re2; $ra->re };
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re^2: Regexp and OCR
by sflitman (Hermit) on Jun 20, 2009 at 09:15 UTC | |
by kdejonghe (Novice) on Jun 21, 2009 at 19:42 UTC | |
|
Re^2: Regexp and OCR
by sflitman (Hermit) on Jun 27, 2009 at 22:21 UTC |