You've managed to avoid compiling the same regexps over and over again. That's good, though in a way it's unfortunate: that common problem is easy to fix, and fixing it produces great results.

Move (?{$num}) to the *end* of each pattern. As written, you're calling up to 8000 Perl subs when you only need to call one (and all 8000 when there's no match, where you need to call none).
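To make the difference concrete, here is a minimal sketch that counts how often the embedded code runs in each layout. The name table, build_re, count_calls and the $main::calls / $main::hit variables are all made up for illustration; they are not from your program.

```perl
use strict;
use warnings;
use re 'eval';   # needed because the patterns are interpolated from strings

# Hypothetical name => account table, standing in for the real data.
my %acct_for = (
    'alice smith' => 101,
    'bob jones'   => 202,
    'carol white' => 303,
);

# Build one big alternation, with the code block either first or last
# in each alternative. Package variables keep the embedded code simple.
sub build_re {
    my ($code_first) = @_;
    my @alts;
    for my $name (sort keys %acct_for) {
        my ($first, $last) = split ' ', $name;
        my $code = "(?{ \$main::calls++; \$main::hit = $acct_for{$name} })";
        my $pat  = "$first\\s*$last";
        push @alts, $code_first ? "$code$pat" : "$pat$code";
    }
    my $alternation = join '|', @alts;
    return qr/$alternation/;
}

# Return (number of code blocks executed, account recorded) for one match attempt.
sub count_calls {
    my ($code_first, $text) = @_;
    my $re = build_re($code_first);
    ($main::calls, $main::hit) = (0, undef);
    $text =~ $re;
    return ($main::calls, $main::hit);
}

my $text = 'invoice 17 was paid by bob jones on the 3rd';
printf "code first: %d calls\n", (count_calls(1, $text))[0];
printf "code last:  %d calls\n", (count_calls(0, $text))[0];
```

With the code block last, it runs once for a matching line and not at all for a non-matching one. With the code block first, it runs every time an alternative is attempted, so the count grows with both the number of names and the length of the text.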

Are you using Perl 5.10? If not, you should use Regexp::Assemble instead of join '|'. Both Perl 5.10 and Regexp::Assemble factor out common prefixes in patterns, which makes matching against big alternations faster. This only kicks in once the code blocks have been moved to the end as described above.

         $re{"$first$last"}||="(?{$acct})$first\\s*$last|(?{$acct})$last,?\\s+$first";

         $first=substr($origfirst,0,1);
         $re2{"$first$last"}||="(?{$acct})\\b$last,?\\s+$first"
            unless exists $hExclude->{lc $origlast} or length($last)<4;

becomes

         push @re, "$first\\s*$last(?{$acct})",
                   "$last,?\\s+$first(?{$acct})"
            if !$re{"$first$last"}++;

         $first=substr($origfirst,0,1);
         push @re2, "\\b$last,?\\s+$first(?{$acct})"
            if !exists($hExclude->{lc $origlast})
            && length($last)>=4
            && !$re2{"$first$last"}++;
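The hash is now only a "seen" filter, and the patterns themselves go into an array as separate alternatives. A small standalone sketch of that idiom, using made-up records (@people and %seen are hypothetical names, not from your code):

```perl
use strict;
use warnings;

# Hypothetical (first, last, account) records; the third repeats a name.
my @people = (
    [ 'alice', 'smith', 101 ],
    [ 'bob',   'jones', 202 ],
    [ 'alice', 'smith', 999 ],   # duplicate name, so it is skipped
);

my (%seen, @re);
for my $p (@people) {
    my ($first, $last, $acct) = @$p;
    # $seen{...}++ is false only the first time a key is encountered,
    # so each name contributes its two patterns exactly once.
    push @re, "$first\\s*$last(?{$acct})",
              "$last,?\\s+$first(?{$acct})"
        if !$seen{"$first$last"}++;
}

print "$_\n" for @re;
```

As with ||=, the first account seen for a name wins. The difference is that each pattern is its own list element, which is what lets the later join or Regexp::Assemble step treat them individually.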

and

      $re=join('|',sort ocr_sort values %re);
      $re2=join('|',sort ocr_sort values %re2);

becomes

      # 5.10.0 and higher
      $re  = join('|', sort ocr_sort @re);
      $re2 = join('|', sort ocr_sort @re2);

      # Any version of Perl
      # (needs "use Regexp::Assemble;" at the top of the script)
      $re = do {
         my $ra = Regexp::Assemble->new();
         $ra->add($_) for sort ocr_sort @re;
         $ra->re;
      };
      $re2 = do {
         my $ra = Regexp::Assemble->new();
         $ra->add($_) for sort ocr_sort @re2;
         $ra->re;
      };

In reply to Re: Regexp and OCR by ikegami
in thread Regexp and OCR by sflitman