I decided to take a stab at optimizing the regular expression itself. Here's what I did. First, I backed out of your regular expression all the possible strings that would match exactly. Then I used those as the basis for new patterns that had no alternation at all. I sorted each pattern to get similarities grouped together. Then I categorized them by common keyword, and further subcategorized them by whether that keyword had additional text after it, or before it (or both). That provided the basis for new alternation logic.

From here on out I did everything programatically: Sorted all of the alternations in order from greatest length to shortest length. Then sorted each of the groupings in order from longest to shortest, again forming another layer of alternations.

The goal was to bite off the biggest possible chunks first, smaller later. And to never have more than two layers of alternation in any given potential match.

Here's the resulting code (and I apologize right now for the fact that I haven't taken advantage of the /x modifier; the regexes were constructed programatically):

#!/usr/bin/env perl use strict; use warnings; use Benchmark qw( cmpthese ); my @patterns = ( '(?:Internal\s+Revenue\s+Service|invaluable\s+practical|revenu +e\s+recognition|Treasury\s+Department|ETIF\s+accounting|implementatio +n|administrative|authoritative|interpretive|superceding|regulatory|ap +plicable|accounting|definitive|government|any\s+ruling|transition|pro +cedural|receiving|technical|under\s+the|staff\'s|judicial|Treasury|va +luable|specific|college|related|interim|valued|absent|their|his|SEC|F +DA|IRS)\s+guidance', 'provided\s+by\s+(?:the\s+Securities\s+and\s+Exchange\s+Commis +sion|Securities\s+and\s+Exchange\s+Commission|the\s+Financial\s+Accou +nting|Financial\s+Accounting|the\s+United\s+States|the\s+Secretary|Un +ited\s+States|Secretary|the\s+SEC|SEC)', 'issued\s+by\s+(?:the\s+Securities\s+and\s+Exchange\s+Commissi +on|Securities\s+and\s+Exchange\s+Commission|the\s+Financial\s+Account +ing|Financial\s+Accounting|the\s+United\s+States|the\s+Secretary|Unit +ed\s+States|Secretary|the\s+SEC|SEC)', 'assumes\s+guidance\s+of\s+(?:the\s+talented\s+team|the\s+comp +ensation|a\s+talented\s+team|a\s+compensation|the\s+company|the\s+boa +rd|a\s+company|a\s+board)', 'guidance\s+(?:promulgated\s+thereunder|promulgated|to\s+appro +ve|technology|and\s+rules|facility|software|in\s+SFAS|system)', '(?:may\s+contain\s+statements\s+about\s+future\s+events,|if\s ++on\s+negative|suggesting\s+an|Microsoft|rating)\s+outlook', 'according\s+to\s+the\s+guidance\s+contained', 'other\s+guidance\s+(?:issued|under|from)', 'provide\s+guidance\s+to\s+directors', 'current\s+guidance\s+(?:under|from)', 'applicable\s+guidance\s+issued', 'outlook\s+for\s+any\s+rating', ); my $alternations = '\b' . join( '\b|\b', @patterns ) . '\b'; $main::regex_opt = qr/$alternations/; $main::regex_old = qr/outlook\s+for\s+any\s+rating|(?:rating|if\s+on\s ++negative|Microsoft|suggesting\s+an|may\s+contain\s+statements\s+abou +t\s+future\s+events\,|business\s+conditions\s+and\s+the)\s+outlook|gu +idance\s+(?:to\s+approve|facility) |(?:authoritative|revenue\s+recognition|invaluable\s ++practical|valuable|regulatory|technical|under\s+the|staff\'s|judicia +l|SEC|FDA|Treasury(?:\s+Department)?|specific|implementation|their|go +vernment|any\s+ruling|college|absent|\s+his|interim|interpretive|tran +sition|administrative|procedural|related|applicable|accounting|defini +tive|superceding|IRS|Internal\s+Revenue\s+Service|valued|EITF\s+accou +nting)\s+guidance |guidance\s+(?:and\s+rules|promulgated(?:\s+thereund +er)?|in\s+SFAS)|(?:provided|issued)\s+by\s+(?:the\s+)?(?:SEC|Securiti +es\s+and\s+Exchange\s+Commission|Internal\s+Revenue\s+Service|Secreta +ry|United\s+States|Financial\s+Accounting) |(?:other|applicable)\s+guidance\s+issued|according\ +s+to\s+the\s+guidance\s+contained|provide\s+guidance\s+to\s+directors +|receiving\s+guidance |(?:current|other)\s+guidance\s+(?:under|from)|assum +es\s+guidance\s+of\s+(?:the|a)\s+(?:company|board|talented\s+team|com +pensation)|guidance\s+(?:system|software|technology) /xi; $main::string = do { local $/ = undef; <DATA>; }; sub reopt { my $counter = () = $main::string =~ m/$main::regex_opt/g; return $counter; } sub reold { my $counter = () = $main::string =~ m/$main::regex_old/g; return $counter; } print reopt(), "\n"; print reold(), "\n"; cmpthese ( -10, { reopt => \&reopt, reold => \&reold, } ); __DATA__ outlook for any rating rating outlook if on negative outlook Microsoft outlook suggesting an outlook may contain statements about future events, outlook guidance to approve guidance facility authoritative guidance revenue recognition guidance invaluable practical guidance valuable guidance regulatory guidance technical guidance under the guidance staff's guidance judicial guidance SEC guidance FDA guidance Treasury guidance Treasury Department guidance specific guidance implementation guidance their guidance government guidance any ruling guidance college guidance absent guidance his guidance interim guidance interpretive guidance transition guidance administrative guidance procedural guidance related guidance applicable guidance accounting guidance definitive guidance superceding guidance IRS guidance Internal Revenue Service guidance valued guidance ETIF accounting guidance guidance and rules guidance promulgated guidance promulgated thereunder guidance in SFAS provided by the SEC provided by the Securities and Exchange Commission provided by the Secretary provided by the United States provided by the Financial Accounting provided by SEC provided by Securities and Exchange Commission provided by Secretary provided by United States provided by Financial Accounting issued by the SEC issued by the Securities and Exchange Commission issued by the Secretary issued by the United States issued by the Financial Accounting issued by SEC issued by Securities and Exchange Commission issued by Secretary issued by United States issued by Financial Accounting other guidance issued applicable guidance issued according to the guidance contained provide guidance to directors receiving guidance current guidance under other guidance under current guidance from other guidance from assumes guidance of the company assumes guidance of the board assumes guidance of the talented team assumes guidance of the compensation assumes guidance of a company assumes guidance of a board assumes guidance of a talented team assumes guidance of a compensation guidance system guidance software guidance technology

And here's the benchmark output:

87 87 Rate reold reopt reold 1318/s -- -75% reopt 5189/s 294% --

...and on a more beefy box...

87 87 Rate reold reopt reold 2585/s -- -74% reopt 10114/s 291% --

Both were run with Perl v5.16.1.

It would be a much better test if I had your actual input data. Instead, I have all the possible ${^MATCH} terms (in the __DATA__ segment). I'm only verifying that all the possible matches are matching. You'll have to verify that rejections are occurring as they should, by testing against some of your real data.

As you can see, we get almost a 3x performance improvement by optimizing the alternations. But the data set I'm testing with is highly contrived. The only way to really know how much this improves upon your existing solution is to benchmark it with real data. And I still believe there is value in knowing what other bottlenecks you might have, which can be done by profiling your actual code with a real dataset.


Dave


In reply to Re: Help with speeding up regex by davido
in thread Help with speeding up regex by eversuhoshin

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.