My general question is how to improve the performance of a regular expression. I have a fairly simple log analysis program, and I've profiled it, and it spends more time matching the regular expressions that I've made for it than anything else. This may be because my input is rather large, and my "analysis" is to feed it to Text::CSV.

In any case, it occurred to me that I have no idea how to improve matching performance. I'm assuming there is not something like Devel::NYTProf for regular expressions. I don't know much about how the engine works under the hood, so I can't make any guesses about what might be taking the most time. I know that it's possible to write an expression that takes a year and a day to match, but I don't know the particulars.

The actual expressions I'm using are in the readmore block below. My input files are Apache logs in a somewhat customized format.

# Things like ( (?:internal_ip){0} . ) were originally Perl 5.10 # named captures like (?<internal_ip> . ), but I don't # expect to have 5.10 on the target. # [04/Oct/2009:06:25:20 -0500] my $time_rx = qr{ \[ ( (?:day){0} \d\d ) / ( (?:mon){0} Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec ) / ( (?:year){0} \d{4} ) : ( (?:hour){0} \d\d ) : ( (?:min){0} \d\d ) : ( (?:sec){0} \d\d ) \s ( (?:tz){0} [+-] \d{4} ) \] }xms; my @time_rx_fields = qw( day mon year hour min sec tz ); my $request_rx = qr{ \" ( (?:method){0} [A-Z]+ ) \s ( (?:url){0} .*? ) \s ( (?:protocol){0} HTTP/1.\d ) \" }xms; my @request_rx_fields = qw( method url protocol ); my $ip_rx = qr{ [12]?\d?\d (?: \. [12]?\d?\d ){3} }xms; my $pid_log_rx = qr{ ^ ( (?:user){0} \S+ ) \s $time_rx \s $request_rx \s ( (?:status){0} \d{3} ) \s ( (?:bytes){0} - | \d+ ) \s \[ ( (?:pid){0} \d+ ) \] \s* $ }xms; my @pid_log_rx_fields = ( 'user', @time_rx_fields, @request_rx_fields, 'status', 'bytes', 'pid' ); my $frontend_log_rx = qr{ ^ ( (?:external_ip){0} $ip_rx ) \s* , \s* ( (?:internal_ip){0} $ip_rx ) \s ( (?:group){0} \S+ ) \s ( (?:user){0} \S+ ) \s $time_rx \s $request_rx \s ( (?:status){0} \d{3} ) \s ( (?:bytes){0} - | \d+ ) \s ( (?:ms){0} \d+ ) \s* $ }xms; my @frontend_log_rx_fields = ( 'external_ip', 'internal_ip', 'group', 'user', @time_rx_fields, @request_rx_fields, 'status', 'bytes', 'ms' ); # Those get used here: sub line_in { my ( $fh, $rx, $fields_ref ) = @_; return if ref $fields_ref ne ref []; my $line = <$fh>; return if ! defined $line; my @captures = $line =~ $rx; return if ! @captures; return { line => $line, map { $fields_ref->[$_] => $captures[$_] } 0 .. $#{$fields_ref} }; }

I'd welcome any suggestions on this particular problem, but I'm really interested in the more general question of how I could answer this question myself. Are there heuristics you've learned? Is there a tutorial on this topic? Any guidance would be appreciated!


In reply to How do I optimize a regular expression? by kyle

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.