I've got a rather large (327M) file of multi-line records to process. For each record, I pull the key out and then look it up in a hash to see if it's a 'keeper' or not.

Records look like this:

##REPORT 01 A ARCUR1GMU# 00112106 F N ARCUR1 ARCU +R1 P 02 AR CURRENCY CONVERSION CONTROL - (BY PTR) + N 03 FROM FVENDAP1 04 CODE: 10 RECORDING & REPORTING GMU / NON-WHQ ONLY RE +PORTS 07 00112704 00200103 11 R 3 RD.RDSMODEL C 00 +08 0001 12 A 0007 000 000 00 +0 000 13 VDLTP02 14 S N VDLTB02 ##REPORT 01 A ARCUR10000 00112106 F N ARCUR1 ARCU +R1 P 02 AR CURRENCY CONVERSION CONTROL - (BY PTR) + N 07 00112704 00200103 11 R 3 RD.RDSMODEL C 00 +08 0001 12 A 0007 000 000 00 +0 000 13 VDLTP02 14 S N VDLTB02

I've got a regex that technically works, but the performance is very poor (over 1 sec. per record). I wondered whether slurping the whole file in was the problem, but I'm not swapping at all, so that 327M is all in memory -- shouldn't be a problem, right? My best hunch is that the regex is smarter than I am, and it is doing some heavy-duty backtracking that I'm not understanding.

My intent for the regex: start capturing at a double hash, then non-greedily match anything (including newlines) until a positive lookahead finds either a) another double hash, denoting a new record, or b) EOF.

My code looks like this:

    #!/usr/bin/perl
    use strict;

    # input file of ##REPORT cards
    my $str = do { local $/ = undef; <STDIN> };

    # keepers
    my %keeplist;
    open (RIDS, '<', 'rids_that_have_versions.txt') or die;
    while (<RIDS>) { chomp; $keeplist{$_} = 1; }

    my @cards;
    while ( $str =~ /(##.*?)(?=(##|\Z))/gs ) {
        my $rid = substr($1, 19, 10);
        print "$rid: " . ($keeplist{$rid} ? 'y' : 'n') . "\n";
    }
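To check that the pattern does what is described above, here is a minimal sketch that runs the same regex over a tiny made-up input (the two sample records are invented for illustration, not taken from the real file):

```perl
use strict;
use warnings;

# Two tiny made-up records; the real ones span many more lines.
my $sample = "##REPORT one\n 02 detail\n##REPORT two\n 02 more\n";

# Same pattern as in the post: lazy match up to the next '##' or end of input.
my @records;
while ( $sample =~ /(##.*?)(?=(##|\Z))/gs ) {
    push @records, $1;
}

print scalar(@records), " records\n";
print "[$_]\n" for @records;
```

Both records are captured, so the pattern is functionally right on this input. One detail worth knowing: \Z also matches just before a final newline, so the last record comes back without its trailing newline.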

Update:

This is definitely a non-linear problem...
    Lines   Seconds
    20k     7
    40k     25
    80k     102
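A rough way to put a number on "non-linear", using only the three timings above: if run time grows like lines**k, then k can be estimated from each pair of consecutive runs. The helper name growth_exponent is made up for this sketch.

```perl
use strict;
use warnings;

# Timings copied from the table above: [lines, seconds].
my @runs = ( [20_000, 7], [40_000, 25], [80_000, 102] );

# If time grows like lines**k, then k = log(t2/t1) / log(n2/n1).
sub growth_exponent {
    my ($first, $second) = @_;
    return log( $second->[1] / $first->[1] ) / log( $second->[0] / $first->[0] );
}

for my $i ( 1 .. $#runs ) {
    printf "%dk -> %dk lines: exponent %.2f\n",
        $runs[$i-1][0] / 1000, $runs[$i][0] / 1000,
        growth_exponent( $runs[$i-1], $runs[$i] );
}
```

Both estimates come out close to 2, which is what roughly quadratic growth looks like: doubling the input about quadruples the time.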

Another Update:

Prompted by some insightful responses, I decided to minimize my dependence on regex in this case. I decided to buffer the file manually, and got the 80k line test time down to just over 1.2 seconds.

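    my $buffer = '';
    while (<STDIN>) {
        if ( /^##/ && length $buffer ) {    # new record starts: handle the buffered one
            my $rid = substr($buffer, 19, 10);
            $rid =~ s/\s+$//;
            print $buffer if $keeplist{$rid};
            $buffer = '';
        }
        $buffer .= $_;
    }
    # the last record is still in the buffer when input ends
    if ( length $buffer ) {
        my $rid = substr($buffer, 19, 10);
        $rid =~ s/\s+$//;
        print $buffer if $keeplist{$rid};
    }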
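A different technique from the manual buffering above, shown only as a sketch: Perl can do the record splitting itself if the input record separator $/ is set to the string that ends each record. The sub name read_records and the in-memory sample are made up for illustration, and the sketch assumes the input starts with a ## line.

```perl
use strict;
use warnings;

# Hypothetical helper: return the '##'-delimited records read from a filehandle.
# Assumes the input begins with a '##' line.
sub read_records {
    my ($fh) = @_;
    local $/ = "\n##";    # a record ends where the next '##' line begins
    my @records;
    my $first = 1;
    while ( defined( my $rec = <$fh> ) ) {
        chomp $rec;                         # strips a trailing "\n##", if present
        $rec = "##$rec" unless $first;      # the separator swallowed this record's marker
        $first = 0;
        $rec .= "\n" unless $rec =~ /\n\z/; # restore the newline chomp removed
        push @records, $rec;
    }
    return @records;
}

# In-memory filehandle standing in for STDIN.
my $data = "##REPORT one\n 02 detail\n##REPORT two\n 02 more\n";
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
print for read_records($fh);
```

For the real job, the substr key extraction and the %keeplist lookup would go inside the while loop instead of collecting every record in an array.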

YAU:

The performance problem w/ the regex is directly addressed and explained quite well by TheDamian in his must-read book Perl Best Practices. I almost want to keep the secret 'cause I feel so strongly that every Perl programmer should have this book. But... the extensive tracking and backtracking from .* seems to be the problem. Read about it in 'Unconstrained Repetitions' on p. 250.
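One common way to constrain a repetition like this is to replace the lazy .*? and its lookahead with a pattern that can only consume characters that do not start a new record. This is an illustrative sketch on made-up data, not the book's own example, and it has not been timed against the real 327M input:

```perl
use strict;
use warnings;

# Made-up sample; note the single '#' inside the first record.
my $sample = "##REPORT one #1\n 02 detail\n##REPORT two\n 02 more\n";

# '##', then runs of non-'#' characters, allowing a lone '#'
# only when it is not followed by a second '#'.
my @records;
while ( $sample =~ /(##[^#]*(?:#(?!#)[^#]*)*)/g ) {
    push @records, $1;
}

print "[$_]\n" for @records;
```

Unlike the original pattern, this one stops only at a literal ## or at the very end of the string, so the last record keeps its trailing newline.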

Thanks for taking a look...


In reply to Multi-line Regex Performance by pboin
