
I've got a rather large (327M) file of multi-line records to process. For each record, I pull the key out and look it up in a hash to see if it's a 'keeper' or not.

Records look like this:

##REPORT      01 A ARCUR1GMU# 00112106 F               N ARCUR1   ARCU
+R1     P
              02 AR CURRENCY CONVERSION CONTROL - (BY PTR)            
+        N
              03 FROM FVENDAP1
              04 CODE: 10  RECORDING & REPORTING GMU / NON-WHQ ONLY RE
+PORTS
              07 00112704 00200103
              11 R 3 RD.RDSMODEL                                  C 00
+08 0001
              12                                     A 0007 000 000 00
+0 000
              13 VDLTP02
              14 S N   VDLTB02
##REPORT      01 A ARCUR10000 00112106 F               N ARCUR1   ARCU
+R1     P
              02 AR CURRENCY CONVERSION CONTROL - (BY PTR)            
+        N
              07 00112704 00200103
              11 R 3 RD.RDSMODEL                                  C 00
+08 0001
              12                                     A 0007 000 000 00
+0 000
              13 VDLTP02
              14 S N   VDLTB02

I've got a regex that technically works, but the performance is very poor (over 1 sec. per record). I wondered whether slurping the whole file in was the problem, but I'm not swapping at all, so that 327M is all in memory -- shouldn't be a problem, right? My best hunch is that the regex is smarter than I am, and it is doing some heavy-duty backtracking that I don't understand.

I think the regex starts capturing at a double hash, then non-greedily matches anything (including newlines) until a positive lookahead finds either a) another double hash, denoting a new record, or b) EOF.

My code looks like this:

#!/usr/bin/perl
use strict;

# input file of ##REPORT cards
my $str = do {local $/ = undef; <STDIN>};

# keepers
my %keeplist;
open (RIDS, '<', 'rids_that_have_versions.txt') or die;
while (<RIDS>) { chomp; $keeplist{$_} = 1;}

my @cards;

while (   $str =~ /(##.*?)(?=(##|\Z))/gs )  {
  my $rid = substr($1, 19, 10);
  print "$rid: " . ($keeplist{$rid} ? 'y' : 'n') . "\n";
}
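
To make the pieces of that pattern explicit, here is a minimal sketch of the same regex written with /x and run against a tiny made-up sample (the sample string and the @records array are illustrative, not part of the real data). Note that \Z can match just before a final newline, so the last record comes back without its trailing newline.

```perl
use strict;
use warnings;

# Made-up two-record sample; real records are much longer.
my $sample = "##REC one\n  02 detail\n##REC two\n  02 detail\n";

my @records;
while (
    $sample =~ m{
        ( \#\# .*? )       # capture from a double hash, lazily, across newlines (/s)
        (?= \#\# | \Z )    # stop just before the next double hash or at end of input
    }gsx
) {
    push @records, $1;
}

print scalar(@records), " records found\n";
```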

Update:

This is definitely a non-linear problem...

Lines   Seconds
20k       7
40k      25
80k     102
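
As a rough check on "non-linear", the sketch below estimates the exponent k in time ~ lines**k from each pair of consecutive timings. It assumes the three measurements above are representative, and the variable names are made up for illustration. Each doubling of the input roughly quadruples the run time, which points towards quadratic growth.

```perl
use strict;
use warnings;

# Timings quoted above, as [lines, seconds].
my @runs = ( [ 20_000, 7 ], [ 40_000, 25 ], [ 80_000, 102 ] );

# For consecutive runs, solve t1/t0 = (n1/n0)**k for k.
my @exponents;
for my $i ( 1 .. $#runs ) {
    my ( $n0, $t0 ) = @{ $runs[ $i - 1 ] };
    my ( $n1, $t1 ) = @{ $runs[$i] };
    push @exponents, log( $t1 / $t0 ) / log( $n1 / $n0 );
}

printf "estimated exponent: %.2f\n", $_ for @exponents;
```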

Another Update:

Prompted by some insightful responses, I decided to minimize my dependence on regex in this case. I decided to buffer the file manually, and got the 80k line test time down to just over 1.2 seconds.

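my $buffer = '';
while (<STDIN>) {
    # A line starting with ## begins a new record, so flush the previous one.
    if ( /^##/ ) {
        print_if_keeper($buffer);
        $buffer = '';
    }
    $buffer .= $_;
}
print_if_keeper($buffer);    # the last record has no ## line after it

sub print_if_keeper {
    my ($record) = @_;
    return unless length $record;    # nothing buffered before the first ##
    my $rid = substr($record, 19, 10);
    $rid =~ s/\s+$//;
    print $record if $keeplist{$rid};
}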
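
A different way to get the same record-at-a-time effect, sketched here as an alternative rather than as what I actually ran, is to let Perl do the buffering by setting the input record separator $/ to "\n##". The keepers helper and its arguments are hypothetical names, and the sketch assumes the input starts with a ## line and that every record is at least 29 characters long.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the records whose id (10 characters starting
# at offset 19 of the record) is a key in the %$keep hash.
sub keepers {
    my ( $fh, $keep ) = @_;
    local $/ = "\n##";    # a record ends where the next "##" line begins
    my @kept;
    my $first = 1;
    while ( my $rec = <$fh> ) {
        $rec .= "\n" if chomp $rec;       # chomp strips the "\n##" separator; put the newline back
        $rec = "##$rec" unless $first;    # later records lost their leading "##" to the separator
        $first = 0;
        next if length($rec) < 29;        # too short to hold an id
        ( my $rid = substr( $rec, 19, 10 ) ) =~ s/\s+$//;
        push @kept, $rec if $keep->{$rid};
    }
    return @kept;
}

# Usage, assuming %keeplist has been loaded as in the code above:
#   print for keepers( \*STDIN, \%keeplist );
```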

Yet Another Update:

The performance problem with the regex is directly addressed and explained quite well by TheDamian in his must-read book Perl Best Practices. I almost want to keep the secret 'cause I feel so strongly that every Perl programmer should have this book. But... the extensive tracking and backtracking from .* seems to be the problem. Read about it in 'Unconstrained Repetitions' on p. 250.
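
For anyone who still wants the whole file in memory, one way to sidestep the lazy .*? altogether is to split on a zero-width lookahead at each line that starts with ##. This is only a sketch on a made-up sample; it is not taken from the book, and I have not timed it against the real 327M file.

```perl
use strict;
use warnings;

# Made-up two-record sample standing in for the slurped file.
my $str = "##REC one\n  02 detail\n##REC two\n  02 detail\n";

# Split at every line start that is followed by "##". The lookahead
# consumes nothing, so each record keeps its own "##" prefix.
my @cards = split /^(?=##)/m, $str;

print scalar(@cards), " cards\n";
```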

Thanks for taking a look...

In reply to Multi-line Regex Performance by pboin
