Multi-line Regex Performance

pboin has asked for the wisdom of the Perl Monks concerning the following question:

I've got a rather large (327M) file of multiple line records to process. I'm picking up each record, pulling the key out, and then looking into a hash to see if it's a 'keeper' or not.

Records look like this:

##REPORT      01 A ARCUR1GMU# 00112106 F               N ARCUR1   ARCU
+R1     P
              02 AR CURRENCY CONVERSION CONTROL - (BY PTR)            
+        N
              03 FROM FVENDAP1
              04 CODE: 10  RECORDING & REPORTING GMU / NON-WHQ ONLY RE
+PORTS
              07 00112704 00200103
              11 R 3 RD.RDSMODEL                                  C 00
+08 0001
              12                                     A 0007 000 000 00
+0 000
              13 VDLTP02
              14 S N   VDLTB02
##REPORT      01 A ARCUR10000 00112106 F               N ARCUR1   ARCU
+R1     P
              02 AR CURRENCY CONVERSION CONTROL - (BY PTR)            
+        N
              07 00112704 00200103
              11 R 3 RD.RDSMODEL                                  C 00
+08 0001
              12                                     A 0007 000 000 00
+0 000
              13 VDLTP02
              14 S N   VDLTB02
[download]

I've got a regex that technically works, but the performance is very poor (over 1 sec. per record). I wondered about slurping the whole file in, but I'm not swaping at all, so that 327M is all in memory -- shouldn't be a problem, right?. My best hunch is that the regex is smarter than I am, and it is doing some heavy-duty backtracking that I'm not understanding.

I think the regex is starting to capture at double-hashes. Then, continue non-greedy matching anything (including newline), until a positive-lookahead of either a) more double-hashes, denoting a new record or b) EOF.

My code looks like this:

#!/usr/bin/perl
use strict;

# input file of ##REPORT cards
my $str = do {local $/ = undef; <STDIN>};

# keepers
my %keeplist;
open (RIDS, '<', 'rids_that_have_versions.txt') or die;
while (<RIDS>) { chomp; $keeplist{$_} = 1;}

my @cards;

while (   $str =~ /(##.*?)(?=(##|\Z))/gs )  {
  my $rid = substr($1, 19, 10);
  print "$rid: " . ($keeplist{$rid} ? 'y' : 'n') . "\n";
}
[download]

Update:

This is definitely a non-linear problem...

Lines Seconds

20k 7 s.

40k 25 s.

80k 102 s.

Another Update:

Prompted by some insightful responses, I decided to minimize my dependence on regex in this case. I decided to buffer the file manually, and got the 80k line test time down to just over 1.2 seconds.

while (<STDIN>) {
    if ( /^##/ ) {
        my $rid = substr($buffer, 19, 10);
        $rid =~ s/\s*$//g;
        print $buffer if $keeplist{$rid};
        $buffer = '';
    }
    $buffer .= $_;
}
[download]

YAU:

The performance problem w/ the regex is directly addressed and explained quite well by TheDamian in his must-read book Perl Best Practices. I almost want to keep the secret 'cause I feel so strongly that every perl programer should have this book. But... The extenstive tracking and back-tracking from .* seems to be the problem. Read about it in 'Unconstrained Repetitions' on p. 250.

Thanks for taking a look...

Comment on Multi-line Regex Performance Select or Download Code

Replies are listed 'Best First'.
Re: Multi-line Regex Performance by BrowserUk (Patriarch) on Nov 01, 2005 at 15:52 UTC
You could avoid the regex engine entirely, and also the risk of swapping should your file(s) get bigger, by setting `$/ = "##";` and reading the file one multi-line record at a time: #! perl -slw use strict; while( $_ = do{ local $/ = "##REPORT"; <DATA> } ) { next if $. == 1; ##skip first empty record. print substr $_, 11, 10; } __DATA__ ##REPORT 01 A ARCUR1GMU# 00112106 F N ARCUR1 ARCU +R1 P 02 AR CURRENCY CONVERSION CONTROL - (BY PTR) + N 03 FROM FVENDAP1 04 CODE: 10 RECORDING & REPORTING GMU / NON-WHQ ONLY RE +PORTS 07 00112704 00200103 11 R 3 RD.RDSMODEL C 00 +08 0001 12 A 0007 000 000 00 +0 000 13 VDLTP02 14 S N VDLTB02 ##REPORT 01 A ARCUR10000 00112106 F N ARCUR1 ARCU +R1 P 02 AR CURRENCY CONVERSION CONTROL - (BY PTR) + N 07 00112704 00200103 11 R 3 RD.RDSMODEL C 00 +08 0001 12 A 0007 000 000 00 +0 000 13 VDLTP02 14 S N VDLTB02 [download] This PBP-rated -4 code example produces `P:\test>junk2 ARCUR1GMU# ARCUR10000` [download] And should run considerably more quickly. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l] [select]
Re: Multi-line Regex Performance by japhy (Canon) on Nov 01, 2005 at 15:45 UTC
Your regex is awfully slow. That's what it comes down to. You're inching forward character-by-character until you come to '##' or the end of the string. The regex could be made amazingly faster by just making this one change: `$str =~ /(##(?:.\n))(?=##\|\Z)/g` [download] That reads "lines" at a time until it gets to '##' or \Z. Another way you could write it (to be very fast) is: `$str =~ /(##(?:[^#]+\|#[^#])/g` [download] That doesn't even need the look-ahead. Jeff `japhy` Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and `perl` hacker How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart*	[reply] [d/l] [select]
Re: Multi-line Regex Performance by japhy (Canon) on Nov 01, 2005 at 16:00 UTC
Here's another way to do the task which doesn't require reading the entire file into memory. `#!/usr/bin/perl use strict; use warnings; my %keeplist; open my $rids, '<', 'rids_that_have_versions.txt' or die $!; while (<$rids>) { chomp; $keeplist{$_} = 1;} close $rids; while (<STDIN>) { next unless /^##/; # only deal with '##' lines my $rid = substr($_, 19, 10); print "$rid: " . ($keeplist{$rid} ? 'y' : 'n') . "\n"; }` [download] Jeff `japhy` Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and `perl` hacker How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart	[reply] [d/l]
Re: Multi-line Regex Performance by ikegami (Patriarch) on Nov 01, 2005 at 15:48 UTC
`$str =~ /\G(##.?)(?=(##\|\Z))/gs` might be faster, but it's probably the same. `$str =~ /\G(##(?:(?!$##).))/gs` is something to try. Taking advantage of the fact that '##' is at the begining of a line might help greatly. But you don't even need a regexp. Use index, which is lightyears faster: `my $sep = '##'; my $sep_len = length($sep); my $last_pos = index($str, $sep); for (;;) { my $pos = index($str, '##', $last_pos + $sep_len); my $rec; if ($pos < 0) { $rec = substr($str, $last_pos); last if not length $rec; } else { $rec = substr($str, $last_pos, $pos-$last_pos); $last_pos = $pos; } my $rid = substr($rec, 19, 10); print "$rid: " . ($keeplist{$rid} ? 'y' : 'n') . "\n"; }` [download] You can even avoid copying stuff into $rec: `my $sep = '##'; my $sep_len = length($sep); my $last_pos = index($str, $sep); for (;;) { my $pos = index($str, '##', $last_pos + $sep_len); my $rec_pos = $last_pos; if ($pos < 0) { last if $last_pos == length($str); $last_pos = length($str); } else { $last_pos = $pos; } my $rid = substr($str, $rec_pos+19, 10); print "$rid: " . ($keeplist{$rid} ? 'y' : 'n') . "\n"; }` [download] Update: Added code. Untested.	[reply] [d/l] [select]
Re: Multi-line Regex Performance by sauoq (Abbot) on Nov 01, 2005 at 15:45 UTC
Can you tell us what is supposed to be the hash key in this input? It's impossible to tell by reading the code without knowing if there are tabs in your whitespace. And, are these fixed length records? (Or even a fixed number of lines?) If there are no hashes in the records except the doubled ones, your regex could probably be as simple as `/(##[^#])/g`. (And it would be a lot* faster.) -sauoq "My two cents aren't worth a dime.";	[reply] [d/l]
Re^2: Multi-line Regex Performance by pboin (Deacon) on Nov 01, 2005 at 15:52 UTC
The key for checking whether one of the multi-line records is a keeper or not is at position 19 for a length of ten on the first line (segment '01' in our parlance.) There could be hash characters in some of the fields -- many of them are freeform and take addresses, comments, etc. There will not be any in the first position though, other than the ones that denote new records. Thanks sauoq.	[reply]
Re^3: Multi-line Regex Performance by sauoq (Abbot) on Nov 01, 2005 at 16:27 UTC
The key for checking whether one of the multi-line records is a keeper or not is at position 19 for a length of ten on the first line This doesn't really tell us anything that the `substr()` hadn't already told us. The thing is, we can't reliably count whitespace in the data you provided. But anyway, since what you want is always on that first line, it's probably a lot easier (and more efficient than using a regular expression) to just read the data line by line ignoring lines that don't match `/^##/` and doing what you want with the ones that do. This would have the added benefit of not keeping 300+MB in RAM. `while (<>) { next unless /^##/; my $key = substr $_, 19, 10; do_stuff_with($key); }` [download] -sauoq "My two cents aren't worth a dime.";	[reply] [d/l] [select]