pboin has asked for the wisdom of the Perl Monks concerning the following question:
I've got a rather large (327M) file of multi-line records to process. I'm picking up each record, pulling the key out, and then looking it up in a hash to see if it's a 'keeper' or not.
Records look like this:
##REPORT 01 A ARCUR1GMU# 00112106 F N ARCUR1 ARCU +R1 P 02 AR CURRENCY CONVERSION CONTROL - (BY PTR) + N 03 FROM FVENDAP1 04 CODE: 10 RECORDING & REPORTING GMU / NON-WHQ ONLY RE +PORTS 07 00112704 00200103 11 R 3 RD.RDSMODEL C 00 +08 0001 12 A 0007 000 000 00 +0 000 13 VDLTP02 14 S N VDLTB02 ##REPORT 01 A ARCUR10000 00112106 F N ARCUR1 ARCU +R1 P 02 AR CURRENCY CONVERSION CONTROL - (BY PTR) + N 07 00112704 00200103 11 R 3 RD.RDSMODEL C 00 +08 0001 12 A 0007 000 000 00 +0 000 13 VDLTP02 14 S N VDLTB02
I've got a regex that technically works, but the performance is very poor (over 1 sec. per record). I wondered about slurping the whole file in, but I'm not swapping at all, so that 327M is all in memory -- shouldn't be a problem, right? My best hunch is that the regex is smarter than I am, and it is doing some heavy-duty backtracking that I'm not understanding.
I think the regex starts capturing at double-hashes, then continues non-greedily matching anything (including newlines) until a positive lookahead sees either a) more double-hashes, denoting a new record, or b) EOF.
My code looks like this:
    #!/usr/bin/perl
    use strict;

    # input file of ##REPORT cards
    my $str = do { local $/ = undef; <STDIN> };

    # keepers
    my %keeplist;
    open (RIDS, '<', 'rids_that_have_versions.txt') or die;
    while (<RIDS>) { chomp; $keeplist{$_} = 1; }

    my @cards;
    while ( $str =~ /(##.*?)(?=(##|\Z))/gs ) {
        my $rid = substr($1, 19, 10);
        print "$rid: " . ($keeplist{$rid} ? 'y' : 'n') . "\n";
    }
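To check the reading of the regex given above, here is a minimal sketch that runs the same pattern over a tiny in-memory string. The sample text and the `split_cards` helper are made up for illustration; only the pattern itself comes from the post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical two-record sample; real records are much longer.
my $sample = "##REPORT\n01 A first\n02 more\n##REPORT\n01 A second\n";

# Same pattern as in the post: capture from "##" up to, but not
# including, the next "##" or the end of the string.
sub split_cards {
    my ($text) = @_;
    my @cards;
    while ( $text =~ /(##.*?)(?=(##|\Z))/gs ) {
        push @cards, $1;
    }
    return @cards;
}

my @cards = split_cards($sample);
printf "%d cards\n", scalar @cards;
print "[$_]\n" for @cards;
```

On small input the pattern does split at each double-hash as described. One detail: `\Z` also matches just before a trailing newline, so the last card comes back without its final newline.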
Update:
This is definitely a non-linear problem...
| Lines | Seconds |
|---|---|
| 20k | 7 |
| 40k | 25 |
| 80k | 102 |
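To put a number on "non-linear": each doubling of the input multiplies the run time by about 3.6 and then 4.1, which is close to quadratic growth. The sketch below estimates the exponent k in time ~ lines**k from the three timings above. The `growth_exponents` helper is my own name, and three data points are only a rough indication.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Timings from the table above: [lines, seconds].
my @runs = ( [ 20_000, 7 ], [ 40_000, 25 ], [ 80_000, 102 ] );

# Estimate the exponent k in time ~ lines**k between consecutive runs.
sub growth_exponents {
    my @r = @_;
    my @k;
    for my $i ( 1 .. $#r ) {
        my ( $n1, $t1 ) = @{ $r[ $i - 1 ] };
        my ( $n2, $t2 ) = @{ $r[$i] };
        push @k, log( $t2 / $t1 ) / log( $n2 / $n1 );
    }
    return @k;
}

printf "k = %.2f\n", $_ for growth_exponents(@runs);
```

An exponent near 1 would mean linear scaling. Both estimates here come out near 2.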
Another Update:
Prompted by some insightful responses, I decided to minimize my dependence on regex in this case. I decided to buffer the file manually, and got the 80k line test time down to just over 1.2 seconds.
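    my $buffer = '';
    while (<STDIN>) {
        if ( /^##/ && length $buffer ) {
            # A new record starts here, so $buffer holds one complete record.
            my $rid = substr($buffer, 19, 10);
            $rid =~ s/\s+$//;    # trim trailing padding from the key
            print $buffer if $keeplist{$rid};
            $buffer = '';
        }
        $buffer .= $_;
    }
    # Note: the last record is still in $buffer here and needs the same check.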
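For anyone who wants to try the buffering approach without the real data, here is a self-contained sketch that reads from an in-memory filehandle and also flushes the final record. The `filter_records` and `make_record` helpers and the sample records are hypothetical. The key position (offset 19, length 10) is taken from the code above, and the sample layout is built to match it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the records whose key appears in %$keep.
# Offset 19 / length 10 follow the code in the post.
sub filter_records {
    my ( $fh, $keep ) = @_;
    my @kept;
    my $buffer = '';
    my $flush  = sub {
        return unless length($buffer) > 19;
        my $rid = substr( $buffer, 19, 10 );
        $rid =~ s/\s+$//;    # trim trailing padding from the key
        push @kept, $buffer if $keep->{$rid};
    };
    while ( my $line = <$fh> ) {
        if ( $line =~ /^##/ ) {    # a new record begins
            $flush->();
            $buffer = '';
        }
        $buffer .= $line;
    }
    $flush->();    # the last record has no following "##" line
    return @kept;
}

# Build a made-up record with the key starting at offset 19.
sub make_record {
    my ($rid) = @_;
    return "##REPORT\n" . sprintf( "%-10s%-10s", "01 A", $rid ) . " 00112106\n02 TITLE\n";
}

my $data = join '', map { make_record($_) } ( 'ARCUR1GMU#', 'ARCUR10000', 'OTHER' );
open my $fh, '<', \$data or die "cannot open in-memory handle: $!";
my @kept = filter_records( $fh, { 'ARCUR10000' => 1, 'OTHER' => 1 } );
print scalar(@kept), " records kept\n";
```

Each line is read once and each record is handled once, so the work should grow linearly with the size of the file.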
YAU:
The performance problem w/ the regex is directly addressed and explained quite well by TheDamian in his must-read book Perl Best Practices. I almost want to keep the secret 'cause I feel so strongly that every Perl programmer should have this book. But... the extensive tracking and back-tracking from .* seems to be the problem. Read about it in 'Unconstrained Repetitions' on p. 250.
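One way to avoid the unconstrained repetition, if you still want the whole file in memory, is to skip the `.*` entirely and `split` at each line that begins with a double-hash. This is my own sketch, not the book's recipe. It assumes every record starts with `##` at the beginning of a line, the `split_records` name is made up, and I have not timed it against the 327M file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split slurped text into records at each line start followed by "##".
# The lookahead is zero-width, so each "##" stays with its record.
# Anything before the first "##" line is dropped by the grep.
sub split_records {
    my ($text) = @_;
    return grep { /^##/ } split /^(?=##)/m, $text;
}

my $text    = "##REPORT\n01 A first\n##REPORT\n01 A second\n";
my @records = split_records($text);
print scalar(@records), " records\n";
```

Unlike the original pattern, this keeps the trailing newline on the last record.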
Thanks for taking a look...
Replies are listed 'Best First'.

Re: Multi-line Regex Performance
by BrowserUk (Patriarch) on Nov 01, 2005 at 15:52 UTC

Re: Multi-line Regex Performance
by japhy (Canon) on Nov 01, 2005 at 15:45 UTC

Re: Multi-line Regex Performance
by japhy (Canon) on Nov 01, 2005 at 16:00 UTC

Re: Multi-line Regex Performance
by ikegami (Patriarch) on Nov 01, 2005 at 15:48 UTC

Re: Multi-line Regex Performance
by sauoq (Abbot) on Nov 01, 2005 at 15:45 UTC
by pboin (Deacon) on Nov 01, 2005 at 15:52 UTC
by sauoq (Abbot) on Nov 01, 2005 at 16:27 UTC