Re: Fast file parsing
by BrowserUk (Patriarch) on Mar 05, 2004 at 18:34 UTC
The first thing to do is drop all the /g options on your regexes. You only need to know if the pattern exists, not whether it exists more than once, unless you are going to do something with the latter information, which you aren't currently. That could save some processing.
You could save some more time by not performing the checks for "Legal" or "Tabloid" if you have already found "Letter". The same goes for the other categories. That ought to cut the processing time by around half (a guess!).
If you order the various types by the most frequent usage, it might save a bit more.
Finally, if you have 60+ MB of RAM to spare, you might save some time by slurping the file into a scalar and then running your regexes against that. If you do this, make sure that you don't use the /g option or apply more regexes than you need to. (i.e. no Duplex check if you already found Simplex, etc.)
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Re: Fast file parsing
by arden (Curate) on Mar 05, 2004 at 18:36 UTC
My first thought is: after you've answered all of your questions (#pages, simplex/duplex, pagesize, copies, title), why do you keep looking through the file? Why not do something like:

my $more = 5;
while ( $more && <FILE> ) {
    . . .
    # $more-- with each successful assignment, so $more will be 0 when all
    # five variables are known
    . . .
}
If I remember correctly, everything you need from your PostScript file is in the first few lines, right?
- - arden.
You could also exit the while loop with a "last" statement when you get them all. That might be easier to read.
Re: Fast file parsing
by kvale (Monsignor) on Mar 05, 2004 at 18:43 UTC
Since you are just looking for DSC lines, I would speed things up by skipping the loop body unless the line is one of those statements:
while (<FILE>){
next unless /^%%/;
...
}
Re: Fast file parsing
by matija (Priest) on Mar 05, 2004 at 18:48 UTC
I don't know enough about printing setups to see if the information could be obtained elsewhere, so let's see if we can speed up the parsing.
I don't have any example files to see if my assumptions are sound, so I will try to make them explicit, and you can discard the advice that is rendered invalid by the incorrect ones.
First of all, I assume that the "global" data (duplex/simplex, pagesize, document title) only appears at the start of the file, and not somewhere in the middle.
So you could wrap those tests with:

unless ($pagestate && $pagesize && defined($title)) {
    # your tests
}
That way you skip almost half the checks for most of the file.
That leaves checks for number of pages, number of copies (oops, maybe that fits into the first section: the ones that only appear at the start).
Since everything (except the QTY=(\d+) match for the number of copies) starts with a PostScript comment, you can wrap all those tests with:

if (/^%%/) {
    # your checks
}
That skips all the lines which aren't PS comments and therefore can't contain any data of interest to your script.
In the worst case, only the QTY=(\d+) check needs to look at all the lines of the file, and if that's a header notation, not even that.
Oh, and one more thing: unless the last value for pages you find in the file is the applicable one, you might add:

last if (defined($doctitle) && $copies && $pagesize && $pagestate);
With that, the code will stop reading the file as soon as it has all the data it needs.
Re: Fast file parsing
by Vautrin (Hermit) on Mar 05, 2004 at 18:42 UTC
If I were you, I would benchmark my code on some corpus of data (preferably as close as possible to what I would use in production). Check out Benchmark::Timer from CPAN, or do a search for benchmarking. Then, when you find the slow sections of your algorithm, either optimize them in Perl, or use a module like Inline::CPP to speed up your code. (There are a bunch of Inline:: modules in case you want to inline C, assembler, Java, or whatever.)
Want to support the EFF and FSF by buying cool stuff? Click here.
Re: Fast file parsing
by Anonymous Monk on Mar 05, 2004 at 21:10 UTC
Hey wow!
Thanks for help. I appreciate the suggestions/tips a great deal. I took most of the advice offered up here, and came up with the following:
open FILE, $ARGV[0] or die "Can't open $ARGV[0]: $!";
while (<FILE>){
    if (/^%%EndSetup.*/){
        $doctitle  = "Unknown Document Title" unless defined($doctitle);
        $pagestate = "Simplex" unless defined($pagestate);
        $pagesize  = "Letter"  unless defined($pagesize);
        $copies    = 1         unless defined($copies);
    }

    # Document Title
    unless (defined($doctitle)){
        chomp ($doctitle = $1) if (/^%%Title: (.*)$/);
    }

    # Duplex or no...
    unless (defined($pagestate)){
        $pagestate = "Duplex"  if (/^%%.* \*Duplex Duplex.*/ || /^\[&l1S/i
                                   || /^\[&l2S/i || /DUPLEX=ON/);
        $pagestate = "Simplex" if (/^%%.* \*Duplex None.*/ || /^\[&l1S/i
                                   || /DUPLEX=OFF/);
    }

    # Page Size
    unless (defined($pagesize)){
        $pagesize = "Letter"  if (/^%%.* Letter/  || /^\[&l2A/i || /PAPER=LETTER/);
        $pagesize = "Legal"   if (/^%%.* Legal/   || /^\[L3A/i || /PAPER=LEGAL/);
        $pagesize = "Tabloid" if (/^%%.* Tabloid/ || /^\[&l2000A/i || /PAPER=LEDGER/);
    }

    # Number of copies
    unless (defined($copies)){
        $copies = $1 if (/^%%.* numcopies (\d+).*/i || /.*QTY=(\d+).*/);
    }

    # Number of Pages
    unless (defined($pages)){
        $pages = $1 if (/^%%Pages:\s+(\d+).*/);
    }
}
close FILE;
I got parse times down from 90 seconds to 1-2 seconds on a 42MB test file. I appreciate the help a great deal.
Y'all are awesome.
Louie
Re: Fast file parsing
by perrin (Chancellor) on Mar 05, 2004 at 20:26 UTC
It's faster to use index() when looking for exact matches. For example, I would change your first section to this:
# Only care about comments
next unless ( index( $_, '%%' ) == 0 );

# Duplex or no...
if ( ( index( $_, '%%' ) == 0 && index( $_, 'Duplex Duplex' ) > -1 )
     || /^\[&l1S/i
     || /^\[&l2S/i
     || index( $_, 'DUPLEX=ON' ) > -1 ) {
    $pagestate = 'Duplex';
}
You should make sure that the most common situations are the ones that come first in that list of || conditions. Also, for large files it can be helpful to read in chunks bigger than one line. I found this node sped things up a lot when I was doing something similar.
It's faster to use index() when looking for exact matches.
That ain't necessarily so, particularly when anchored: /^%%/ only needs to check the start of the string, whereas index($_, '%%') needs to scan to the first match, possibly through the entire string.
Update: of course, using substr and eq is a much more obvious way to check this, and I've added some options to the code below to reflect that. /Update
Try this:
#!/usr/bin/perl -w
use Benchmark qw/ cmpthese /;
my $count = shift;
our $a = '%%';
our $b = ' ' x 10000;
cmpthese($count, {
first_re => q{ $match = "$a$b" =~ /^%%/ },
last_re => q{ $match = "$b$a" =~ /^%%/ },
miss_re => q{ $match = "$b" =~ /^%%/ },
first_index => q{ $match = index("$a$b", "%%") == 0},
last_index => q{ $match = index("$b$a", "%%") == 0},
miss_index => q{ $match = index("$b" , "%%") == 0},
first_substr => q{ $match = substr("$a$b", 0, 2) eq '%%' },
last_substr => q{ $match = substr("$b$a", 0, 2) eq '%%' },
miss_substr => q{ $match = substr("$b" , 0, 2) eq '%%' },
})
Hugo
Good catch! The anchor makes all the difference, and the regex even beats substr in this case when I run it on my machine. So, index is better than regex for unanchored matches but much worse for anchored ones.