Performance challenges

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am a Perl newbie and unsure whether Perl or awk is the right solution for my problem. Any advice will be much appreciated.

I am working on a tab-delimited ASCII file with about 20 million records. Fields 15 and 16 of each record must be of fixed length:
Field 15 -> always 5 characters
Field 16 -> always 7 characters

But there are a few bad records that don't meet this condition. My task is to filter these bad records into a separate file.

What is the most efficient way of determining this?
Note: When I ran a simple query (using awk) to find a specific primary key it took over 20 minutes to find the record.

-Kris

Re: Performance challenges by dragonchild (Archbishop) on Mar 22, 2006 at 12:21 UTC
If the regex solution provided by Melly isn't as fast as you need, you might try: `open( IN, '<', "Your_data_here" ) or die $!; open( GOOD, '>', "Good_file_here" ) or die $!; open( BAD, '>', "Bad_file_here" ) or die $!; while (<IN>) { my @row = split "\t", $_; if ( length($row[14]) == 5 && length($row[15]) == 7 ) { print GOOD $_; next; } print BAD $_; } close BAD; close GOOD; close IN;` Note: awk processing a million rows/minute probably isn't that bad. I'm not sure Perl is going to be much faster. This is a very I/O-bound activity. My criteria for good software: Does it work? Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
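For readers who want to try the split approach, here is a minimal sketch of the same idea with lexical filehandles and error checks. The names `is_good_record` and `filter_file` are hypothetical, and the file names are placeholders just as in the post. One assumption goes beyond the original: the line ending is stripped before measuring, so that field 16 is not counted one character too long when it happens to be the last field on the line.

```perl
use strict;
use warnings;

# Hypothetical helper: true when fields 15 and 16 (1-based) have the
# required lengths of 5 and 7 characters.
sub is_good_record {
    my ($line) = @_;
    # Work on a copy without the line ending, so a trailing newline
    # cannot inflate the length of the last field.
    (my $copy = $line) =~ s/\r?\n\z//;
    my @row = split /\t/, $copy, -1;    # -1 keeps trailing empty fields
    return 0 if @row < 16;              # too few fields: treat as bad
    return ( length( $row[14] ) == 5 && length( $row[15] ) == 7 ) ? 1 : 0;
}

# Hypothetical driver: copies each input line, unchanged, to either the
# good file or the bad file.
sub filter_file {
    my ( $in_name, $good_name, $bad_name ) = @_;
    open my $in,   '<', $in_name   or die "Cannot read $in_name: $!";
    open my $good, '>', $good_name or die "Cannot write $good_name: $!";
    open my $bad,  '>', $bad_name  or die "Cannot write $bad_name: $!";
    while ( my $line = <$in> ) {
        print { is_good_record($line) ? $good : $bad } $line;
    }
    close $good or die "Cannot close $good_name: $!";
    close $bad  or die "Cannot close $bad_name: $!";
    close $in;
    return;
}
```

A call such as `filter_file('Your_data_here', 'Good_file_here', 'Bad_file_here')` would then do the whole job in one pass. Whether records with fewer than 16 fields should count as bad is a guess about the data; adjust the early `return 0` if that does not fit.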
Re^2: Performance challenges by Eimi Metamorphoumai (Deacon) on Mar 22, 2006 at 18:24 UTC
One suggestion: if each record has more than 16 fields, you might find slightly better performance with `my @row = split /\t/, $_, 17;` which tells perl to split into at most 17 fields (elements 0 to 15, leaving the unsplit trailing data in element 16).
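The effect of the limit can be seen on a small made-up record. This is only an illustration: the 20-field line and the `f1` .. `f20` values are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical 20-field record; only fields 15 and 16 matter here.
my $line = join( "\t", map { "f$_" } 1 .. 20 ) . "\n";

my @all     = split /\t/, $line;        # splits out every field
my @limited = split /\t/, $line, 17;    # stops after the 16th separator

# Element 16 of @limited holds the rest of the line, still unsplit.
printf "%d vs %d elements\n", scalar @all, scalar @limited;
print "field 15: $limited[14], field 16: $limited[15]\n";
```

Note that when a record has exactly 16 fields, the limit changes nothing and the last element still carries the newline, so its length would need adjusting before the comparison.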
Re^2: Performance challenges by Anonymous Monk on Mar 22, 2006 at 13:31 UTC
Thanks very much, folks! I much appreciate the fast response. dragonchild: I had a script similar to the one you wrote here, but I was not sure whether it was the most efficient one. Sounds like it is. Thanks again! :) -Kris
Re: Performance challenges by Melly (Chaplain) on Mar 22, 2006 at 12:09 UTC
Untested: `open(INPUT, 'original.txt'); open(GOOD, ">good.txt"); open(BAD, ">bad.txt"); while(<INPUT>){ if(/^([^\t]*\t){14}[^\t]{5}(\t)[^\t]{7}(\t)/){ print GOOD; } else{ print BAD; } } close INPUT; close GOOD; close BAD;` Tom Melly, tom@tomandlu.co.uk
Re^2: Performance challenges by salva (Canon) on Mar 22, 2006 at 12:17 UTC
It will be much faster if non-capturing groups are used in the regexp, since perl then does not have to record a capture on every repetition: `/^(?:[^\t]*\t){14}[^\t]{5}\t[^\t]{7}\t/`
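Rather than take the speed difference on faith, it can be measured with the core `Benchmark` module. The sketch below is a rough illustration, not a definitive test: the sample record and the name `compare_speed` are hypothetical. Both patterns skip 14 fields so that the two tested fields are the 15th and 16th, matching `$row[14]` and `$row[15]` in the split version. As an added assumption, they also accept field 16 as the last field on a line that ends in a plain newline.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# The patterns differ only in the group type. [^\t\n] keeps the line's
# trailing newline from being counted as part of field 16.
my $capturing     = qr/^([^\t]*\t){14}[^\t]{5}\t[^\t\n]{7}(?=[\t\n]|\z)/;
my $non_capturing = qr/^(?:[^\t]*\t){14}[^\t]{5}\t[^\t\n]{7}(?=[\t\n]|\z)/;

# Hypothetical 20-field record with valid fields 15 and 16.
my $sample = join( "\t", ('x') x 14, 'abcde', 'abcdefg', ('y') x 4 ) . "\n";

# Runs each match for roughly $cpu_seconds of CPU time and prints a
# comparison table; the figures depend on the perl build and the data.
sub compare_speed {
    my ($cpu_seconds) = @_;
    cmpthese(
        -$cpu_seconds,
        {
            capturing     => sub { my $hit = $sample =~ $capturing },
            non_capturing => sub { my $hit = $sample =~ $non_capturing },
        }
    );
    return;
}
```

Calling `compare_speed(2)` prints the relative rates. Since the job is largely I/O-bound, a difference seen here may not carry over to the full 20-million-record run.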