Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am a Perl newbie and unsure if perl or awk is the right solution for my problem. Any advice will be much appreciated.

I am working on a tab-delimited ASCII file with about 20 million records. Fields 15 and 16 of each record must be fixed-length:
Field 15 -> always 5 characters
Field 16 -> always 7 characters

But there are a few bad records that don't meet this condition. My task is to filter these bad records into a separate file.

What is the most efficient way of doing this?
Note: When I ran a simple query (using awk) to find a specific primary key it took over 20 minutes to find the record.

-Kris

Replies are listed 'Best First'.
Re: Performance challenges
by dragonchild (Archbishop) on Mar 22, 2006 at 12:21 UTC
    If the regex solution provided by Melly isn't as fast as you need, you might try:
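    open( IN,   '<', "Your_data_here" ) or die "Cannot read input: $!";
    open( GOOD, '>', "Good_file_here" ) or die "Cannot write good file: $!";
    open( BAD,  '>', "Bad_file_here" )  or die "Cannot write bad file: $!";
    while (<IN>) {
        # Fields 15 and 16 are elements 14 and 15 (zero-based).
        # If field 16 can be the last field on a line, its length includes the newline.
        my @row = split "\t", $_;
        if ( length($row[14]) == 5 && length($row[15]) == 7 ) {
            print GOOD $_;
            next;
        }
        print BAD $_;
    }
    close BAD;
    close GOOD;
    close IN;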
    Note: awk processing a million rows/minute probably isn't that bad. I'm not sure Perl is going to be much faster. This is a very I/O-bound activity.

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
      One suggestion: if each record has more than 16 fields, you might find slightly better performance with
      my @row = split /\t/, $_, 17;
      which tells perl to split into at most 17 fields (0 to 15, leaving the trailing data in 16).
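
A minimal sketch of what the limit does, assuming a made-up 18-field record (the field values `f1`, `ABCDE` and so on are hypothetical, not taken from the real file). It only illustrates that the two fields of interest are still split out individually, while everything after them stays in one unsplit chunk:

```perl
use strict;
use warnings;

# Hypothetical 18-field record: field 15 is "ABCDE", field 16 is "1234567".
my $line = join( "\t", ( map { "f$_" } 1 .. 14 ), 'ABCDE', '1234567', 'f17', 'f18' ) . "\n";

# With a limit of 17, perl stops splitting after the 16th tab;
# element 16 holds whatever is left, tabs and newline included.
my @row = split /\t/, $line, 17;

my $count = scalar @row;
my $ok    = length( $row[14] ) == 5 && length( $row[15] ) == 7;

print "elements: $count\n";
print "fields 15/16 ok: ", ( $ok ? "yes" : "no" ), "\n";
print "remainder: $row[16]";
```

Whether this saves a noticeable amount of time depends on how many fields follow field 16, so it is worth timing on a slice of the real data.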
      Thanks very much folks! The fast response is much appreciated.

      dragonchild: I had a script similar to the one you wrote here, but I was not sure whether it was the most efficient approach. Sounds like it is. Thanks again! :) -Kris
Re: Performance challenges
by Melly (Chaplain) on Mar 22, 2006 at 12:09 UTC

    Untested:

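    open(INPUT, '<', 'original.txt') or die $!;
    open(GOOD, '>', 'good.txt') or die $!;
    open(BAD, '>', 'bad.txt') or die $!;
    while(<INPUT>){
      # Skip 14 fields, then require a 5-character field 15 and a 7-character field 16.
      # The final \t assumes at least one more field follows field 16.
      if(/^([^\t]*\t){14}[^\t]{5}(\t)[^\t]{7}(\t)/){
        print GOOD;
      }
      else{
        print BAD;
      }
    }
    close INPUT;
    close GOOD;
    close BAD;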
    Tom Melly, tom@tomandlu.co.uk
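      It should be faster if non-capturing groups are used in the regexp, since perl then has no captures to record. The {14} skips the first 14 fields, so the fixed-width checks land on fields 15 and 16:
      /^(?:[^\t]*\t){14}[^\t]{5}\t[^\t]{7}\t/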
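
Rather than taking the speed difference on trust, it can be measured. The following is a rough sketch using the core Benchmark module, assuming made-up 17-field sample records (the real data will behave differently) and a hypothetical helper named `count_good`. Both patterns skip 14 fields so that the two length checks apply to fields 15 and 16:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample records with 17 fields each; $bad has a 4-character field 15.
my $good = join( "\t", ( map { "f$_" } 1 .. 14 ), 'ABCDE', '1234567', 'tail' ) . "\n";
my $bad  = join( "\t", ( map { "f$_" } 1 .. 14 ), 'ABCD',  '1234567', 'tail' ) . "\n";
my @sample = ( $good, $bad ) x 500;

my $capturing     = qr/^([^\t]*\t){14}[^\t]{5}\t[^\t]{7}\t/;
my $non_capturing = qr/^(?:[^\t]*\t){14}[^\t]{5}\t[^\t]{7}\t/;

# Hypothetical helper: count how many sample records match a pattern.
sub count_good {
    my ($re) = @_;
    my $n = 0;
    for my $rec (@sample) { $n++ if $rec =~ $re }
    return $n;
}

my $n_cap    = count_good($capturing);
my $n_noncap = count_good($non_capturing);
print "capturing: $n_cap, non-capturing: $n_noncap\n";

# Run each variant for at least 1 CPU second and print a comparison table.
cmpthese( -1, {
    capturing     => sub { count_good($capturing) },
    non_capturing => sub { count_good($non_capturing) },
} );
```

The two counts should agree, since the patterns accept the same records; only the timing table differs. On a job this I/O-bound, the gap seen here may shrink once disk reads are included.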