in reply to Performance challenges

If the regex solution provided by Melly isn't as fast as you need, you might try:
    open( IN,   '<', "Your_data_here" ) or die "Cannot open input: $!";
    open( GOOD, '>', "Good_file_here" ) or die "Cannot open good file: $!";
    open( BAD,  '>', "Bad_file_here"  ) or die "Cannot open bad file: $!";
    while (<IN>) {
        my @row = split "\t", $_;
        # Keep rows whose 15th field is 5 chars long and whose 16th is 7.
        if ( length( $row[14] ) == 5 && length( $row[15] ) == 7 ) {
            print GOOD $_;
            next;
        }
        print BAD $_;
    }
    close BAD;
    close GOOD;
    close IN;
Note: awk processing a million rows/minute probably isn't that bad. I'm not sure Perl is going to be much faster. This is a very I/O-bound activity.
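If you want to check how much the split itself costs before blaming I/O, the core Benchmark module reports iterations per second for the CPU side alone. A minimal sketch (the 20-field sample line here is made up, not your data):

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # Hypothetical sample row with 20 tab-separated fields.
    my $line = join "\t", map { "field$_" } 1 .. 20;

    # Run each sub for at least 3 CPU seconds and compare rates.
    cmpthese( -3, {
        full_split    => sub { my @row = split "\t", $line },
        limited_split => sub { my @row = split "\t", $line, 17 },
    } );

If both rates come out far above what the disk can deliver, the job really is I/O-bound and further Perl tuning won't buy much.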

My criteria for good software:
  1. Does it work?
  2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?

Re^2: Performance challenges
by Eimi Metamorphoumai (Deacon) on Mar 22, 2006 at 18:24 UTC
    One suggestion: if each record has more than 16 fields, you might find slightly better performance with
    my @row = split /\t/, $_, 17;
    which tells perl to split into at most 17 fields (indices 0 through 15, with everything past the 16th tab left unsplit in index 16).
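    A throwaway example (the sample line is made up) shows what the limit does:

        my $line = join "\t", 'a' .. 'z';    # 26 tab-separated fields
        my @row  = split /\t/, $line, 17;    # at most 17 elements
        print scalar @row, "\n";             # prints 17
        print $row[16], "\n";                # "q" through "z", still tab-joined

    Fields 0 through 15 come back as usual, and perl skips the work of splitting the trailing data that the length tests never look at.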
Re^2: Performance challenges
by Anonymous Monk on Mar 22, 2006 at 13:31 UTC
    Thanks very much, folks! I really appreciate the fast response.

    dragonchild: I had a script similar to the one you wrote here, but I wasn't sure it was optimal. Sounds like it is. Thanks again! :) -Kris