in reply to Performance challenges

If the regex solution provided by Melly isn't as fast as you need, you might try:
    open( IN,   '<', "Your_data_here" ) or die "Cannot open input: $!";
    open( GOOD, '>', "Good_file_here" ) or die "Cannot open good file: $!";
    open( BAD,  '>', "Bad_file_here"  ) or die "Cannot open bad file: $!";
    while (<IN>) {
        my @row = split "\t", $_;
        # Keep rows whose 15th field is 5 chars long and whose 16th is 7.
        if ( length( $row[14] ) == 5 && length( $row[15] ) == 7 ) {
            print GOOD $_;
            next;
        }
        print BAD $_;
    }
    close BAD;
    close GOOD;
    close IN;
Note: awk processing a million rows/minute probably isn't that bad. I'm not sure Perl is going to be much faster. This is a very I/O-bound activity.
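If you want to check how much the split itself costs before blaming I/O, the core Benchmark module reports iterations per second for the CPU side alone. A minimal sketch (the 20-field sample line here is made up, not your data):

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # Hypothetical sample row with 20 tab-separated fields.
    my $line = join "\t", map { "field$_" } 1 .. 20;

    # Run each sub for at least 3 CPU seconds and compare rates.
    cmpthese( -3, {
        full_split    => sub { my @row = split "\t", $line },
        limited_split => sub { my @row = split "\t", $line, 17 },
    } );

If both rates come out far above what the disk can deliver, the job really is I/O-bound and further Perl tuning won't buy much.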

My criteria for good software:
  1. Does it work?
  2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?

Re^2: Performance challenges
by Eimi Metamorphoumai (Deacon) on Mar 22, 2006 at 18:24 UTC
    One suggestion: if each record has more than 16 fields, you might find slightly better performance with
    my @row = split /\t/, $_, 17;
    which tells perl to split into at most 17 fields (indices 0 through 15, with everything past the 16th tab left unsplit in index 16).
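    A throwaway example (the sample line is made up) shows what the limit does:

        my $line = join "\t", 'a' .. 'z';    # 26 tab-separated fields
        my @row  = split /\t/, $line, 17;    # at most 17 elements
        print scalar @row, "\n";             # prints 17
        print $row[16], "\n";                # "q" through "z", still tab-joined

    Fields 0 through 15 come back as usual, and perl skips the work of splitting the trailing data that the length tests never look at.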
Re^2: Performance challenges
by Anonymous Monk on Mar 22, 2006 at 13:31 UTC
    Thanks very much, folks! I really appreciate the fast response.

    dragonchild: I had a script similar to the one you wrote here, but I wasn't sure it was optimal. Sounds like it is. Thanks again! :) -Kris