Re: Extracting lines where one column matches a name from a list of names

You seek a performance improvement over your awk solution. For there to be an improvement, there must be room to improve. I think you mentioned in a follow-up comment that your list of names is in the same order in which the names appear in the files you are parsing. This is useful, as you can save the small amount of time it might have taken to load the list of names into a hash. So let's do a little profiling to find the floor, which is the time it takes just to read both files:

use strict;
use warnings;
use Time::HiRes qw(time);

open my $name_infh, '<', 'path/to/names/list' or die $!;
open my $haystack_infh, '<', 'path/to/tab/delimited/list' or die $!;

my $t0 = time();

while(<$name_infh>) {}
while(<$haystack_infh>) {}

printf "Elapsed time: %.3f\n", time() - $t0;

Now run that on the largest input file you've got and see how long it takes. If just reading the files takes too long, you can stop right there: no solution in Perl (or any other language) will meet your time requirements unless you change the requirements, for example by processing streams more frequently, or overnight when it doesn't matter.

If it is fast enough, then you could take the next step by implementing a solution in Perl that is similarly linear in its computational complexity:

use strict;
use warnings;

open my $name_infh     => '<', 'path/to/names/list'
    or die "Unable to open names list: $!\n";
open my $haystack_infh => '<', 'path/to/tab/del/file'
    or die "Unable to open haystack file: $!\n";

my $name = <$name_infh>;
die "Names list is empty\n" unless defined $name;
chomp $name;

while (my $line = <$haystack_infh>) {
    my ($test_name, $payload) = split /\t/, $line, 2;
    if ($name eq $test_name) {
        print "We have a winner: $test_name => $payload";
        $name = <$name_infh>;
        last if !defined $name;
        chomp $name;
    }
}

This operates under the assumption that there will be exactly one match for each name in your list, and that your names list is in the correct order. If those assumptions are incorrect (a single name with no match, for instance, stalls the scan on that name, so every later name is missed), then read your names list into a hash to start with; this will incur only a slight penalty -- so slight it's probably not worth maintaining your names list in any particular order to begin with. If it's not in order, just do this:

my %want;
while (<$name_infh>) {
    chomp;
    $want{$_}++;
}
while (my $line = <$haystack_infh>) {
    my ($test_name, $payload) = split /\t/, $line, 2;
    if (exists $want{$test_name}) {
        delete $want{$test_name};
        print "We have a winner: $test_name => $payload";
        last if ! keys %want;
    }
}

This last solution is still linear time, as was the previous one, but it doesn't depend on the order of the names. It still makes one assumption: you're only looking for each name one time. If that assumption isn't correct, remove the delete and the last if lines.
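
For what it's worth, here is a minimal sketch of that "keep every match" variant, wrapped in a sub so it can be tried without touching real files. The sub name lines_matching_names and the sample data are made up for illustration; the in-memory filehandles simply stand in for your names list and tab-delimited file.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption, not from the original post):
# returns every haystack line whose first tab-separated field is in the
# names list. Nothing is deleted from the hash, so repeated names in the
# haystack are all kept, and the whole haystack is always read.
sub lines_matching_names {
    my ($names_fh, $haystack_fh) = @_;

    # Load the wanted names into a hash for constant-time lookups.
    my %want;
    while (my $name = <$names_fh>) {
        chomp $name;
        $want{$name} = 1;
    }

    # Single pass over the haystack; only the first field is inspected.
    my @hits;
    while (my $line = <$haystack_fh>) {
        my ($test_name) = split /\t/, $line, 2;
        push @hits, $line if exists $want{$test_name};
    }
    return @hits;
}

# Made-up sample data, read through in-memory filehandles.
my $names    = "bob\nalice\n";
my $haystack = "alice\t1\ncarol\t2\nbob\t3\nalice\t4\n";

open my $names_fh,    '<', \$names    or die "Unable to open names: $!\n";
open my $haystack_fh, '<', \$haystack or die "Unable to open haystack: $!\n";

print for lines_matching_names($names_fh, $haystack_fh);
```

With the sample data this prints the two alice lines and the bob line, in haystack order. The trade-off is that without the early exit, every run reads the haystack to the end.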

At any rate, if the initial profiling check determined that the sheer act of reading the files takes longer than you have, you'll have to come up with a different strategy that doesn't involve sitting around waiting for large files to load.

Dave
