You seek a performance improvement over your awk solution. For there to be an improvement, there must be room to improve. I think you mentioned in a follow-up comment that your list of names is in the order in which they will appear in the files you are parsing. This is useful, as you can save the small amount of time it might have taken to import the list of names into a hash. So let's do a little profiling:

    use strict;
    use warnings;
    use Time::HiRes qw(time);

    open my $name_infh, '<', 'path/to/names/list' or die $!;
    open my $haystack_infh, '<', 'path/to/tab/delimited/list' or die $!;

    my $t0 = time();

    # Read both files line by line and do nothing else:
    # this measures the raw cost of the I/O alone.
    while (<$name_infh>)     {}
    while (<$haystack_infh>) {}

    printf "Elapsed time: %.3f\n", time() - $t0;

Now run that on the largest input file you've got and see how long it takes. If it takes too long, you can stop right there: no Perl solution (or solution in any other language) will meet your time requirements unless you change the requirements, for example by processing streams more frequently, or overnight when it doesn't matter.

If it is fast enough, then you could take the next step by implementing a solution in Perl that is similarly linear in its computational complexity:

    use strict;
    use warnings;

    open my $name_infh => '<', 'path/to/names/list'
        or die "Unable to open names list: $!\n";
    open my $haystack_infh => '<', 'path/to/tab/del/file'
        or die "Unable to open haystack file: $!\n";

    # The name we are currently looking for; the list is in file order.
    my $name = <$name_infh>;
    defined $name or die "Names list is empty\n";
    chomp $name;

    while (my $line = <$haystack_infh>) {
        my ($test_name, $payload) = split /\t/, $line, 2;
        if ($name eq $test_name) {
            print "We have a winner: $test_name => $payload";
            # Advance to the next name; stop once the list is exhausted.
            $name = <$name_infh>;
            last if !defined $name;
            chomp $name;
        }
    }

This operates under the assumption that there will be exactly one match for each name in your list, and that your names list is in the correct order. If those assumptions are incorrect, then read your names list into a hash to start with; this will incur only a slight penalty -- so slight it's probably not worth maintaining your names list in any particular order to begin with. If it's not in order, just do this:

    # Uses the same $name_infh and $haystack_infh handles opened above.
    my %want;
    while (<$name_infh>) {
        chomp;
        $want{$_}++;
    }

    while (my $line = <$haystack_infh>) {
        my ($test_name, $payload) = split /\t/, $line, 2;
        if (exists $want{$test_name}) {
            delete $want{$test_name};    # each name is wanted only once
            print "We have a winner: $test_name => $payload";
            last if !keys %want;         # nothing left to find
        }
    }

This last solution is still a linear-time solution, as was the previous one, but it is more flexible about the order in which things happen. It still makes one assumption: you're looking for each name only one time. If that assumption isn't correct, you can remove the delete and the last if lines.
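
For illustration, here is a rough sketch of that variant with the delete and the early exit taken out, so a name that occurs several times in the haystack matches every time. The sub name matching_lines and the sample data are made up for this example; in-memory filehandles stand in for your real files.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the code above): returns every haystack
# line whose first tab-delimited field appears in the names list.
sub matching_lines {
    my ($name_fh, $haystack_fh) = @_;

    # Load the wanted names into a hash for constant-time lookups.
    my %want;
    while (my $name = <$name_fh>) {
        chomp $name;
        $want{$name} = 1;
    }

    my @hits;
    while (my $line = <$haystack_fh>) {
        chomp $line;
        my ($test_name) = split /\t/, $line, 2;
        # No delete and no early exit: repeated names all match.
        push @hits, $line
            if defined $test_name && exists $want{$test_name};
    }
    return @hits;
}

# Made-up sample data; in-memory filehandles replace the real files.
my $names    = "alice\nbob\n";
my $haystack = "alice\t1\ncarol\t2\nalice\t3\nbob\t4\n";
open my $name_fh,     '<', \$names    or die "Unable to open names: $!\n";
open my $haystack_fh, '<', \$haystack or die "Unable to open haystack: $!\n";

print "We have a winner: $_\n" for matching_lines($name_fh, $haystack_fh);
```

The trade-off is that without the early exit the haystack is always read to the end, which is still linear time but gives up the shortcut of stopping once every name has been found.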

At any rate, if the initial profiling check determined that the sheer act of reading the files takes longer than you have, you'll have to come up with a different strategy that doesn't involve sitting around waiting for large files to load.


Dave


In reply to Re: Extacting lines where one column matches a name from a list of names by davido
in thread Extacting lines where one column matches a name from a list of names by mr_clean
