Oligo has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I'm trying to master the art of searching with hashes. I have a file with a list of search terms (one line = one term):

J00153:42:HC5NCBBXX:6:1101:10896:14959
J00153:42:HC5NCBBXX:6:1101:10896:14959
J00153:42:HC5NCBBXX:6:1101:26616:20709
J00153:42:HC5NCBBXX:6:1101:27549:19935

...and a master file I want to search for those terms in (again, one line per record):

 J00153:42:HC5NCBBXX:6:1101:10896:14959 99 gnl|Btau_4.6.1|chr16 + 72729218 1 12M
J00153:42:HC5NCBBXX:6:1101:27549:19935 83 gnl|Btau_4.6.1|chr8 + 49556412 1 7M

I started by opening the query file and reading each line into an array, then while-ing through the master file and printing any line whose first field matches one of the array elements:

# Open query file and read into array
$queryfile = $ARGV[0];
open (QUERYFILE, $queryfile) or die "Cannot open query file\n";
@queries = <QUERYFILE>;
close QUERYFILE;

# Open main file
$mainfile = $ARGV[1];
open (MAINFILE, $mainfile) or die "Cannot open searchable file\n";

# Search through main file
while ($inline = <MAINFILE>) {
    @split = split /\t/, $inline;
    $ID = $split[0];
    if (grep /$ID/, @queries) {
        print $inline;
    }
    else {
    }
}
exit;

This works fine, but the files are huge and the code takes an age to run. So, I tried converting the array to a hash (array elements = hash keys, values all = 1) but I can't seem to get the pattern matching syntax right; the code runs much faster but nothing comes back. So far I have:

# Open query file and read into array
$queryfile = $ARGV[0];
open (QUERYFILE, $queryfile) or die "Cannot open query file\n";
@queries = <QUERYFILE>;
close QUERYFILE;

# Convert array to hash
%hash = map {$_ => 1} @queries;

# Open main file
$mainfile = $ARGV[1];
open (MAINFILE, $mainfile) or die "Cannot open searchable file\n";

# Search through main file
while ($inline = <MAINFILE>) {
    @split = split /\t/, $inline;
    $ID = $split[0];
    if (defined $hash{$ID}) {
        print $inline;
    }
    else {
    }
}
exit;

Any Perly wisdom greatly appreciated!

Re: Hash searching
by Eily (Monsignor) on Aug 24, 2016 at 11:47 UTC

    The values you read from QUERYFILE contain the end-of-line characters, while your $ID does not, so $ID is never an exact match. You can try chomp @queries;, or maybe process all the input keys to make sure they match the expected format (e.g. remove extra spaces, or uppercase everything).
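
    For example, a minimal sketch of that fix, reusing the variable names from your own code:

    chomp @queries;                      # strip the trailing newlines first
    %hash = map { $_ => 1 } @queries;    # keys now match the IDs split from the master file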

    Although defined would work in this case (because the value is 1), exists is the standard way of checking the presence of a key in a hash.
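
    A quick illustration of the difference between the two (hypothetical data, not from the thread):

    my %h = (seen => undef, counted => 1);
    print exists  $h{seen} ? "yes\n" : "no\n";    # "yes" - the key is present
    print defined $h{seen} ? "yes\n" : "no\n";    # "no"  - its value is undef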

Re: Hash searching
by kcott (Archbishop) on Aug 24, 2016 at 16:03 UTC

    G'day Oligo,

    Firstly, here's a script (pm_1170300_hash_search.pl) that does what you want. See the end of my post for an explanatory discussion of this code.

    #!/usr/bin/env perl

    use strict;
    use warnings;
    use autodie;

    my %search_terms;

    {
        open my $search_terms_fh, '<', $ARGV[0];

        while (<$search_terms_fh>) {
            chomp;
            $search_terms{$_} = undef;
        }
    }

    {
        open my $master_file_fh, '<', $ARGV[1];

        while (<$master_file_fh>) {
            my ($id, undef) = split ' ', $_, 2;
            print if exists $search_terms{$id};
        }
    }

    I've used exactly the same search terms as you posted (including the two identical terms at the start):

    $ cat pm_1170300_search_terms.txt
    J00153:42:HC5NCBBXX:6:1101:10896:14959
    J00153:42:HC5NCBBXX:6:1101:10896:14959
    J00153:42:HC5NCBBXX:6:1101:26616:20709
    J00153:42:HC5NCBBXX:6:1101:27549:19935

    The master file you posted is not particularly good for testing because all lines match. I've retained the data you posted (including the possibly erroneous space at the start of the first line). I've also added two more lines that won't match.

    $ cat pm_1170300_master_file.txt
     J00153:42:HC5NCBBXX:6:1101:10896:14959 99 gnl|Btau_4.6.1|chr16 + 72729218 1 12M
    J00153:42:HC5NCBBXX:6:1101:27549:19935 83 gnl|Btau_4.6.1|chr8 + 49556412 1 7M
    X00153:42:HC5NCBBXX:6:1101:10896:14959 99 gnl|Btau_4.6.1|chr16 + 72729218 1 12M
    X00153:42:HC5NCBBXX:6:1101:27549:19935 83 gnl|Btau_4.6.1|chr8 + 49556412 1 7M

    Here's the output:

    $ pm_1170300_hash_search.pl pm_1170300_search_terms.txt pm_1170300_master_file.txt
     J00153:42:HC5NCBBXX:6:1101:10896:14959 99 gnl|Btau_4.6.1|chr16 + 72729218 1 12M
    J00153:42:HC5NCBBXX:6:1101:27549:19935 83 gnl|Btau_4.6.1|chr8 + 49556412 1 7M

    And now the explanatory discussion.

    Your code suggests that there's foundation knowledge you do not yet possess: I strongly recommend you read "perlintro -- a brief introduction and overview of Perl".

    Always use the strict and warnings pragmata at the start of all your Perl code. perlintro explains why.

    While it's good that you've attempted to check I/O operations, your efforts highlight the fact that this can be error-prone. Your hand-crafted error messages identify neither the problem file nor the problem itself. Let Perl do this for you with the autodie pragma.
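
    As a minimal sketch of what autodie buys you (with a hypothetical file name):

    use autodie;

    # No "or die ..." needed: on failure, open now throws an exception
    # whose message names both the file and the underlying OS error.
    open my $fh, '<', 'no_such_file.txt';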

    You have used globally scoped, package variables throughout your code. Not only is this highly error-prone, but potential errors can prove difficult to track down. In short: don't do this! Use lexically scoped variables, typically declared with my, as discussed in perlintro.

    In my code, only %search_terms has global scope, because it is used by both input and output processing. The scope of all other variables is confined to their enclosing anonymous blocks. This has the added benefit of automatically closing filehandles when they go out of scope: another potential source of errors removed.

    Your only other post ("Extracting BLAST hits from a list of sequences") involved biological data, and in this post you state that "the files are huge", so I'm assuming this is also biological data. Accordingly, I've added a few additional features to help performance and to keep memory usage to a minimum.

    $search_terms{$_} = undef
    The usual idiom is "++$search_terms{$_}". In both cases, "exists $search_terms{$key}" can be used. The autoincrement is slightly slower than straight assignment (with a small number of search terms, this may well be unnoticeable). As the actual value is immaterial, I've used undef; you could use some arbitrary value (0, 1, 42, etc.). Note: exists vs. defined.
    my ($id, undef) = ...
    We only want $id. Throw everything else away.
    ... = split ' ', $_, 2
    See split. The PATTERN ' ' is special; it handles any unwanted leading whitespace. Use of LIMIT allows Perl to allocate memory for split results at compile time; without this, memory allocation occurs at runtime.
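
    As a quick demonstration of that special-case PATTERN (hypothetical input, not part of the script above):

    my @awk   = split ' ', '  a b c';    # ('a', 'b', 'c') - leading whitespace skipped
    my @plain = split / /, '  a b c';    # ('', '', 'a', 'b', 'c') - leading empty fields kept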

    Benchmarking to show autoincrement is slightly, yet consistently, slower than straight assignment (in spoiler).

    Benchmark code (pm_1170300_bench_undef_hash_value.pl):

    #!/usr/bin/env perl

    use strict;
    use warnings;

    use Benchmark qw{cmpthese};

    my @keys = 'aa' .. 'zz';

    cmpthese 1e5 => {
        a_inc => sub { my %hash; ++$hash{$_}       for @keys },
        undef => sub { my %hash; $hash{$_} = undef for @keys },
        val_0 => sub { my %hash; $hash{$_} = 0     for @keys },
        val_1 => sub { my %hash; $hash{$_} = 1     for @keys },
        val42 => sub { my %hash; $hash{$_} = 42    for @keys },
    };

    Benchmark results:

    $ pm_1170300_bench_undef_hash_value.pl; pm_1170300_bench_undef_hash_value.pl; pm_1170300_bench_undef_hash_value.pl; pm_1170300_bench_undef_hash_value.pl; pm_1170300_bench_undef_hash_value.pl
            Rate a_inc val_0 undef val42 val_1
    a_inc 4129/s    --   -1%   -1%   -1%   -2%
    val_0 4151/s    1%    --   -1%   -1%   -1%
    undef 4188/s    1%    1%    --   -0%   -0%
    val42 4191/s    2%    1%    0%    --   -0%
    val_1 4198/s    2%    1%    0%    0%    --
            Rate a_inc undef val_1 val42 val_0
    a_inc 4193/s    --   -2%   -2%   -2%   -2%
    undef 4272/s    2%    --   -0%   -0%   -0%
    val_1 4275/s    2%    0%    --   -0%   -0%
    val42 4290/s    2%    0%    0%    --   -0%
    val_0 4290/s    2%    0%    0%    0%    --
            Rate a_inc val_0 val_1 undef val42
    a_inc 4228/s    --   -1%   -1%   -1%   -2%
    val_0 4277/s    1%    --   -0%   -0%   -1%
    val_1 4292/s    2%    0%    --    0%   -0%
    undef 4292/s    2%    0%    0%    --   -0%
    val42 4301/s    2%    1%    0%    0%    --
            Rate a_inc val_0 val_1 val42 undef
    a_inc 4174/s    --   -1%   -2%   -2%   -2%
    val_0 4207/s    1%    --   -1%   -1%   -1%
    val_1 4239/s    2%    1%    --   -0%   -0%
    val42 4243/s    2%    1%    0%    --   -0%
    undef 4243/s    2%    1%    0%    0%    --
            Rate a_inc val_0 val_1 undef val42
    a_inc 4239/s    --   -1%   -2%   -2%   -2%
    val_0 4279/s    1%    --   -1%   -1%   -1%
    val_1 4305/s    2%    1%    --   -0%   -0%
    undef 4308/s    2%    1%    0%    --   -0%
    val42 4316/s    2%    1%    0%    0%    --

    Benchmarking to show that throwing away unwanted split results is faster, and that using LIMIT is also faster (in spoiler).

    Benchmark code (pm_1170300_bench_split_limit.pl):

    #!/usr/bin/env perl

    use strict;
    use warnings;

    use Benchmark qw{cmpthese};

    my $string = ' J00153:42:HC5NCBBXX:6:1101:10896:14959 99 gnl|Btau_4.6.1|chr16 72729218 1 12M';

    cmpthese 1e7 => {
        limit_undef => sub { my ($id, undef) = split ' ', $string, 2 },
        nolim_undef => sub { my ($id, undef) = split ' ', $string },
        limit_array => sub { my ($id, @rest) = split ' ', $string, 2 },
        nolim_array => sub { my ($id, @rest) = split ' ', $string },
    };

    Benchmark results:

    $ pm_1170300_bench_split_limit.pl; pm_1170300_bench_split_limit.pl; pm_1170300_bench_split_limit.pl; pm_1170300_bench_split_limit.pl; pm_1170300_bench_split_limit.pl
                     Rate nolim_array limit_array nolim_undef limit_undef
    nolim_array  346500/s          --        -68%        -68%        -77%
    limit_array 1071811/s        209%          --         -2%        -29%
    nolim_undef 1090513/s        215%          2%          --        -27%
    limit_undef 1499250/s        333%         40%         37%          --
                     Rate nolim_array limit_array nolim_undef limit_undef
    nolim_array  355492/s          --        -64%        -68%        -76%
    limit_array 1000000/s        181%          --        -10%        -32%
    nolim_undef 1107420/s        212%         11%          --        -24%
    limit_undef 1461988/s        311%         46%         32%          --
                     Rate nolim_array limit_array nolim_undef limit_undef
    nolim_array  346981/s          --        -68%        -69%        -76%
    limit_array 1096491/s        216%          --         -1%        -24%
    nolim_undef 1111111/s        220%          1%          --        -23%
    limit_undef 1445087/s        316%         32%         30%          --
                     Rate nolim_array nolim_undef limit_array limit_undef
    nolim_array  359195/s          --        -67%        -67%        -76%
    nolim_undef 1089325/s        203%          --         -0%        -27%
    limit_array 1092896/s        204%          0%          --        -27%
    limit_undef 1492537/s        316%         37%         37%          --
                     Rate nolim_array nolim_undef limit_array limit_undef
    nolim_array  362319/s          --        -68%        -68%        -76%
    nolim_undef 1114827/s        208%          --         -1%        -26%
    limit_array 1127396/s        211%          1%          --        -25%
    limit_undef 1506024/s        316%         35%         34%          --

    — Ken

      Most of the difference between limit_undef and nolim_undef actually comes from the fact that the undef in the list makes perl split one additional time, just to throw the extra field away. This is because, when the list on the left side has a fixed length, split is called with an implicit limit of length+1. So my ($id, undef) = split ' ', $string; is actually my ($id, undef) = split ' ', $string, 3;. The benchmark rewritten as:

      use strict;
      use warnings;

      use Benchmark qw{cmpthese};

      my $string = ' J00153:42:HC5NCBBXX:6:1101:10896:14959 99 gnl|Btau_4.6.1|chr16 72729218 1 12M';

      cmpthese 1e7 => {
          limit_undef => sub { my ($id,) = split ' ', $string, 2 },
          nolim_undef => sub { my ($id,) = split ' ', $string },
          limit_array => sub { my ($id, @rest) = split ' ', $string, 2 },
          nolim_array => sub { my ($id, @rest) = split ' ', $string },
      };
      gives the result:
                       Rate nolim_array limit_array limit_undef nolim_undef
      nolim_array  573888/s          --        -59%        -67%        -67%
      limit_array 1396453/s        143%          --        -20%        -21%
      limit_undef 1746725/s        204%         25%          --         -1%
      nolim_undef 1760873/s        207%         26%          1%          --
      So you can omit the limit when splitting to a fixed-size list.

        ++ Thanks for the additional information and benchmarks.

        — Ken

      This is fantastic, thank you so much. I will endeavour to be less sloppy in my pragmata in future!

        "This is fantastic, thank you so much."

        You're welcome.

        "I will endeavour to be less sloppy in my pragmata in future!"

        That made me laugh. A candidate for your signature, perhaps? :-)

        — Ken