
G'day Oligo,

Firstly, here's a script (pm_1170300_hash_search.pl) that does what you want. See the end of my post for an explanatory discussion of this code.

#!/usr/bin/env perl

use strict;
use warnings;
use autodie;

my %search_terms;

{
    open my $search_terms_fh, '<', $ARGV[0];

    while (<$search_terms_fh>) {
        chomp;
        $search_terms{$_} = undef;
    }
} 

{
    open my $master_file_fh, '<', $ARGV[1];

    while (<$master_file_fh>) {
        my ($id, undef) = split ' ', $_, 2;
        print if exists $search_terms{$id};
    } 
}

I've used exactly the same search terms as you posted (including the two identical terms at the start):

$ cat pm_1170300_search_terms.txt
J00153:42:HC5NCBBXX:6:1101:10896:14959
J00153:42:HC5NCBBXX:6:1101:10896:14959
J00153:42:HC5NCBBXX:6:1101:26616:20709
J00153:42:HC5NCBBXX:6:1101:27549:19935

The master file you posted is not particularly good for testing because all lines match. I've retained the data you posted (including the possibly erroneous space at the start of the first line). I've also added two more lines that won't match.

$ cat pm_1170300_master_file.txt
 J00153:42:HC5NCBBXX:6:1101:10896:14959    99    gnl|Btau_4.6.1|chr16    72729218    1    12M
J00153:42:HC5NCBBXX:6:1101:27549:19935    83    gnl|Btau_4.6.1|chr8    49556412    1    7M
 X00153:42:HC5NCBBXX:6:1101:10896:14959    99    gnl|Btau_4.6.1|chr16    72729218    1    12M
X00153:42:HC5NCBBXX:6:1101:27549:19935    83    gnl|Btau_4.6.1|chr8    49556412    1    7M

Here's the output:

$ pm_1170300_hash_search.pl pm_1170300_search_terms.txt pm_1170300_master_file.txt
 J00153:42:HC5NCBBXX:6:1101:10896:14959    99    gnl|Btau_4.6.1|chr16    72729218    1    12M
J00153:42:HC5NCBBXX:6:1101:27549:19935    83    gnl|Btau_4.6.1|chr8    49556412    1    7M

And now the explanatory discussion.

Your code suggests that there's foundation knowledge you do not yet possess: I strongly recommend you read "perlintro -- a brief introduction and overview of Perl".

Always use the strict and warnings pragmata at the start of all your Perl code; perlintro explains why.

While it's good that you've attempted to check I/O operations, your efforts highlight the fact that this can be error-prone. Your hand-crafted error messages identify neither the problem file nor the problem itself. Let Perl do this for you with the autodie pragma.
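As a rough illustration of what autodie reports, here is a minimal sketch that deliberately tries to open a file that should not exist and captures the resulting exception. The path and the open_error_for helper are made up for this example; the exact wording of the message may vary between autodie versions.

```perl
#!/usr/bin/env perl

use strict;
use warnings;
use autodie;

# Hypothetical path, chosen only because it should not exist.
my $missing_file = '/no/such/dir/pm_example_missing.txt';

# Return the text of the exception autodie raises when open() fails.
sub open_error_for {
    my ($path) = @_;
    my $error = '';

    eval {
        open my $fh, '<', $path;    # no "or die ..." needed
        1;
    } or $error = "$@";             # stringify the exception object

    return $error;
}

# The message names both the file and the reason for the failure.
print open_error_for($missing_file), "\n";
```

The point is that the failing file name and the system error text come for free, without any hand-written die message.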

You have used globally scoped, package variables throughout your code. Not only is this highly error-prone, but potential errors can prove difficult to track down. In short: don't do this! Use lexically scoped variables, typically declared with my, as discussed in perlintro.

In my code, only %search_terms is visible throughout the file, because it is used by both input and output processing. (It is still a lexical variable declared with my, not a package variable.) The scope of all other variables is confined to their enclosing bare blocks. This has the added benefit of automatically closing filehandles when they go out of scope: another potential source of errors removed.
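To see the automatic close in action, here's a small sketch that writes to a scratch file inside one bare block and reads it back in another, with no explicit close anywhere in those blocks. The scratch file (via File::Temp) and the sample line are assumptions made purely for the demonstration.

```perl
#!/usr/bin/env perl

use strict;
use warnings;
use autodie;
use File::Temp qw(tempfile);

# Create a scratch file; tempfile() also returns an open handle we don't need.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
close $tmp_fh;

{
    open my $out_fh, '>', $tmp_name;
    print {$out_fh} "J00153:42\n";
    # No explicit close: the lexical handle goes out of scope at the
    # closing brace, so Perl flushes and closes it for us.
}

my $line;

{
    open my $in_fh, '<', $tmp_name;
    $line = <$in_fh>;
}

print "read back: $line";
```

If the output handle were still open (and unflushed) after the first block, the second block could not be relied on to see the data.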

Your only other post ("Extracting BLAST hits from a list of sequences") involved biological data. In this post you state that "the files are huge", so I'm assuming this is also biological data. Accordingly, I've added a few additional features to help performance and to keep memory usage to a minimum.

$search_terms{$_} = undef: The usual idiom is "++$search_terms{$_}". In both cases, "exists $search_terms{$key}" can be used. The autoincrement is slightly slower than straight assignment (with a small number of search terms, this may well be unnoticeable). As the actual value is immaterial, I've used undef; you could use some arbitrary value (0, 1, 42, etc.). Note: exists vs. defined.
my ($id, undef) = ...: We only want $id. Throw everything else away.
... = split ' ', $_, 2: See split. The PATTERN ' ' is special; it handles any unwanted leading whitespace. Use of LIMIT tells Perl to stop splitting once it has the requested number of fields, so the remainder of each line is returned as a single unsplit field rather than being broken up further.
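The three points above can be sketched in a few lines. The shortened ID and record below are made up for illustration; only exists, defined and split themselves come from the discussion.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

my %search_terms;
$search_terms{'J00153:42'} = undef;    # key is present, value is undefined

# exists asks "is the key there?"; defined asks "is the value defined?"
my $key_exists  = exists  $search_terms{'J00153:42'} ? 1 : 0;
my $val_defined = defined $search_terms{'J00153:42'} ? 1 : 0;

# A made-up record with a leading space, like the first master-file line.
my $record = " J00153:42    99    chr16    12M\n";

# PATTERN ' ' skips leading whitespace; LIMIT 2 leaves the rest unsplit.
my ($id, $rest) = split ' ', $record, 2;

# A regex PATTERN does not skip leading whitespace: the first field is empty.
my ($first_field) = split /\s+/, $record;

printf "exists=%d defined=%d id=%s first_field='%s'\n",
    $key_exists, $val_defined, $id, $first_field;
```

This is why the script tests with exists rather than defined: with undef as the stored value, defined would never report a match.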

Benchmarking to show autoincrement is slightly, yet consistently, slower than straight assignment (in spoiler).
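The spoiler content isn't reproduced here. As a sketch only, a comparison along these lines could be set up with the core Benchmark module; the synthetic terms, the sub names and the iteration count are my assumptions, not the original benchmark, and timings will differ between machines and Perl versions.

```perl
#!/usr/bin/env perl

use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic search terms (made up for this sketch).
my @terms = map { "J00153:42:HC5NCBBXX:6:1101:$_" } 1 .. 10_000;

# Populate a hash by straight assignment of undef.
sub fill_by_assignment {
    my %h;
    $h{$_} = undef for @terms;
    return scalar keys %h;
}

# Populate a hash with the usual autoincrement idiom.
sub fill_by_increment {
    my %h;
    ++$h{$_} for @terms;
    return scalar keys %h;
}

# Run each sub 200 times and print a comparison table.
cmpthese(200, {
    assign_undef  => \&fill_by_assignment,
    autoincrement => \&fill_by_increment,
});
```

Run it several times before drawing conclusions; small differences can easily be swamped by noise on a busy machine.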

Benchmarking to show that throwing away unwanted split results is faster, and that using LIMIT is also faster (in spoiler).
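Again, the spoiler content isn't reproduced here. A hedged sketch of how the split variants might be compared with the core Benchmark module follows; the sub names and iteration count are assumptions of mine, and the record is one of the master-file lines from earlier in this post.

```perl
#!/usr/bin/env perl

use strict;
use warnings;
use Benchmark qw(cmpthese);

my $record = "J00153:42:HC5NCBBXX:6:1101:10896:14959    99    "
           . "gnl|Btau_4.6.1|chr16    72729218    1    12M\n";

# Keep every field in an array, then pick out the first.
sub id_keep_all {
    my @fields = split ' ', $record;
    return $fields[0];
}

# Keep only the first field; discard the rest via undef.
sub id_discard {
    my ($id, undef) = split ' ', $record;
    return $id;
}

# As above, but with an explicit LIMIT of 2.
sub id_with_limit {
    my ($id, undef) = split ' ', $record, 2;
    return $id;
}

# Run each sub 200,000 times and print a comparison table.
cmpthese(200_000, {
    keep_all   => \&id_keep_all,
    discard    => \&id_discard,
    with_limit => \&id_with_limit,
});
```

All three subs return the same ID, so any difference in the table reflects only how much of the line split has to process and store.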

— Ken

In reply to Re: Hash searching by kcott
in thread Hash searching by Oligo
