in reply to Searching Huge files
A large part of the trick is to make only one pass through each file and to use a lookup in some in-memory structure to perform the test. If your data gets too big for that to work, you need to think about using a database, but for only a few million entries a hash works well. Consider:
use strict;
use warnings;

my $snpFile = <<SNP;
snp_rs log_1_pval
rs3749375 11.7268615355335
rs10499549 10.4656064706897
rs7837688 9.85374546064131
rs4794737 9.41576680248523
rs10033399 9.36407447191822
rs4242382 9.22809709356544
rs4242384 8.91767075801336
rs9656816 8.61480602028324
rs982354 8.40833878650415
rs31226 8.38047936810042
SNP

my $mapFile = <<MAP;
rs10904494 NP_817124 17881
rs7837688 NP_817124 39800
rs4881551 ZMYND11 21567
rs7909028 ZMYND11 5335
rs10499549 ZMYND11 0
rs12779173 ZMYND11 0
rs2448370 ZMYND11 0
rs2448366 ZMYND11 0
rs2379078 ZMYND11 0
rs3749375 ZMYND11 0
MAP

my %snpLookup;

# Populate the hash
open my $snpIn, '<', \$snpFile or die "Can't open snp file: $!";
/^(\w+)\s+(\d+\.\d+)/ and $snpLookup{$1} = $2 while <$snpIn>;
close $snpIn;

# Perform the search and report
open my $mapIn, '<', \$mapFile or die "Can't open map file: $!";
/^(\w+)\s+(\w+)/ and exists $snpLookup{$1} and print "$1 $2\n" while <$mapIn>;
close $mapIn;
Prints:
rs7837688 NP_817124
rs10499549 ZMYND11
rs3749375 ZMYND11
The output is in a different order from that specified in the OP (matches are printed in map-file order rather than snp-file order). If the order is important, there are various ways of fixing the problem at very little extra performance cost.
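For example, one cheap way to restore snp-file order is to remember the line index at which each snp id was first seen, collect the matches during the single map-file pass, and sort them once at the end by that remembered index. This is a minimal sketch of the idea using stand-in data, not the OP's real files:

```perl
use strict;
use warnings;

# Stand-in data: snp ids in the order they appear in the snp file.
# In the real script this index would be recorded while reading the file.
my @snpOrder = qw(rs3749375 rs10499549 rs7837688);
my %snpLookup;
$snpLookup{ $snpOrder[$_] } = $_ for 0 .. $#snpOrder;

# Matches as they come out of the map-file scan (map-file order).
my @matches = (
    [ 'rs7837688',  'NP_817124' ],
    [ 'rs10499549', 'ZMYND11'   ],
    [ 'rs3749375',  'ZMYND11'   ],
);

# One sort at the end restores snp-file order.
for my $pair ( sort { $snpLookup{ $a->[0] } <=> $snpLookup{ $b->[0] } } @matches ) {
    print "@$pair\n";
}
```

The extra cost is a single sort over the (usually much smaller) set of matches, so the files are still read only once each.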
Replies are listed 'Best First'.
Re^3: Searching Huge files
  by biomonk (Acolyte) on Jul 08, 2008 at 04:20 UTC
  by GrandFather (Saint) on Jul 08, 2008 at 04:46 UTC
  by biomonk (Acolyte) on Jul 08, 2008 at 13:06 UTC
  by biomonk (Acolyte) on Jul 09, 2008 at 20:36 UTC
  by GrandFather (Saint) on Jul 09, 2008 at 21:27 UTC
  by biomonk (Acolyte) on Jul 11, 2008 at 19:50 UTC
Re^2: Searching Huge files
  by biomonk (Acolyte) on Jul 08, 2008 at 03:32 UTC