in reply to Searching Huge files

A large part of the trick is to only make one pass through each file and to use a lookup in some in-memory structure to perform the test. If your data gets too big for that to work then you need to think about using a database, but for only a few million entries a hash works well. Consider:

use strict;
use warnings;

my $snpFile = <<SNP;
snp_rs log_1_pval
rs3749375 11.7268615355335
rs10499549 10.4656064706897
rs7837688 9.85374546064131
rs4794737 9.41576680248523
rs10033399 9.36407447191822
rs4242382 9.22809709356544
rs4242384 8.91767075801336
rs9656816 8.61480602028324
rs982354 8.40833878650415
rs31226 8.38047936810042
SNP

my $mapFile = <<MAP;
rs10904494 NP_817124 17881
rs7837688 NP_817124 39800
rs4881551 ZMYND11 21567
rs7909028 ZMYND11 5335
rs10499549 ZMYND11 0
rs12779173 ZMYND11 0
rs2448370 ZMYND11 0
rs2448366 ZMYND11 0
rs2379078 ZMYND11 0
rs3749375 ZMYND11 0
MAP

my %snpLookup;

# Populate the hash
open my $snpIn, '<', \$snpFile or die "Can't open snp file: $!";
/^(\w+)\s+(\d+\.\d+)/ and $snpLookup{$1} = $2 while <$snpIn>;
close $snpIn;

# Perform the search and report
open my $mapIn, '<', \$mapFile or die "Can't open map file: $!";
/^(\w+)\s+(\w+)/ and exists $snpLookup{$1} and print "$1 $2\n" while <$mapIn>;
close $mapIn;

Prints:

rs7837688 NP_817124
rs10499549 ZMYND11
rs3749375 ZMYND11

The output is in a different order than specified in the OP. If the order is important there are various ways of fixing the problem with very little extra performance cost.
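One way to do that, sketched here under the assumption that the wanted order is that of the SNP file: store the SNP line number ($.) in the hash instead of the score, collect the matches, and sort them by that number before printing. The names %snpOrder, @matches and @ordered are illustrative, and the data is a cut-down copy of the sample above.

```perl
use strict;
use warnings;

# Cut-down copies of the sample data (assumption: SNP-file order is wanted)
my $snpData = "rs3749375 11.7268615355335\n"
            . "rs10499549 10.4656064706897\n"
            . "rs7837688 9.85374546064131\n";
my $mapData = "rs7837688 NP_817124 39800\n"
            . "rs10499549 ZMYND11 0\n"
            . "rs3749375 ZMYND11 0\n";

# Remember the line number at which each SNP id was seen
my %snpOrder;
open my $snpIn, '<', \$snpData or die "Can't open snp data: $!";
while (<$snpIn>) {
    next unless /^(\w+)\s+\d+\.\d+/;
    $snpOrder{$1} = $.;
}
close $snpIn;

# Collect the matches rather than printing them straight away
my @matches;
open my $mapIn, '<', \$mapData or die "Can't open map data: $!";
while (<$mapIn>) {
    next unless /^(\w+)\s+(\w+)/ and exists $snpOrder{$1};
    push @matches, [$snpOrder{$1}, "$1 $2"];
}
close $mapIn;

# Sort by the remembered SNP line number, then report
my @ordered = map { $_->[1] } sort { $a->[0] <=> $b->[0] } @matches;
print "$_\n" for @ordered;
```

The extra cost is one sort over the matched lines only, which is usually far fewer than the lines in either file.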


Perl is environmentally friendly - it saves trees

Re^3: Searching Huge files
by biomonk (Acolyte) on Jul 08, 2008 at 04:20 UTC

    Hi GrandFather, can you please explain the flow of the program in more detail? I come from a biology background and this is my first Perl program. This logic is very helpful to me, as I can use it a number of times in my work, so I want to understand it rather than just copy the code. Please also guide me in populating a hash from a file and searching through a file. Thanks a lot.

      open my $snpIn, '<', \$snpFile uses a variable as though it were a file. It's a useful trick for test code because you don't need a separate file. Simply replace the \$snpFile bit with the file name you would normally use with the open.
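As a rough illustration of that replacement, the sketch below opens a file by name. To keep it runnable on its own it first writes two sample lines to a temporary file with File::Temp; in real use $snpFileName would simply hold the path to your SNP file. The variable names $tmpFh and $snpFileName are made up for this example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Stand-in for the real SNP file: a temporary file with two sample lines.
# In real use, set $snpFileName to the path of your own file instead.
my ($tmpFh, $snpFileName) = tempfile(UNLINK => 1);
print {$tmpFh} "snp_rs log_1_pval\nrs3749375 11.7268615355335\n";
close $tmpFh;

# Same open as in the test code, but with a file name in place of \$snpFile
my %snpLookup;
open my $snpIn, '<', $snpFileName or die "Can't open $snpFileName: $!";
/^(\w+)\s+(\d+\.\d+)/ and $snpLookup{$1} = $2 while <$snpIn>;
close $snpIn;

print "$_ => $snpLookup{$_}\n" for sort keys %snpLookup;
```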

      The code that populates the hash uses a couple of tricks so that it is compact. Expanded it might look like:

      while (<$snpIn>) {
          next unless /^(\w+)\s+(\d+\.\d+)/;
          $snpLookup{$1} = $2;
      }

      Note that the original code used while as a statement modifier and the code above uses unless as a statement modifier. Note too that the value given on the input line is the value associated with the key (the first 'word' on the line). You could instead assign $. which would give the line number, or you could ++$snpLookup{$1} instead which would give a count of the number of entries for that 'word' in the file.
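A small sketch of those two alternatives side by side. The hashes %lineOf and %countOf are illustrative names, and the repeated rs3749375 line is invented here purely so that the count is visible; it is not in the sample data.

```perl
use strict;
use warnings;

# Sample lines; the third line is a made-up duplicate for illustration
my $snpData = "rs3749375 11.7268615355335\n"
            . "rs10499549 10.4656064706897\n"
            . "rs3749375 11.7268615355335\n";

my (%lineOf, %countOf);
open my $snpIn, '<', \$snpData or die "Can't open snp data: $!";
while (<$snpIn>) {
    next unless /^(\w+)\s+(\d+\.\d+)/;
    $lineOf{$1} = $.;    # line number of the most recent occurrence
    ++$countOf{$1};      # number of times this 'word' has been seen
}
close $snpIn;

print "$_: last seen on line $lineOf{$_}, seen $countOf{$_} time(s)\n"
    for sort keys %lineOf;
```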

      In like fashion the search loop can be expanded:

      while (<$mapIn>) {
          next unless /^(\w+)\s+(\w+)/ and exists $snpLookup{$1};
          print "$1 $2\n";
      }

      The important test is exists $snpLookup{$1}, which checks whether the first 'word' on the line was also the first 'word' of a line in the first file. The test is only made if the regular expression succeeds. Using the regular expression in that way avoids possible nastiness at the end of the file and wherever the file format is not as you expect. See perlretut and perlre for more about regular expressions.
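Because each hash value is the number read from the first file, a matching line can also be reported together with that value. A minimal sketch, assuming a cut-down copy of the sample data and an illustrative @report array:

```perl
use strict;
use warnings;

# Cut-down copies of the sample data
my $snpData = "rs3749375 11.7268615355335\n"
            . "rs10499549 10.4656064706897\n"
            . "rs7837688 9.85374546064131\n";
my $mapData = "rs10904494 NP_817124 17881\n"
            . "rs7837688 NP_817124 39800\n"
            . "rs10499549 ZMYND11 0\n"
            . "rs3749375 ZMYND11 0\n";

# Populate the hash: key is the SNP id, value is the number on that line
my %snpLookup;
open my $snpIn, '<', \$snpData or die "Can't open snp data: $!";
while (<$snpIn>) {
    next unless /^(\w+)\s+(\d+\.\d+)/;
    $snpLookup{$1} = $2;
}
close $snpIn;

# Search as before, but append the stored value to each reported line
my @report;
open my $mapIn, '<', \$mapData or die "Can't open map data: $!";
while (<$mapIn>) {
    next unless /^(\w+)\s+(\w+)/ and exists $snpLookup{$1};
    push @report, "$1 $2 $snpLookup{$1}";
}
close $mapIn;

print "$_\n" for @report;
```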


        Thank you very much, GrandFather, for taking the time to write such a detailed explanation.

        Hello GrandFather, I have a new problem now: I need the score from the snp file (the first file), so my output should look something like this.

        rs7837688 NP_817124 9.85374546064131
        rs10499549 ZMYND11 10.4656064706897
        rs3749375 ZMYND11 11.7268615355335
        I'm confused. Can you help me, please? Thank you in advance.
Re^2: Searching Huge files
by biomonk (Acolyte) on Jul 08, 2008 at 03:32 UTC
    Thanks a lot for your replies; they helped me a lot. GrandFather especially, you rock.