biomonk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I need a little help with searching/matching the contents of huge files. I have two files: the first file (the snp file) contains ids and a score and is about 2,00,000 lines long; the second file (the map file) contains ids, genes and a score, and is 3 to 4 times larger than the first. I need to search for each id of the first file in the second file and, if there is a match, print that line of the second file into a new file. I wrote a program to do it, but it takes days to complete, so I need your help in optimizing it and making it reasonably fast.

My files are:(Both files are tab delimited)

First file:

#snp file
snp_rs      log_1_pval
rs3749375   11.7268615355335
rs10499549  10.4656064706897
rs7837688   9.85374546064131
rs4794737   9.41576680248523
rs10033399  9.36407447191822
rs4242382   9.22809709356544
rs4242384   8.91767075801336
rs9656816   8.61480602028324
rs982354    8.40833878650415
rs31226     8.38047936810042
.........   .........

Second file

#Map file
rs10904494  NP_817124  17881
rs7837688   NP_817124  39800
rs4881551   ZMYND11    21567
rs7909028   ZMYND11    5335
rs10499549  ZMYND11    0
rs12779173  ZMYND11    0
rs2448370   ZMYND11    0
rs2448366   ZMYND11    0
rs2379078   ZMYND11    0
rs3749375   ZMYND11    0
.........   .......    .

My new file should look like this:

rs3749375   ZMYND11
rs10499549  ZMYND11
rs7837688   NP_817124

I can't use a binary search, as the files are not sorted, and besides, I don't know how to do one.

Have a look at my code here

# This program is for getting snps and genes
open(SNP, "D:\\gsea.chi2") or die("File can't be opened");
<SNP>;    # skip the header line
while ($line = <SNP>) {
    @snps = split(/\t/, $line);
    pop(@snps);    # drop the score, keep the id
    foreach $id (@snps) {
        #print "$id \n";
        search($id);
    }
}

sub search {
    $snpid = $_[0];
    #print "$snpid \n";
    # re-reads the whole map file for every single id
    open(MAP, "D:\\gsea1.SNPGENEMAP") or die("File can't be opened");
    my @map = <MAP>;
    close(MAP);
    foreach $mapid (@map) {
        if ($mapid =~ m/^$snpid/i) {
            print $mapid;
            last;
        }
    }
}
Thank you very much in advance.

Re: Searching Huge files
by GrandFather (Saint) on Jul 08, 2008 at 02:08 UTC

    A large part of the trick is to only make one pass through each file and to use a lookup in some in-memory structure to perform the test. If your data gets too big for that to work then you need to think about using a database, but for only a few million entries a hash works well. Consider:

    use strict;
    use warnings;

my $snpFile = <<SNP;
snp_rs log_1_pval
rs3749375 11.7268615355335
rs10499549 10.4656064706897
rs7837688 9.85374546064131
rs4794737 9.41576680248523
rs10033399 9.36407447191822
rs4242382 9.22809709356544
rs4242384 8.91767075801336
rs9656816 8.61480602028324
rs982354 8.40833878650415
rs31226 8.38047936810042
SNP

my $mapFile = <<MAP;
rs10904494 NP_817124 17881
rs7837688 NP_817124 39800
rs4881551 ZMYND11 21567
rs7909028 ZMYND11 5335
rs10499549 ZMYND11 0
rs12779173 ZMYND11 0
rs2448370 ZMYND11 0
rs2448366 ZMYND11 0
rs2379078 ZMYND11 0
rs3749375 ZMYND11 0
MAP

my %snpLookup;

# Populate the hash
open my $snpIn, '<', \$snpFile or die "Can't open snp file: $!";
/^(\w+)\s+(\d+\.\d+)/ and $snpLookup{$1} = $2 while <$snpIn>;
close $snpIn;

# Perform the search and report
open my $mapIn, '<', \$mapFile or die "Can't open map file: $!";
/^(\w+)\s+(\w+)/ and exists $snpLookup{$1} and print "$1 $2\n" while <$mapIn>;
close $mapIn;

    Prints:

    rs7837688 NP_817124
    rs10499549 ZMYND11
    rs3749375 ZMYND11

    The output is in a different order than specified in the OP. If the order is important there are various ways of fixing the problem with very little extra performance cost.
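    For instance, one possible way (a sketch building on the sample code above, not the only option) is to record each id's position while populating the hash, collect the matches, and sort them by that position before printing:

    # Record the order in which ids appear in the snp file.
    my %snpOrder;
    open my $snpIn, '<', \$snpFile or die "Can't open snp file: $!";
    while (<$snpIn>) {
        $snpOrder{$1} = $. if /^(\w+)\s+(\d+\.\d+)/;
    }
    close $snpIn;

    # Collect the matches, then print them in snp-file order.
    my @matches;
    open my $mapIn, '<', \$mapFile or die "Can't open map file: $!";
    while (<$mapIn>) {
        push @matches, [$1, $2] if /^(\w+)\s+(\w+)/ and exists $snpOrder{$1};
    }
    close $mapIn;
    print "$_->[0] $_->[1]\n"
        for sort { $snpOrder{ $a->[0] } <=> $snpOrder{ $b->[0] } } @matches;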


    Perl is environmentally friendly - it saves trees

      Hi GrandFather, can you please explain the flow of the program in more detail? I come from a biology background and this is my first Perl program. This logic is very helpful to me, as I can use it many times in my work, so I want to understand it rather than just copy the code. Please also guide me on populating a hash from a file and searching through a file. Thanks a lot.

        open my $snpIn, '<', \$snpFile uses a variable as though it were a file. It's a useful trick for test code because you don't need a separate file. Simply replace the \$snpFile bit with the file name you would normally use with the open.
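        For example, a minimal sketch of the same open calls pointing at the real files (using the paths from your post):

        open my $snpIn, '<', 'D:\\gsea.chi2'        or die "Can't open snp file: $!";
        open my $mapIn, '<', 'D:\\gsea1.SNPGENEMAP' or die "Can't open map file: $!";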

        The code that populates the hash uses a couple of tricks so that it is compact. Expanded it might look like:

        while (<$snpIn>) {
            next unless /^(\w+)\s+(\d+\.\d+)/;
            $snpLookup{$1} = $2;
        }

        Note that the original code used while as a statement modifier and the code above uses unless as a statement modifier. Note too that the value given on the input line is the value associated with the key (the first 'word' on the line). You could instead assign $. which would give the line number, or you could ++$snpLookup{$1} instead which would give a count of the number of entries for that 'word' in the file.
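        For example, a quick sketch of those two variations on the same loop (neither is needed for your task; they just show different choices of value):

        while (<$snpIn>) {
            next unless /^(\w+)\s+(\d+\.\d+)/;
            $snpLookup{$1} = $.;     # store the line number where the id was seen
            # or: ++$snpLookup{$1};  # count how many times the id appears
        }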

        In like fashion the search loop can be expanded:

        while (<$mapIn>) {
            next unless /^(\w+)\s+(\w+)/ and exists $snpLookup{$1};
            print "$1 $2\n";
        }

        The important test is exists $snpLookup{$1}, which uses exists to check whether the first 'word' on the line was also a first 'word' in the first file. The test is only made if the regular expression succeeds. Using the regular expression in that way avoids possible nastiness at the end of the file, and guards against lines where the file format is not what you expect. See perlretut and perlre for more about regular expressions.


        Perl is environmentally friendly - it saves trees
      Thanks a lot for your replies, they helped me a lot; especially GrandFather, you rock.
Re: Searching Huge files
by Tanktalus (Canon) on Jul 08, 2008 at 01:53 UTC

    My first thought is to see if DBD::CSV would be able to handle this. I suspect it could. Then you'd be able to merely issue an SQL statement to get the data you need. This will likely be the biggest bang for the buck - though that probably won't be much bang, it also won't be much buck (cheap to implement).
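    A rough sketch of that approach, assuming DBD::CSV (and its SQL::Statement engine) is installed. The table names, column names and attribute values below are illustrative guesses mapped onto the files from the original post; check the DBD::CSV docs for the exact attributes your version supports and for how it treats header lines:

    use strict;
    use warnings;
    use DBI;

    # Treat the tab-delimited files in D:\ as database tables.
    my $dbh = DBI->connect('dbi:CSV:', undef, undef, {
        f_dir        => 'D:/',
        csv_sep_char => "\t",
        RaiseError   => 1,
    }) or die $DBI::errstr;

    # Tell DBD::CSV which file each (made-up) table name refers to.
    $dbh->{csv_tables}{snp} = {
        f_file    => 'gsea.chi2',
        col_names => [qw(snp_rs log_1_pval)],
    };
    $dbh->{csv_tables}{snpgenemap} = {
        f_file    => 'gsea1.SNPGENEMAP',
        col_names => [qw(snp_rs gene score)],
    };

    # One join instead of nested loops over the two files.
    my $sth = $dbh->prepare(
        'SELECT snpgenemap.snp_rs, snpgenemap.gene
           FROM snpgenemap JOIN snp ON snpgenemap.snp_rs = snp.snp_rs'
    );
    $sth->execute;
    while (my ($id, $gene) = $sth->fetchrow_array) {
        print "$id\t$gene\n";
    }

    Whether the join is actually fast on files this size is another question, since DBD::CSV still reads the flat files underneath; that's where the next step comes in.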

    Once you have that working, the next step may be to load the data into a real database (anything from sqlite at one end to DB2 or Oracle at the other), which should be fast, and then ask the database to return the result for what should be basically the same query. This takes more effort to set up, so any time it saves has to be significant enough to overcome the setup cost (copying the data into the database) to be worth it; the return on investment depends on how important it is to you that your transform runs quickly. Even here there's room for tweaking, based on how you set up your indexes and so on; if you have in-house database experts, you may want their input on this.
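    As a concrete (hypothetical) sketch of the sqlite route using DBD::SQLite: the database file name, table names and column names are all placeholders, and depending on the real files you may also want to skip header lines explicitly.

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=snp.sqlite', '', '',
                           { RaiseError => 1, AutoCommit => 1 });

    $dbh->do('CREATE TABLE snp        (snp_rs TEXT, log_1_pval REAL)');
    $dbh->do('CREATE TABLE snpgenemap (snp_rs TEXT, gene TEXT, score INTEGER)');

    # Bulk-load a tab-delimited file into a table; a single transaction
    # keeps the setup cost (copying the data in) reasonable.
    sub load_tsv {
        my ($dbh, $table, $path, $ncols) = @_;
        open my $in, '<', $path or die "Can't open $path: $!";
        my $ins = $dbh->prepare(
            "INSERT INTO $table VALUES (" . join(',', ('?') x $ncols) . ')');
        $dbh->begin_work;
        while (<$in>) {
            chomp;
            next if /^#/;                   # ignore comment lines
            my @fields = split /\t/;
            next unless @fields == $ncols;  # ignore malformed lines
            $ins->execute(@fields);
        }
        $dbh->commit;
        close $in;
    }

    load_tsv($dbh, 'snp',        'D:\\gsea.chi2',        2);
    load_tsv($dbh, 'snpgenemap', 'D:\\gsea1.SNPGENEMAP', 3);

    # An index on the join column is what makes the lookup cheap.
    $dbh->do('CREATE INDEX idx_map_rs ON snpgenemap (snp_rs)');

    # After that the whole job is a single query.
    my $sth = $dbh->prepare(
        'SELECT m.snp_rs, m.gene FROM snpgenemap m JOIN snp s ON m.snp_rs = s.snp_rs');
    $sth->execute;
    while (my ($id, $gene) = $sth->fetchrow_array) {
        print "$id\t$gene\n";
    }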

    Just my two cents :-)

Re: Searching Huge files
by dragonchild (Archbishop) on Jul 08, 2008 at 13:37 UTC
    Use a database. This is exactly what a database is designed to do. Any work you do would simply be replicating what has literally had man-DECADES thrown at it.
    1. Load files into a database. 2M rows and 8M rows are small-ish tables.
    2. Run SQL queries against those tables.

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?

      I thank all of you for your support and quick responses; I now feel that I'm not alone in this Perl world.

Re: Searching Huge files
by jethro (Monsignor) on Jul 08, 2008 at 02:26 UTC
    You have to put one of the two files into a hash, it doesn't really matter which one. Since the hash won't fit into memory, it must be put on disk, either into a real database like MySQL or, better in this case, into a DBM::Deep database.

    Then just loop through the other file and look up the ids in the hash. That should be a million-fold speedup, literally.
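    A small sketch of that approach with DBM::Deep (the lookup file name is a placeholder, and the regexes are borrowed from GrandFather's reply above):

    use strict;
    use warnings;
    use DBM::Deep;

    # The hash lives on disk, so its size is not limited by available RAM.
    my $db = DBM::Deep->new('snp_lookup.db');

    # Build the lookup from the snp file once.
    open my $snpIn, '<', 'D:\\gsea.chi2' or die "Can't open snp file: $!";
    while (<$snpIn>) {
        $db->{$1} = $2 if /^(\w+)\s+(\d+\.\d+)/;
    }
    close $snpIn;

    # Stream through the map file and print the matching lines.
    open my $mapIn, '<', 'D:\\gsea1.SNPGENEMAP' or die "Can't open map file: $!";
    while (<$mapIn>) {
        print "$1 $2\n" if /^(\w+)\s+(\w+)/ and exists $db->{$1};
    }
    close $mapIn;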

      You have to put one of the two files into a hash, it doesn't really matter which one.

      Actually, there's a good chance that it does matter. If one file has about 2 million rows/keys, and the other has about 8 million, it will take noticeably fewer resources and less time to store the keys of the smaller file into a hash. As GrandFather suggested above, there's a reasonable chance that a hash of 2 million elements could fit into RAM without causing the machine to flail due to virtual memory content being bounced back and forth between RAM and the swap file.

      But whether it's in-memory or in a DBM file of some sort, creating 2 million keys will be quicker than 8 million (and it would just seem to make more sense). Of course, once a hash has been built, access time is not likely to differ all that much (except when an "in-memory" hash is big enough to induce swapping), but the time/space needed to build the hash may differ significantly depending on the quantity of elements involved.