in reply to Searching Huge files
A large part of the trick is to make only one pass through each file and to use a lookup in some in-memory structure to perform the test. If your data gets too big for that to work, you need to think about using a database, but for only a few million entries a hash works well. Consider:
use strict;
use warnings;

my $snpFile = <<SNP;
snp_rs log_1_pval
rs3749375 11.7268615355335
rs10499549 10.4656064706897
rs7837688 9.85374546064131
rs4794737 9.41576680248523
rs10033399 9.36407447191822
rs4242382 9.22809709356544
rs4242384 8.91767075801336
rs9656816 8.61480602028324
rs982354 8.40833878650415
rs31226 8.38047936810042
SNP

my $mapFile = <<MAP;
rs10904494 NP_817124 17881
rs7837688 NP_817124 39800
rs4881551 ZMYND11 21567
rs7909028 ZMYND11 5335
rs10499549 ZMYND11 0
rs12779173 ZMYND11 0
rs2448370 ZMYND11 0
rs2448366 ZMYND11 0
rs2379078 ZMYND11 0
rs3749375 ZMYND11 0
MAP

my %snpLookup;

# Populate the hash
open my $snpIn, '<', \$snpFile or die "Can't open snp file: $!";
/^(\w+)\s+(\d+\.\d+)/ and $snpLookup{$1} = $2 while <$snpIn>;
close $snpIn;

# Perform the search and report
open my $mapIn, '<', \$mapFile or die "Can't open map file: $!";
/^(\w+)\s+(\w+)/ and exists $snpLookup{$1} and print "$1 $2\n" while <$mapIn>;
close $mapIn;
Prints:
rs7837688 NP_817124
rs10499549 ZMYND11
rs3749375 ZMYND11
The output is in a different order from that specified in the OP (matches are printed in map-file order rather than snp-file order). If the order is important, there are various ways of fixing the problem at very little extra performance cost.
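For example, one cheap way to restore snp-file order is to remember the line index at which each snp id was first seen, collect the matches during the single map-file pass, and sort them once at the end by that remembered index. This is a minimal sketch of the idea using stand-in data, not the OP's real files:

```perl
use strict;
use warnings;

# Stand-in data: snp ids in the order they appear in the snp file.
# In the real script this index would be recorded while reading the file.
my @snpOrder = qw(rs3749375 rs10499549 rs7837688);
my %snpLookup;
$snpLookup{ $snpOrder[$_] } = $_ for 0 .. $#snpOrder;

# Matches as they come out of the map-file scan (map-file order).
my @matches = (
    [ 'rs7837688',  'NP_817124' ],
    [ 'rs10499549', 'ZMYND11'   ],
    [ 'rs3749375',  'ZMYND11'   ],
);

# One sort at the end restores snp-file order.
for my $pair ( sort { $snpLookup{ $a->[0] } <=> $snpLookup{ $b->[0] } } @matches ) {
    print "@$pair\n";
}
```

The extra cost is a single sort over the (usually much smaller) set of matches, so the files are still read only once each.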
Replies are listed 'Best First'.
Re^3: Searching Huge files
  by biomonk (Acolyte) on Jul 08, 2008 at 04:20 UTC
  by GrandFather (Saint) on Jul 08, 2008 at 04:46 UTC
  by biomonk (Acolyte) on Jul 08, 2008 at 13:06 UTC
  by biomonk (Acolyte) on Jul 09, 2008 at 20:36 UTC
  by GrandFather (Saint) on Jul 09, 2008 at 21:27 UTC
  by biomonk (Acolyte) on Jul 11, 2008 at 19:50 UTC
Re^2: Searching Huge files
  by biomonk (Acolyte) on Jul 08, 2008 at 03:32 UTC