hash lookup

dr_joe has asked for the wisdom of the Perl Monks concerning the following question:

Esteemed Monks,

This script I wrote to make two hashes and then search one hash with the keys from the other is only returning the correct result for the last key in the hash being searched. Additionally, it is taking a long time to run. I have reached the limits of my ability in trying to troubleshoot this. Any wisdom received from you will be much appreciated.

#!/usr/bin/perl -w
#GIDgiver.pl

#this script reads gene symbols and assigns the matching Entrez GeneID
+ to each gene


use strict;

open FH1, 'GID_GeneSymbol.txt', or die "Can't open GID file\n";
open FH2, 'CVD_genelist.txt', or die "Can't open CVD gene list\n";


our (@list1, @list2, @temp);
my %GID = ();
my %CVDgenes = ();

while (<FH1>){
    chomp;
    push (@list1, $_);
    foreach (@list1){
    @temp=split(/\s+/, $_);
    }
    %GID = map { $temp[0] => $temp[1] } @temp;

}

while (<FH2>){
    chomp;
    push (@list2, $_);
    }
    %CVDgenes = map { $_ => 1 } @list2;


while ((my $k, my $v) = each(%GID)) {
    if  (exists ($CVDgenes{$v})){
    print "The GID for gene $v is $k\n";
}}
[download]

Comment on hash lookup Download Code

Replies are listed 'Best First'.
Re: hash lookup by moritz (Cardinal) on Oct 12, 2009 at 21:33 UTC
`%GID = map { $temp[0] => $temp[1] } @temp;` This assigns two elements to `%GID`, throwing away all other items in that hash that might been in there before. Not what you want. Instead, just write `$GID{$temp[0]} = $temp[1]`. There's also no reason to build the `@list1` and `@list2` arrays. Finally you can get rid of `%GID` completely by first iterating over `<FH2>` and building `%CVDgenes`, and then iterate over `<FH1>` directly. Perl 6 - links to (nearly) everything that is Perl 6.	[reply] [d/l] [select]
Re^2: hash lookup by dr_joe (Initiate) on Oct 12, 2009 at 22:00 UTC
Thanks, Moritz. At your leisure, would you kindly give me a bit more detail about how to approach the last part of your suggestions (getting rid of %GID completely)?	[reply]
Re^3: hash lookup by moritz (Cardinal) on Oct 12, 2009 at 22:14 UTC
There's really not much detail to be given. Currently you iterate over one file, store all elements in a hash, and later on you iterate over all the data in the hash. Unless you need the hash for weeding out duplicates, you simply iterate over the data that you would have put into th hash when you need it. Perl 6 - links to (nearly) everything that is Perl 6.	[reply]
Re: hash lookup by almut (Canon) on Oct 12, 2009 at 22:17 UTC
To elaborate on what moritz hinted at with respect to `@list1`: for every iteration (i.e. line in the file), the list gets longer, but you're only ever making effective use of the last entry anyway in what you do with the list: `foreach (@list1){ @temp=split(/\s+/, $_); }` [download] would leave in `@temp` only the result of splitting the last entry of `@list1`. This creates unnecessary work which is only making your program slow... Here's a simplified version (if I understood correctly what you're trying to do): `# create lookup table while (<FH2>) { chomp; $CVDgenes{$_} = 1; } # go through the other file and check if its entries # are found in the lookup table: while (<FH1>){ chomp; my ($gene, $gid) = split(/\s+/, $_); if (exists $CVDgenes{$gene}) { print "The GID for gene $gene is $gid\n"; } }` [download]	[reply] [d/l] [select]
Re^2: hash lookup by Anonymous Monk on Oct 12, 2009 at 22:45 UTC
OK, now I get it, and the code does exactly what I need it to. The insight from both of you is much appreciated.	[reply]
Re^3: hash lookup by dr_joe (Initiate) on Oct 12, 2009 at 22:47 UTC
That's me above, I forgot to log in.	[reply]