in reply to how to compare two hashes with perl?

It would help us a lot if you provided some example input and some sample output too (what you want and what you actually get).

Do your IDs come in the same order in the two files? In which case there is no need to read in the whole of the first file; you can just compare the two files line by line.
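
A minimal sketch of that line-by-line idea, assuming both files really do list the same IDs in the same order, with the ID in the first whitespace-separated field and the sequence in the last. The helper name compare_in_order and the in-memory sample data are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: compare two already-open filehandles line by line.
# Assumes one record per line, ID in the first field, sequence in the last,
# and the same ID order in both inputs.
sub compare_in_order {
    my ($fh1, $fh2) = @_;
    my @report;
    while (defined(my $line1 = <$fh1>)) {
        my $line2 = <$fh2>;
        last unless defined $line2;                  # second input is shorter
        my ($id1, $seq1) = (split ' ', $line1)[0, -1];
        my ($id2, $seq2) = (split ' ', $line2)[0, -1];
        next if $id1 eq $id2 && $seq1 eq $seq2;      # identical record, skip
        push @report, "$id1\t$seq1\t$id2\t$seq2";
    }
    return @report;
}

# In-memory "files" so the sketch runs without any real input.
my $data1 = "id1 + chr12 100 ACGT\nid2 + chr12 200 GGCC\n";
my $data2 = "id1 + chr12 100 ACGT\nid2 + chr12 200 GGTT\n";
open my $fh1, '<', \$data1 or die "Cannot open in-memory file 1: $!";
open my $fh2, '<', \$data2 or die "Cannot open in-memory file 2: $!";
print "$_\n" for compare_in_order($fh1, $fh2);
```

With the sample data only the id2 record is reported, because its sequences differ. If the IDs can appear in a different order, this approach does not apply and a hash lookup is the better fit.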

I would guess that your main source of inefficiency comes from how you are doing your comparisons:

for my $ID1 (keys %bow1) {
    for my $ID2 (keys %bow2) {
        ...

This will iterate through every ID in the first hash and compare it with every ID in the second hash! From the sounds of it you are only interested in comparing entries with matching IDs, so why not use a hash as it was intended and look up the appropriate ID?

Also, I don't understand why you are storing the IDs twice: both as the key and as an entry in the value array ( $bow2{$ID2}[0] = $ID2; ).

Lastly and very much OT, you should always check for success on filehandle operations:

open (my $fh2, '<', $file2) or die "Failed to open $file2 for reading : $!";
...do stuff...
close $fh2 or die "Failed to close $file2 : $!";
Just a something something...

Re^2: how to compare two hashes with perl?
by FluffyBunny (Acolyte) on Nov 04, 2009 at 21:01 UTC

    Thank you for your reply.

    IDs might not be in the same order; that's why I'm looking for a certain ID I have in file 1 to match with any ID in file 2...

    This is what I wanted to check basically.

    1) Check ID names.

    2) If they match, and the sequences match, do not print.

    3) If they match, but the sequences do not match, print both ID and the sequences from each file.

    4) If they don't match, print both ID and the sequences from each file.

    I'm a newbie, and I'm trying to understand hash.. it's just confusing and I'm not exactly sure how my file gets stored in hash. I hear hash is random when it prints output and I don't want my IDs to get mixed up with the wrong sequences (each ID uniquely corresponds to one sequence).

    I updated the original post with my output and input files.

    Thank you!
      I hear hash is random when it prints output

      That just means that the order in which you add key/value pairs to a hash is not necessarily the order in which you get them back when you loop over the hash. Here is an example:

      use strict;
      use warnings;

      $\ = "\n";
      $, = ', ';

      my %hash = ();
      $hash{"h"} = 10;
      $hash{"z"} = 20;
      $hash{"a"} = 30;

      foreach my $key (keys %hash) {
          print "$key: $hash{$key}";
      }

      --output:--
      a: 30
      h: 10
      z: 20

      However, the key/value pairs are the same. A key will never be associated with a value that you did not enter for that key.
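
      If the output order matters, a common approach is to sort the keys before looping. A minimal sketch (my own illustration, reusing the same three pairs):

```perl
use strict;
use warnings;

my %hash = (h => 10, z => 20, a => 30);

# keys() returns the keys in no guaranteed order, so sort them whenever
# the output order matters. Each key still maps to the value stored for it.
my @lines = map { "$_: $hash{$_}" } sort keys %hash;
print "$_\n" for @lines;
# prints:
# a: 30
# h: 10
# z: 20
```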

      it's just confusing and I'm not exactly sure how my file gets stored in hash

      Take a look at this example:

      use strict;
      use warnings;

      $\ = "\n";
      $, = ', ';

      my %results = ();

      my $line = 'HWUSI-EAS548:7:1:5:1527#0/1 + chr12 52084152 CGGAGC';
      my @pieces = split /\s+/, $line;
      my $id = $pieces[0];
      my $seq = $pieces[-1];
      $results{$id} = $seq;

      foreach my $key (keys %results) {
          print "$key -----> $results{$key}";
      }

      --output:--
      HWUSI-EAS548:7:1:5:1527#0/1 -----> CGGAGC

      If you want to gather all the sequences corresponding to an id, you can do this:

      use strict;
      use warnings;

      $\ = "\n";
      $, = ', ';

      my %results = ();

      while (<DATA>) {
          my @pieces = split /\s+/;
          my $id = $pieces[0];
          my $seq = $pieces[-1];
          $results{$id} = [] unless exists $results{$id};
          push @{$results{$id}}, $seq;
      }

      foreach my $key (keys %results) {
          my $arr_str = join ',', @{$results{$key}};
          print "$key -----> [$arr_str]";
      }

      __DATA__
      HWUSI-EAS548:7:1:5:1527#0/1 + chr12 52084152 CGGAGC
      HWUSI-EAS548:7:1:5:1527#0/1 + chr12 52084152 XXXXXX
      Some_other_id + chr12 52084152 CGGAGC

      You might want to experiment a little more with hashes in a separate practice program. For instance, you might want to read perlintro and perldsc, which you can read by typing:

      $ man perlintro

      or

      $ man perldsc

      For a complete list of topics available type:

      $ man perl

      and scroll down.

        while (<DATA>) {
            my @pieces = split /\s+/;
            my $id = $pieces[0];
            my $seq = $pieces[-1];
            $results{$id} = [] unless exists $results{$id};
            push @{$results{$id}}, $seq;
        }

        Actually, as perlreftut instructs, the line:

        $results{$id} = [] unless exists $results{$id};

        is unnecessary. I highly recommend that you read perlreftut:

        $ man perlreftut
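
        A minimal sketch of why that line can go, using made-up IDs rather than the real data: pushing onto @{ $results{$id} } creates the anonymous array on first use (autovivification), so no explicit initialisation is needed.

```perl
use strict;
use warnings;

my %results;

# No "$results{$id} = []" needed: dereferencing the missing entry as an
# array inside push creates the anonymous array automatically.
for my $pair (['id_a', 'CGGAGC'], ['id_a', 'XXXXXX'], ['id_b', 'CGGAGC']) {
    my ($id, $seq) = @$pair;
    push @{ $results{$id} }, $seq;
}

for my $id (sort keys %results) {
    print "$id -----> [", join(',', @{ $results{$id} }), "]\n";
}
```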

      I take it this is bowtie output? It makes no sense to me why you are comparing all IDs in the first file to all IDs in the second. The whole point of using a hash is that you can look up specific keys, whereas an array would be for storing an ordered list.

      What are you actually trying to do? Get the common IDs between the files and say whether their associated sequences match? You can try something like this for that:

      foreach my $id (keys %hash1) {    # you can use (sort keys %hash1) if you want them in a specified order
          if ( exists $hash2{$id} ) {
              print "\'$id\' exists in both hashes.\n";
              if ( $hash1{$id} eq $hash2{$id} ) {    ## id and sequence are stored as key value pairs
                  print "and the sequences match too.\n";
              }
              else {
                  print "but the sequences do not match.\n";
              }
          }
          else {
              print "\'$id\' only exists in hash1.\n";
          }
      }

      If you want help with data structures, try perldsc for starters.
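
      The loop above only walks the keys of the first hash, so IDs that occur only in the second hash are never reported. A rough sketch of one way to cover both directions, assuming each hash maps an ID to its sequence; the helper name diff_hashes, the "-" placeholder, and the sample data are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: report every ID whose sequence differs between the
# two hashes, plus every ID present in only one of them. IDs that match
# with identical sequences are skipped.
sub diff_hashes {
    my ($h1, $h2) = @_;
    my %seen = map { $_ => 1 } (keys %$h1, keys %$h2);   # union of all IDs
    my @report;
    for my $id (sort keys %seen) {
        if (exists $h1->{$id} && exists $h2->{$id}) {
            next if $h1->{$id} eq $h2->{$id};            # same ID, same sequence
            push @report, "$id\t$h1->{$id}\t$h2->{$id}";
        }
        elsif (exists $h1->{$id}) {
            push @report, "$id\t$h1->{$id}\t-";          # only in the first hash
        }
        else {
            push @report, "$id\t-\t$h2->{$id}";          # only in the second hash
        }
    }
    return @report;
}

my %hash1 = (id1 => 'ACGT', id2 => 'GGCC', id3 => 'TTTT');
my %hash2 = (id1 => 'ACGT', id2 => 'GGTT', id4 => 'AAAA');
print "$_\n" for diff_hashes(\%hash1, \%hash2);
```

      With the sample data this reports id2 (sequences differ), id3 (only in the first hash) and id4 (only in the second), and stays silent about id1.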

      Just a something something...

        Hello BioLion,

        Basically I followed your code,
        use warnings;
        use strict;

        my %bow1 = ();
        my $file1 = shift;
        open (FILE1, "$file1");    # Open first file
        while (<FILE1>) {
            my ($ID1, undef, undef, undef, $Seq1) = split;
            $bow1{$ID1} = $ID1;
            $bow1{$Seq1} = $Seq1;
            print STDERR "$bow1{$ID1}\t$bow1{$Seq1}\n";
        }
        close FILE1;

        my %bow2 = ();
        my $file2 = shift;
        open (FILE2, "$file2");    # Open second file
        while (<FILE2>) {
            my ($ID2, undef, undef, undef, $Seq2) = split;
            $bow2{$ID2} = $ID2;
            $bow2{$Seq2} = $Seq2;
            print STDERR "$bow2{$ID2}\t$bow2{$Seq2}\n";
        }
        close FILE2;

        foreach my $ID1 (keys %bow1){    # can use (sort keys %hash) to put items in a specified order
            if ( exists $bow2{$ID2} ){
                if ( $bow1{$ID1} eq $bow2{$ID2} ){    ## id and sequence are stored as key value pairs
                    print "$bow1{$ID1} exists in $file1 and $file2 and the sequences match $bow1{$Seq1} $bow2{$Seq2} \n";
                }
                else{
                    print "$bow1{$ID1} exists in $file1 and $file2 but sequences DO NOT match $bow1{$Seq1} $bow2{$Seq2} \n";
                }
            }
            else {
                print "$bow1{$ID1} only exists in $file1 .\n";
            }
        }
        exit;
        However I get some errors
        Global symbol "$ID2" requires explicit package name at /home/choia2/scripts/BowtieCompare.pl line 50.
        Global symbol "$ID2" requires explicit package name at /home/choia2/scripts/BowtieCompare.pl line 51.
        Global symbol "$Seq1" requires explicit package name at /home/choia2/scripts/BowtieCompare.pl line 53.
        Global symbol "$Seq2" requires explicit package name at /home/choia2/scripts/BowtieCompare.pl line 53.
        Global symbol "$Seq1" requires explicit package name at /home/choia2/scripts/BowtieCompare.pl line 56.
        Global symbol "$Seq2" requires explicit package name at /home/choia2/scripts/BowtieCompare.pl line 56.
        Execution of /home/choia2/scripts/BowtieCompare.pl aborted due to compilation errors.

        Basically the foreach loop.. I never used hash for other programming languages (I wasn't professional though) but this hash concept is confusing.. could you help me one more time? >.<