Comparing two hashes-help

Gemchal has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I am new to Perl, so please be gentle. I am trying to compare two hashes. I have a reference text file (1) and another text file (2) that I would like to compare (eventually I will have another 19 files to compare to the reference). I want to compare the two text files and, if a key (representing a position, e.g. 245) is in both text file 1 and text file 2, keep the value from text file 2. Then I want to compare the next line in the files: if a position in text file 1 is missing from text file 2, I want to add that position and its value to text file 2. I hope this makes sense. I have pasted the script that I have written below, but the problem is that when I run it I get an identical copy of text file 1 saved as text file 2. I hope you can help. Thanks, Gemma

#! /usr/bin/perl -w

use 5.010;

open (REF, "ref_snp.txt");
open (GENOTYPE, "$ARGV[0]");
open (OUT, ">$ARGV[1]");


my %genotype;
my %ref;
while(<GENOTYPE>) 

{
if (/(\d+)\t\w\t(\w)/) 
{
%genotype=($1=>$2,);
my @position=keys %genotype;
   #foreach $position (sort keys %genotype) 
   # {
   
# print OUT "$position\t$genotype{$position}\n";   
   # }
}
else 
{
print "It doesn'y match!$_ not match!\n\n";
}
   

}

while (<REF>)
{
if (/(\d+)\t(\w)/)
{
%ref = ($1=>$2,);
my @position_ref = keys %ref;
foreach $position_ref (sort keys %ref)
{
if (exists $genotype{$position_ref}) 
{
print OUT "$position_ref\t$genotype{$position_ref}\n";
}
else 
{
$genotype{$position_ref}=$ref{$position_ref};
print OUT "$position_ref\t$genotype{$position_ref}\n";
}
}
}

}

Re: Comparing two hashes-help by toolic (Bishop) on Aug 20, 2010 at 13:39 UTC
I'll admit that your question is not very clear to me, but I think I see a problem with your hashes. Inside your while loop, you keep assigning just a single key/value pair to your hash (clobbering any old keys), whereas you probably want to keep adding keys to your hash: `%genotype=($1=>$2,);` Print out your hash after your while loop to see what I mean (using Data::Dumper): `use Data::Dumper; print Dumper(\%genotype);` Maybe you want something like this: `$genotype{$1} = $2;` Since you are new to Perl, maybe the Basic debugging checklist will come in handy. Tips #1, 2, 4, 7 and 10 may apply here. If that is not your problem, you should include a very small sample of your input data.
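A minimal sketch of the difference toolic describes, run side by side on the same input. The three sample lines are made up for illustration; they only assume the tab-separated layout that the original regex expects.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical sample lines in the layout the regex expects:
# position, tab, one character, tab, one character.
my @lines = ("245\tA\tG\n", "300\tC\tT\n", "412\tG\tA\n");

my (%clobbered, %accumulated);
for my $line (@lines) {
    next unless $line =~ /(\d+)\t\w\t(\w)/;
    %clobbered = ($1 => $2);    # list assignment replaces the whole hash
    $accumulated{$1} = $2;      # element assignment adds or updates one key
}

$Data::Dumper::Sortkeys = 1;    # stable key order for readable output
print Dumper(\%clobbered);      # only the last pair (412 => 'A') survives
print Dumper(\%accumulated);    # all three positions are present
```

Because the question's script uses the first form, `%genotype` never holds more than one position by the time the reference file is read.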
Re: Comparing two hashes-help by kennethk (Abbot) on Aug 20, 2010 at 13:59 UTC
In addition to what toolic says above, I'd point out that the bug he found on line 17 appears a second time on line 37 (`%ref = ($1=>$2,);` instead of `$ref{$1} = $2;`). You also have a lot of nested loops: for each line in `REF` you cycle over all keys in `%ref`. Once the hash assignment is fixed, this algorithm will result in a large number of identical lines in your output file. I cannot really judge whether this is appropriate, though, since I have a little difficulty following your description. To make it clearer, post an example of input and expected output (wrapped in `<code>` tags to maintain format). An example is worth 1,000 words. See How do I post a question effectively?.
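To see the duplication kennethk is warning about, here is a small sketch that counts output lines both ways. The four reference lines are invented, and the sketch assumes the hash assignment has already been changed to `$ref{$1} = $2`.

```perl
use strict;
use warnings;

# Hypothetical reference lines: position, tab, base.
my @ref_lines = ("100\tA\n", "200\tC\n", "300\tG\n", "400\tT\n");

my %ref;
my (@nested, @single);
for my $line (@ref_lines) {
    next unless $line =~ /(\d+)\t(\w)/;
    my ($pos, $base) = ($1, $2);
    $ref{$pos} = $base;

    # Nested version: re-emit every key seen so far on each input line.
    push @nested, "$_\t$ref{$_}\n" for sort { $a <=> $b } keys %ref;

    # Single-pass version: emit only the line just read.
    push @single, "$pos\t$base\n";
}

printf "nested loop: %d lines, single pass: %d lines\n",
    scalar @nested, scalar @single;
# prints "nested loop: 10 lines, single pass: 4 lines"
```

With n reference lines the nested version emits 1 + 2 + ... + n lines, so the output grows with the square of the input size.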
Re: Comparing two hashes-help by MajingaZ (Beadle) on Aug 20, 2010 at 16:13 UTC
my %genotype;
while (<GENOTYPE>) {
    $genotype{$1} = $2 if /(\d+)\t\w\t(\w)/;
}
while (<REF>) {
    next unless /(\d+)\t(\w)/;
    # Prefer the genotype value when one exists for this position.
    print OUT ((defined $genotype{$1} and $genotype{$1} ne '')
        ? "$1\t$genotype{$1}\n"
        : "$1\t$2\n");
}

Can't verify your regexes without sample data; given the files you're dealing with, I would have thought that they might have been formatted the same.

Your for loop when reading your ref file would make lots of extraneous entries. You are building your %ref hash and then, after every addition to it, you dump out the entire contents, so your output would grow quadratically! And unless you actually want that behavior, I see no reason here why you'd bother with %ref at all.

You appear to just want to update the genotypes located in the first file into your second file. So just put the positions and genotypes of the first file into a hash. Then read the ref and see if a located position in the ref file has a value in the genotype hash; if yes, print that, otherwise print the ref values.

However, you have not mentioned what you want if a position and genotype exist in your first file that are not in your reference. Perhaps that can never happen?

Depending on your dataset size, you might just loop over the files: `for my $file (@files) { open(GENOTYPE, $file); ... }`
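A self-contained sketch of the approach MajingaZ describes, using lexical filehandles so it can be run without the real data. The helper names `read_genotype` and `merge_with_ref` and the sample text are hypothetical, and the column layout is only inferred from the regexes in the question.

```perl
use strict;
use warnings;

# Read "position<TAB>x<TAB>base" lines into a hash keyed by position.
sub read_genotype {
    my ($fh) = @_;
    my %genotype;
    while (my $line = <$fh>) {
        $genotype{$1} = $2 if $line =~ /(\d+)\t\w\t(\w)/;
    }
    return \%genotype;
}

# For each reference position, prefer the genotype value and
# fall back to the reference value when the position is missing.
sub merge_with_ref {
    my ($ref_fh, $genotype, $out_fh) = @_;
    while (my $line = <$ref_fh>) {
        next unless $line =~ /(\d+)\t(\w)/;
        my ($pos, $base) = ($1, $2);
        my $value = exists $genotype->{$pos} ? $genotype->{$pos} : $base;
        print {$out_fh} "$pos\t$value\n";
    }
    return;
}

# Demo with in-memory filehandles standing in for the real files.
my $genotype_text = "245\tA\tG\n300\tC\tT\n";
my $ref_text      = "100\tA\n245\tA\n300\tC\n";
my $merged        = '';

open my $geno_fh, '<', \$genotype_text or die "genotype: $!";
open my $ref_fh,  '<', \$ref_text      or die "ref: $!";
open my $out_fh,  '>', \$merged        or die "out: $!";

merge_with_ref($ref_fh, read_genotype($geno_fh), $out_fh);
close $out_fh;
print $merged;    # 100 keeps the reference base; 245 and 300 take the genotype base
```

For the other 19 files, the same two calls could sit inside a loop that opens one genotype file and one output file per iteration, re-reading (or rewinding) the reference each time. Positions that appear only in the genotype file are silently dropped here, which is the open question raised in the reply.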
Re: Comparing two hashes-help by locked_user sundialsvc4 (Abbot) on Aug 20, 2010 at 20:06 UTC
First of all, you should know that in this forum, you can be sure to be treated respectfully, no matter how “new to Perl” you might or might not be. This is a gathering-place of professionals.

Second... I often find it useful to describe logic like this in terms of a finite-state machine (FSM). The idea here is actually a simple one, despite the obfuscatory-sounding name. It works like this:

You say that the algorithm (“machine”) can be “in” one of several “states” based upon its recent history. (By definition, it always begins its life in some “initial” state.)

The main program is just a loop. Each time through the loop, the program:

1. Chooses the next state to be in, based on the current state and whatever is presented to it at this time.
2. Carries out some action appropriate to that new state.

The program continues in this way until it reaches a “stop” state.

So, to continue my extemporaneous-design exercise a bit farther, the algorithm could start in state `SKIPPING_FOR_IDENTICAL_VALUES` and, when it finds one, switch to `LOOKING_FOR_BLANK_IN_FILE_1`. It would continue until it finally reached a stop-state of `END_OF_BOTH_FILES_REACHED`. (I hope...) you get the idea from this barest of sketches, which by the way is not specific to Perl.
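One way the loop described above might look in Perl, as a rough sketch only. The state names (`SKIP_SAMPLE`, `USE_REF`, `USE_SAMPLE`, `DONE`) and the data are hypothetical and are not the states named in the reply; the sketch also assumes both inputs are already sorted by position. In this small case the transition depends only on the current input, which is the simplest form such a machine can take.

```perl
use strict;
use warnings;

# Hypothetical data: [position, base] pairs, both lists sorted by position.
my @ref    = ([100, 'A'], [245, 'A'], [300, 'C']);
my @sample = ([150, 'T'], [245, 'G'], [300, 'T']);

my ($r, $s) = (0, 0);    # read cursors into @ref and @sample
my @out;

# Choose the next state from what the two cursors currently point at.
my $next_state = sub {
    return 'DONE'    if $r >= @ref;
    return 'USE_REF' if $s >= @sample;
    my $cmp = $sample[$s][0] <=> $ref[$r][0];
    return $cmp < 0 ? 'SKIP_SAMPLE'
         : $cmp > 0 ? 'USE_REF'
         :            'USE_SAMPLE';
};

# One action per state; each action advances at least one cursor.
my %action = (
    SKIP_SAMPLE => sub { $s++ },
    USE_REF     => sub { push @out, join("\t", $ref[$r][0], $ref[$r][1]);    $r++ },
    USE_SAMPLE  => sub { push @out, join("\t", $ref[$r][0], $sample[$s][1]); $r++; $s++ },
);

# Main loop: pick the next state, act on it, stop at DONE.
my $state = 'START';
while (($state = $next_state->()) ne 'DONE') {
    $action{$state}->();
}
print "$_\n" for @out;
```

Keeping the actions in a dispatch table means a new case (for example, reporting sample-only positions instead of skipping them) is one more table entry rather than another branch in the loop.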