sesemin has asked for the wisdom of the Perl Monks concerning the following question:
File 2 has the same layout with more columns: its first two columns hold elements like those in file 1, but it has eleven million lines. File 2 also has a column that I need to apply a condition to. For example, if row 10 of file 1 (now stored in a hash) matches row 5000 of file 2 on the first two columns (now stored in a second hash), I need to look at the value of another column in file 2 and make sure it is greater than 3, for instance. When these conditions are met, I need to write the common elements to a file.

Sample of the two-column data (tab-separated):

    ABCD_231\t231
    ABCD_231\t232
    . . .
    ACDF_400\t1
    ACDF_400\t2
    . . .

Here is my code:
    my %file1=();
    while(<INPUT1>){
        (my $id, my $number) = split("\t", $_);
        if ($id=~ m/^(CLS_S3_Contig[0-9]+)([-]?)([0-9]+)([_]?)([0-9]+)$/i) {
            my $matched_id=$id; # breaks the CLS_Contig1000_200-202 to its components
            for (my $i=$3-8;$i<$5+9;$i++){
                push (@{$file1{$1}}, $i); # This makes a hash of names and each position plus/minus 8 - this is the hash where each key has multiple values.
            }
        } #end for if
        else {
            my $mismatch_id=$id;
            # print "$mismatch_id does not match with CLS_S3_Contigs!\n";
        } # end for else
    } # end for while

    # reading file 2 with several columns. columns 2, 5 make a hash of multiple
    # values and the $mismatch column is the one that needs to meet the condition
    while(<INPUT2>){
        (my $serial_no ,my $contig_id, my $position_with_gap, my $gap, my $position_corrected, my $ATGCN, my $depth, my $consensus, my $mismatch, my $star, my $geno_A, my$geno_B) = split("\t", $_);
        push (@{$file2{$contig_id}}, $position_corrected); # Make a hash of contig ID and base position, one key multiple values
        push (@{$file2_2{$position_corrected}}, $mismatch);
    } #end for while

    # Here we are going to access each element of the hash of hashes
    my $rHoH = foo();
    my %hash_1;
    my( $contig_id, $position_corrected, $mismatch );
    for my $serial_no ( keys %$rHoH ) {
        $contig_id = $rHoH->{ $serial_no }->{ 'contigID' };
        $position_corrected = $rHoH->{ $serial_no}->{ 'position_corrected' };
        $mismatch = $rHoH->{ $serial_no }->{ 'mismatch' };
        # Now we want to know how many of the contigs contain more than three mismatches
        # Hash_1 here
        if ($mismatch >= 3) {
            #$hash_1{$contig_id} = $position_corrected; # this will result in a hash with the name of the contig and only one value per contig
                                                        # that is why it is commented here
            push (@{$hash_1{$contig_id}}, $position_corrected); # This makes a hash with one key with multiple values
            #print RESULTS "$contig_id\t$position_corrected\t$mismatch\n"; # This prints all contigs that have more than 3 mismatches.
        }
    }

    # here is where I messed up for the query. I cannot control this loop.
    # it finds the things but fails to print them only once.
    foreach $1 (sort keys %file1){
        foreach my $position1 (@{$file1{$1}}){
            $found =0;
            foreach $contig_id(sort keys %hash_1){
                foreach my $position (@{$hash_1{$contig_id}}){
                    $found = 1 if $1 =~ /^$contig_id/ && $contig_id=~ /^$1/ && $position1==$position;
                    print RESULTS "$position1\t$1\n" if $found;
                    print "not matched\n" if !$found;
                }
            }
        }
    }

    ##############################################################
    sub foo {
        my ( $serial_no ,$contig_id, $position_with_gap, $gap, $position_corrected, $ATGCN, $depth, $consensus, $mismatch, $star, $geno_A, $geno_B);
        my %HoH = ();
        open(INPUT2,$ARGV[1]) || die "Cannot open file \"$ARGV[1]\""; # MAP file
        while( <INPUT2> ) {
            ( $serial_no ,$contig_id, $position_with_gap, $gap, $position_corrected, $ATGCN, $depth, $consensus, $mismatch, $star, $geno_A, $geno_B) = split("\t", $_);
            $HoH{$serial_no} {'contigID'} = $contig_id;
            $HoH{$serial_no} {'position_corrected'} = $position_corrected;
            $HoH{$serial_no} {'mismatch'} = $mismatch;
        }
        return \%HoH;
    }
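One possible way to get each match printed exactly once, offered as a sketch and not as anything taken from the replies: keep the file 2 positions that pass the mismatch filter in a hash of hashes keyed by contig ID and then by position, and test each expanded file 1 position with a single `exists` lookup instead of four nested loops. The helper names `expand_positions`, `build_lookup`, and `common_positions`, the in-memory sample rows, and the `%seen` de-duplication are assumptions made for illustration. The column positions (contig ID in column 2, corrected position in column 5, mismatch in column 9) follow the `split` in the posted code, and the `>= 3` threshold follows its `if` test.

```perl
use strict;
use warnings;

# Hypothetical helper: expand an ID shaped like the regex in the post
# into (contig, every position from start-$flank to end+$flank).
sub expand_positions {
    my ($id, $flank) = @_;
    my ($contig, $start, $end) =
        $id =~ /^(CLS_S3_Contig[0-9]+)-?([0-9]+)_?([0-9]+)$/i
        or return;    # empty list when the ID does not match
    return ($contig, $start - $flank .. $end + $flank);
}

# Hypothetical helper: build $lookup->{contig}{position} = mismatch,
# keeping only rows whose mismatch value passes the threshold.
sub build_lookup {
    my ($rows, $min_mismatch) = @_;
    my %lookup;
    for my $line (@$rows) {
        chomp(my $copy = $line);
        my @f = split /\t/, $copy;
        my ($contig, $pos, $mismatch) = @f[1, 4, 8];
        next unless defined $mismatch && $mismatch >= $min_mismatch;
        $lookup{$contig}{$pos} = $mismatch;
    }
    return \%lookup;
}

# Hypothetical helper: return "position\tcontig" once per common pair.
sub common_positions {
    my ($ids, $lookup, $flank) = @_;
    my (%seen, @out);
    for my $id (@$ids) {
        my ($contig, @positions) = expand_positions($id, $flank) or next;
        for my $pos (@positions) {
            # Check the contig first so the lookup hash is not autovivified.
            next unless exists $lookup->{$contig}
                     && exists $lookup->{$contig}{$pos};
            next if $seen{"$contig\t$pos"}++;    # skip pairs already reported
            push @out, "$pos\t$contig";
        }
    }
    return @out;
}

# Made-up sample data standing in for the two input files.
my @file1_ids  = ('CLS_S3_Contig1000-200_202');
my @file2_rows = (
    join("\t", 1, 'CLS_S3_Contig1000', 195, 0, 195, 'A', 12, 'A', 4, '*', 'A', 'A'),
    join("\t", 2, 'CLS_S3_Contig1000', 201, 0, 201, 'C', 15, 'C', 1, '*', 'C', 'C'),
    join("\t", 3, 'CLS_S3_Contig1000', 300, 0, 300, 'G',  9, 'G', 5, '*', 'G', 'G'),
);

my $lookup = build_lookup(\@file2_rows, 3);
print "$_\n" for common_positions(\@file1_ids, $lookup, 8);
```

With this sample only position 195 is reported: position 201 fails the mismatch filter and position 300 lies outside the expanded range 192 to 210. For the real eleven-million-line file, the same filter could be applied while reading line by line from the filehandle, so that only rows passing the threshold are held in memory.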
Replies are listed 'Best First'.

Re: Hash - Compare - Multiple value keys
by RMGir (Prior) on Sep 07, 2008 at 11:57 UTC
by sesemin (Beadle) on Sep 07, 2008 at 17:26 UTC
by RMGir (Prior) on Sep 07, 2008 at 18:37 UTC
Re: Hash - Compare - Multiple value keys
by RMGir (Prior) on Sep 07, 2008 at 11:02 UTC