sesemin has asked for the wisdom of the Perl Monks concerning the following question:

I have two tab-separated files, one with two columns and the other with 10 columns. In file 1, column one is a name that repeats for every increment in column two:

    ABCD_231\t231
    ABCD_231\t232
    . . . .
    ACDF_400\t1
    ACDF_400\t2
    . . .

File 2 has the same layout with more columns: its first two columns hold elements like those in file 1, but it has eleven million lines. File 2 also has a column that I need to set a condition on. For example, if row 10 of file 1 (which is now in a hash) matches row 5000 of file 2 (whose first two columns are now in a second hash), I need to look at the value of another column in file 2 and make sure it is greater than 3, for instance. When these conditions are met, I need to write the common elements to a file. Here is my code:

    my %file1=();
    while(<INPUT1>){
        (my $id, my $number) = split("\t", $_);
        if ($id=~ m/^(CLS_S3_Contig[0-9]+)([-]?)([0-9]+)([_]?)([0-9]+)$/i) {
            my $matched_id=$id; # breaks the CLS_Contig1000_200-202 to its components
            for (my $i=$3-8;$i<$5+9;$i++){
                push (@{$file1{$1}}, $i); # This makes a hash of names and each position plus/minus 8 - this is the hash with each key have multiple values.
            }
        }#end for if
        else {my $mismatch_id=$id;
            # print "$mismatch_id does not match with CLS_S3_Contigs!\n";
        } # end for else
    } # end for while

    #reading the file 2 with several columns. columns 2, 5 make a hash of multiple value and $mismatch column is the one that need to meet the condition
    while(<INPUT2>){
        (my $serial_no ,my $contig_id, my $position_with_gap, my $gap,
         my $position_corrected, my $ATGCN, my $depth, my $consensus,
         my $mismatch, my $star, my $geno_A, my$geno_B) = split("\t", $_);
        push (@{$file2{$contig_id}}, $position_corrected); #Make a hash of contig ID and base position one key multiple values
        push (@{$file2_2{$position_corrected}}, $mismatch);
    } #end for while

    # Here we are going to have access each element of hash of hash
    my $rHoH = foo();my %hash_1;
    my( $contig_id, $position_corrected, $mismatch );
    for my $serial_no ( keys %$rHoH ) {
        $contig_id = $rHoH->{ $serial_no }->{ 'contigID' };
        $position_corrected = $rHoH->{ $serial_no}->{ 'position_corrected' };
        $mismatch = $rHoH->{ $serial_no }->{ 'mismatch' };
        # Now we want to know how many of contigs contain more than three mismatch
        #Hash_1 here
        if ($mismatch >= 3) {
            #$hash_1{$contig_id} = $position_corrected; # this will result a hash with name of contig and only one value per contig
            # that is why it is commented here
            push (@{$hash_1{$contig_id}}, $position_corrected); # This make a hash with one key with multiple values
            #print RESULTS "$contig_id\t$position_corrected\t$mismatch\n"; # This prints all contigs that have more than 3 mismatch.
        }
    }

    # here is where I messed up for the query. I cannot control this loop.
    # it finds the things but fails to print them only once.
    foreach $1 (sort keys %file1){
        foreach my $position1 (@{$file1{$1}}){
            $found =0;
            foreach $contig_id(sort keys %hash_1){
                foreach my $position (@{$hash_1{$contig_id}}){
                    $found = 1 if $1 =~ /^$contig_id/ && $contig_id=~ /^$1/ && $position1==$position;
                    print RESULTS "$position1\t$1\n" if $found;
                    print "not matched\n" if !$found;
                }
            }
        }
    }
    ##############################################################
    sub foo {
        my ( $serial_no ,$contig_id, $position_with_gap, $gap,
             $position_corrected, $ATGCN, $depth, $consensus,
             $mismatch, $star, $geno_A, $geno_B);
        my %HoH = ();
        open(INPUT2,$ARGV[1]) || die "Cannot open file \"$ARGV[1]\""; # MAP file
        while( <INPUT2> ) {
            ( $serial_no ,$contig_id, $position_with_gap, $gap,
              $position_corrected, $ATGCN, $depth, $consensus,
              $mismatch, $star, $geno_A, $geno_B) = split("\t", $_);
            $HoH{$serial_no} {'contigID'} = $contig_id;
            $HoH{$serial_no} {'position_corrected'} = $position_corrected;
            $HoH{$serial_no} {'mismatch'} = $mismatch;
        }
        return \%HoH;
    }

Replies are listed 'Best First'.
Re: Hash - Compare - Multiple value keys
by RMGir (Prior) on Sep 07, 2008 at 11:57 UTC
    From your comments, it sounds like you're unhappy with this loop?

        # here is where I messed up for the query. I cannot control this loop.
        # it finds the things but fails to print them only once.
        foreach $1 (sort keys %file1){
            foreach my $position1 (@{$file1{$1}}){
                $found =0;
                foreach $contig_id(sort keys %hash_1){
                    foreach my $position (@{$hash_1{$contig_id}}){
                        $found = 1 if $1 =~ /^$contig_id/ && $contig_id=~ /^$1/ && $position1==$position;
                        print RESULTS "$position1\t$1\n" if $found;
                        print "not matched\n" if !$found;
                    }
                }
            }
        }

    What I can't understand from either your description OR your code is what you intend to happen here.

    As it's written, $found gets set to 0 only in the middle foreach loop, so if it's ever set to 1 in the inner loop, it'll remain 1 for all the rest of the inner loop iterations. Is that what you want?
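
    A minimal sketch of that sticky-flag behaviour, using made-up positions rather than the poster's data. The sub name `sticky_matches` and its arguments are hypothetical; the point is only that a flag reset outside the inner loop keeps firing after the first hit.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: returns every candidate that would be "printed"
    # when the flag is reset once, before the inner loop starts.
    sub sticky_matches {
        my ($wanted, @candidates) = @_;
        my @printed;
        my $found = 0;                            # reset only here, not per candidate
        for my $candidate (@candidates) {
            $found = 1 if $candidate == $wanted;  # set on the first real match
            push @printed, $candidate if $found;  # also true for every later candidate
        }
        return @printed;
    }

    # Only 349 really matches, but 898 and 524 come out as well.
    print join(",", sticky_matches(349, 100, 349, 898, 524)), "\n";   # prints 349,898,524
    ```

    Resetting the flag inside the inner loop, or testing the condition directly, would report only the real match.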

    Try to explain clearly what you want this loop to do, both to yourself and to us. Odds are that explaining it to yourself will be enough for you to figure out a better way to code it, but if it isn't, we're here to help...


    Mike
      Hi Mike, thanks for the comment. In the inner loop, after the condition is met, I want to print the result only once. Then the loop should continue checking the other records to see if it can find another match. Here is a partial output of this code. I just need to print the first line, where columns 1 and 4 pair and columns 2 and 3 pair; this is where the condition is met. Then it should find the next match instead of printing ~4000 junk comparisons.

      CLS_S3_Contig10021 349 349 CLS_S3_Contig10021
      CLS_S3_Contig10021 898 349 CLS_S3_Contig10021
      CLS_S3_Contig10021 524 349 CLS_S3_Contig10021
      CLS_S3_Contig10021 365 349 CLS_S3_Contig10021
      CLS_S3_Contig10021 373 349 CLS_S3_Contig10021
      CLS_S3_Contig10029 857 349 CLS_S3_Contig10021
      CLS_S3_Contig10029 482 349 CLS_S3_Contig10021
      CLS_S3_Contig10029 676 349 CLS_S3_Contig10021
      CLS_S3_Contig10029 153 349 CLS_S3_Contig10021
      CLS_S3_Contig10031 797 349 CLS_S3_Contig10021
      CLS_S3_Contig10031 587 349 CLS_S3_Contig10021
      CLS_S3_Contig10031 227 349 CLS_S3_Contig10021
      CLS_S3_Contig10031 314 349 CLS_S3_Contig10021
      CLS_S3_Contig10031 605 349 CLS_S3_Contig10021
      CLS_S3_Contig10031 257 349 CLS_S3_Contig10021
      CLS_S3_Contig10031 212 349 CLS_S3_Contig10021
      CLS_S3_Contig10031 857 349 CLS_S3_Contig10021
      CLS_S3_Contig10031 635 349 CLS_S3_Contig10021
      CLS_S3_Contig10031 188 349 CLS_S3_Contig10021
      CLS_S3_Contig10031 410 349 CLS_S3_Contig10021
      CLS_S3_Contig10031 806 349 CLS_S3_Contig10021
      CLS_S3_Contig10040 439 349 CLS_S3_Contig10021
      CLS_S3_Contig10051 719 349 CLS_S3_Contig10021
      CLS_S3_Contig10051 92 349 CLS_S3_Contig10021
      CLS_S3_Contig1006 279 349 CLS_S3_Contig10021
      CLS_S3_Contig1006 240 349 CLS_S3_Contig10021
      CLS_S3_Contig1006 168 349 CLS_S3_Contig10021
      CLS_S3_Contig10070 196 349 CLS_S3_Contig10021
      CLS_S3_Contig10072 882 349 CLS_S3_Contig10021
      CLS_S3_Contig10072 685 349 CLS_S3_Contig10021
      CLS_S3_Contig10072 892 349 CLS_S3_Contig10021
      CLS_S3_Contig10072 237 349 CLS_S3_Contig10021
      CLS_S3_Contig10072 858 349 CLS_S3_Contig10021
      CLS_S3_Contig10083 868 349 CLS_S3_Contig10021
      CLS_S3_Contig1010 774 349 CLS_S3_Contig10021
      CLS_S3_Contig1010 613 349 CLS_S3_Contig10021
      CLS_S3_Contig10134 452 349 CLS_S3_Contig10021
      CLS_S3_Contig10157 545 349 CLS_S3_Contig10021
      CLS_S3_Contig10157 500 349 CLS_S3_Contig10021
      CLS_S3_Contig10157 875 349 CLS_S3_Contig10021
      CLS_S3_Contig10157 404 349 CLS_S3_Contig10021

        So what's the purpose of $found?

        If all you need is to find out if this line matches criteria and do something based on that, that's pretty well what "if" statements are for...

            foreach $1 (sort keys %file1){
                foreach my $position1 (@{$file1{$1}}){
                    foreach $contig_id(sort keys %hash_1){
                        foreach my $position (@{$hash_1{$contig_id}}){
                            if($1 =~ /^$contig_id/ && $contig_id=~ /^$1/ && $position1==$position) {
                                print RESULTS "$position1\t$1\n";
                            } else {
                                print "not matched\n";
                            }
                        }
                    }
                }
            }
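
        The same test can also be sketched without the two inner loops. This is only an illustration under stated assumptions: contig names contain no regex metacharacters, so the two anchored pattern matches amount to a plain equality check on the name, and the sub name `common_positions`, its arguments, and the sample data are hypothetical stand-ins for %file1 and %hash_1.

        ```perl
        use strict;
        use warnings;

        # Hypothetical helper: takes references to two hashes of arrays
        # (contig name => list of positions) and returns one "position\tcontig"
        # row for each pair present in both.
        sub common_positions {
            my ($file1, $hash_1) = @_;
            my @rows;
            for my $contig (sort keys %$file1) {
                next unless exists $hash_1->{$contig};       # contig must be in both hashes
                my %in_file2 = map { $_ => 1 } @{ $hash_1->{$contig} };
                my %seen;
                for my $position (@{ $file1->{$contig} }) {
                    next unless $in_file2{$position};        # position must be in both lists
                    next if $seen{$position}++;              # report each pair only once
                    push @rows, "$position\t$contig";
                }
            }
            return @rows;
        }

        # Made-up sample data in the shape used in the thread.
        my %file1  = (CLS_S3_Contig10021 => [341 .. 357], CLS_S3_Contig10029 => [1 .. 5]);
        my %hash_1 = (CLS_S3_Contig10021 => [349, 898, 524], CLS_S3_Contig10031 => [797]);
        print "$_\n" for common_positions(\%file1, \%hash_1);
        ```

        With this sample data, the only row produced is position 349 on CLS_S3_Contig10021. Looking the contig up by key avoids comparing every position in one hash against every position in the other, which is where the thousands of non-matching lines come from.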

        Mike
Re: Hash - Compare - Multiple value keys
by RMGir (Prior) on Sep 07, 2008 at 11:02 UTC
    I have to go run an errand, so I don't have time to analyze your code.

    But to make things easier for others, I've re-indented it.

    Once I get back, if no one else has figured this out, I'll take a look during the F1 race.

        my %file1=();
        while(<INPUT1>){
            (my $id, my $number) = split("\t", $_);
            if ($id=~ m/^(CLS_S3_Contig[0-9]+)([-]?)([0-9]+)([_]?)([0-9]+)$/i) {
                my $matched_id=$id; # breaks the CLS_Contig1000_200-202 to its components
                for (my $i=$3-8;$i<$5+9;$i++){
                    push (@{$file1{$1}}, $i); # This makes a hash of names and each position
                                              # plus/minus 8 - this is the hash with each key
                                              # have multiple values.
                }
            }#end for if
            else {my $mismatch_id=$id;
                # print "$mismatch_id does not match with CLS_S3_Contigs!\n";
            } # end for else
        } # end for while

        #reading the file 2 with several columns. columns 2, 5 make a hash of multiple
        #value and $mismatch column is the one that need to meet the condition
        while(<INPUT2>){
            my ($serial_no ,$contig_id, $position_with_gap, $gap,
                $position_corrected, $ATGCN, $depth, $consensus,
                $mismatch, $star, $geno_A, $geno_B) = split("\t", $_);
            push (@{$file2{$contig_id}}, $position_corrected); #Make a hash of contig ID
                                                               #and base position one key
                                                               #multiple values
            push (@{$file2_2{$position_corrected}}, $mismatch);
        } #end for while

        # Here we are going to have access each element of hash of hash
        my $rHoH = foo();my %hash_1;
        my( $contig_id, $position_corrected, $mismatch );
        for my $serial_no ( keys %$rHoH ) {
            $contig_id = $rHoH->{ $serial_no }->{ 'contigID' };
            $position_corrected = $rHoH->{ $serial_no}->{ 'position_corrected' };
            $mismatch = $rHoH->{ $serial_no }->{ 'mismatch' };
            # Now we want to know how many of contigs contain more than three mismatch
            #Hash_1 here
            if ($mismatch >= 3) {
                #$hash_1{$contig_id} = $position_corrected;
                ## this will result a hash with name of contig and only one value per contig
                # that is why it is commented here
                push (@{$hash_1{$contig_id}}, $position_corrected); # This make a hash
                                                                    # with one key with
                                                                    # multiple values
                #print RESULTS "$contig_id\t$position_corrected\t$mismatch\n"; # This prints all contigs that have more than 3 mismatch.
            }
        }

        # here is where I messed up for the query. I cannot control this loop.
        # it finds the things but fails to print them only once.
        foreach $1 (sort keys %file1){
            foreach my $position1 (@{$file1{$1}}){
                $found =0;
                foreach $contig_id(sort keys %hash_1){
                    foreach my $position (@{$hash_1{$contig_id}}){
                        $found = 1 if $1 =~ /^$contig_id/ && $contig_id=~ /^$1/ && $position1==$position;
                        print RESULTS "$position1\t$1\n" if $found;
                        print "not matched\n" if !$found;
                    }
                }
            }
        }
        ##############################################################
        sub foo {
            my ( $serial_no ,$contig_id, $position_with_gap, $gap,
                 $position_corrected, $ATGCN, $depth, $consensus,
                 $mismatch, $star, $geno_A, $geno_B);
            my %HoH = ();
            open(INPUT2,$ARGV[1]) || die "Cannot open file \"$ARGV[1]\""; # MAP file
            while( <INPUT2> ) {
                ( $serial_no ,$contig_id, $position_with_gap, $gap,
                  $position_corrected, $ATGCN, $depth, $consensus,
                  $mismatch, $star, $geno_A, $geno_B) = split("\t", $_);
                $HoH{$serial_no} {'contigID'} = $contig_id;
                $HoH{$serial_no} {'position_corrected'} = $position_corrected;
                $HoH{$serial_no} {'mismatch'} = $mismatch;
            }
            return \%HoH;
        }

    Mike