BhariD has asked for the wisdom of the Perl Monks concerning the following question:

I have the following input file. The input file include pairs of two id's separated by tab. Each line with a pair is a combination and has relevance.

DATA

YP_01 NP_02

NP_02 YP_01

YP_01 NP_03

NP_03 YP_01

NP_02 NP_03

NP_03 NP_02

NP_04 NP_05

.....

Here is the code which I am using to get the reciprocal best hit pair from this input file. For example, in above input file, YP_01\tNP_02 and NP_02\tYP_01, is reciprocal best hit pair. And NP_04\tNP_05 is not reciprocal hit pair as it lacks the reciprocal combination in the file i.e. NP_05\tNP_04.

open (DATA, $results) || die "couldn't open the file!"; my %unique = (); while( <DATA> ){ chomp; $unique{ join( "\t", sort split /\t/ ) } ++; } my @pairs = (); my @not_pairs = (); for my $item (sort keys %unique ){ if( $unique{$item} > 1 ){ push @pairs, $item; }else{ push @not_pairs, $item; } } print join( "\n", @pairs ), "\n";

This gives the following result file

YP_01 NP_02

YP_01 NP_03

NP_02 NP_03

The problem is now I need to modify the code such that for above result file it produces the following result file instead

YP_01 NP_02

YP_01 NP_03

3

The idea is if YP_01 is paired with NP_02, and NP_03 and if NP_02 is paired with NP_03, then I do not need the pair NP_02 and NP_03 printed in the output file as their info already there in combination with YP_01. But I do need to know whether that pair NP_02 and NP_03 existed or not and that is why I need towards the end a #. As shown above I have "3" here in the end representing that the three best hit pairs are present (YP_01 and NP_02; YP_01 and NP_03; NP_02 and NP_03). If NP_02 and NP_03 was not present, then the number should change to 2 showing that out of 3 only 2 comparison are there.

could anyone help me with this. please let me know if you need further clarification

  • Comment on help needed in modifying the code for counting possible combinations
  • Download Code

Replies are listed 'Best First'.
Re: help needed in modifying the code for counting possible combinations
by GrandFather (Saint) on Oct 30, 2009 at 22:34 UTC

    use strict; use warnings; my %pairs; while (<DATA>) { chomp; next if ! length; my @pair = sort split; ++$pairs{$pair[0]}{$pair[1]}; } my $totalHits = 0; my %seconds; for my $first (sort keys %pairs) { for my $second (sort keys %{$pairs{$first}}) { next if $pairs{$first}{$second} <= 1; ++$totalHits; my $previous = grep {exists $pairs{$_}{$first} && exists $pairs{$_}{$seco +nd}} keys %{$seconds{$first}}; ++$seconds{$second}{$first}; next if $previous; print "$first $second\n"; } } print "$totalHits\n"; __DATA__ NP_01 NP_02 NP_02 NP_01 NP_01 NP_03 NP_03 NP_01 NP_02 NP_03 NP_03 NP_02 NP_04 NP_05

    Prints:

    NP_01 NP_02 NP_01 NP_03 3

    Note that I changed YP_01 to NP_01 to avoid inconsistencies between your reported results and the actual results.


    True laziness is hard work

      This is awesome! Thank you so much GrandFather. I tried with the following data:

      __DATA__ NP_01 NP_02 NP_02 NP_01 NP_01 NP_03 NP_03 NP_01 NP_02 NP_03 NP_03 NP_02 NP_04 NP_05 NP_06 NP_07 NP_07 NP_06 This prints: NP_01 NP_02 NP_01 NP_03 NP_06 NP_07 4

      I want the output to be like this instead:

      NP_01 NP_02 NP_01 NP_03 3 NP_06 NP_07 1

      3 for the presence of all three NP_01-NP_02, NP_01-NP_03, NP_02-NP_03 possible pairs in the file. Lets say, if NP_02-NP_03 combination was not present in the file then the number should become 2 showing that NP_02-NP_03 combo does not exist in the file. Any suggestion how can I get this from your code.

      Example in case when NP_02-NP_03 reciprocal pair does not exist in the file and the required output

      __DATA__ NP_01 NP_02 NP_02 NP_01 NP_01 NP_03 NP_03 NP_01 NP_04 NP_05 NP_06 NP_07 NP_07 NP_06 prints: NP_01 NP_02 NP_01 NP_03 2 [not present in NP_02 NP_03] NP_06 NP_07 1

        That makes it more interesting. If you discover any more interesting cases however you'd better tell us what the application actually is and give us the bigger picture.

        use strict; use warnings; my %pairs; while (<DATA>) { chomp; next if ! length; my @pair = sort split; ++$pairs{$pair[0]}{$pair[1]}; } for my $first (sort keys %pairs) { my @hits; my $count = 0; my @implied; for my $second (keys %{$pairs{$first}}) { next if $pairs{$first}{$second} < 2; push @implied, $second; ++$count; if (exists $pairs{$second}{$first}) { delete $pairs{$second}{$first}; ++$count; } print "$first $second\n"; } next if ! $count; @implied = sort @implied; if (@implied == 2 && $pairs{$implied[0]}{$implied[1]}) { ++$count; delete $pairs{$implied[0]}{$implied[1]}; @implied = (); } print "$count"; print " [not present in @implied]" if @implied == 2; print "\n\n" } __DATA__ AP_01 AP_02 AP_02 AP_01 AP_01 AP_03 AP_03 AP_01 NP_01 NP_02 NP_02 NP_01 NP_01 NP_03 NP_03 NP_01 NP_02 NP_03 NP_03 NP_02 NP_04 NP_05 NP_06 NP_07 NP_07 NP_06

        Prints:

        AP_01 AP_02 AP_01 AP_03 2 [not present in AP_02 AP_03] NP_01 NP_03 NP_01 NP_02 3 NP_06 NP_07 1

        I added the AP_dd set to make it clearer which groups were which and to provide data to demonstrate the third case.


        True laziness is hard work
Re: help needed in modifying the code for counting possible combinations
by gman (Friar) on Oct 30, 2009 at 20:16 UTC

    ok, maybe way off here but I'll give it a try

    #!/usr/bin/perl # use Data::Dumper; my $datafile = "test.dat"; my %finalQ = (); open(FH, "<$datafile") || die ; my %combined = (); while( <FH> ){ chomp; my ($key,$value) = sort (split(/\t/,$_)); $combined{$key}=$value; } my %count = (); my %reverse = reverse(%combined); foreach my $key (sort %combined) { print "$key => $combined{$key}\n"; if ( exists $reverse{$key} ) { $count{"$key\t$reverse{$key}"}+=2; print "$key => $reverse{$key} :: match\n"; } else { $count{"$key\t$combined{$key}"}++; } } print Dumper(%count);

    output

    $VAR1 = '       ';
    $VAR2 = 4;
    $VAR3 = 'NP_05  NP_04';
    $VAR4 = 2;
    $VAR5 = 'YP_01  NP_03';
    $VAR6 = 2;
    $VAR7 = 'NP_03  NP_02';
    $VAR8 = 4;
    $VAR9 = 'NP_02  NP_03';
    $VAR10 = 1;
    $VAR11 = 'NP_04 NP_05';
    $VAR12 = 1;
    
Re: help needed in modifying the code for counting possible combinations
by Anonymous Monk on Oct 30, 2009 at 19:15 UTC
    If you have four related points, what do you do?
    A B B C C D D A B D
    How do you tell if that last one was B-D or A-C if you write a 5 in its place?