in reply to help needed in modifying the code for counting possible combinations

use strict;
use warnings;

my %pairs;

while (<DATA>) {
    chomp;
    next if ! length;

    my @pair = sort split;
    ++$pairs{$pair[0]}{$pair[1]};
}

my $totalHits = 0;
my %seconds;

for my $first (sort keys %pairs) {
    for my $second (sort keys %{$pairs{$first}}) {
        next if $pairs{$first}{$second} <= 1;

        ++$totalHits;
        my $previous =
            grep {exists $pairs{$_}{$first} && exists $pairs{$_}{$second}}
            keys %{$seconds{$first}};

        ++$seconds{$second}{$first};
        next if $previous;
        print "$first $second\n";
    }
}

print "$totalHits\n";

__DATA__
NP_01 NP_02
NP_02 NP_01
NP_01 NP_03
NP_03 NP_01
NP_02 NP_03
NP_03 NP_02
NP_04 NP_05

Prints:

NP_01 NP_02
NP_01 NP_03
3

Note that I changed YP_01 to NP_01 to avoid inconsistencies between your reported results and the actual results.


True laziness is hard work

Re^2: help needed in modifying the code for counting possible combinations
by BhariD (Sexton) on Oct 31, 2009 at 14:40 UTC

    This is awesome! Thank you so much GrandFather. I tried with the following data:

    __DATA__
    NP_01 NP_02
    NP_02 NP_01
    NP_01 NP_03
    NP_03 NP_01
    NP_02 NP_03
    NP_03 NP_02
    NP_04 NP_05
    NP_06 NP_07
    NP_07 NP_06

    This prints:

    NP_01 NP_02
    NP_01 NP_03
    NP_06 NP_07
    4

    I want the output to be like this instead:

    NP_01 NP_02
    NP_01 NP_03
    3

    NP_06 NP_07
    1

    The 3 reflects that all three possible pairs (NP_01-NP_02, NP_01-NP_03, NP_02-NP_03) are present in the file. If, say, the NP_02-NP_03 combination were not present in the file, the number should become 2, showing that the NP_02-NP_03 combination does not exist in the file. Any suggestions on how I can get this from your code?

    Here is an example where the NP_02-NP_03 reciprocal pair does not exist in the file, along with the required output:

    __DATA__
    NP_01 NP_02
    NP_02 NP_01
    NP_01 NP_03
    NP_03 NP_01
    NP_04 NP_05
    NP_06 NP_07
    NP_07 NP_06

    prints:

    NP_01 NP_02
    NP_01 NP_03
    2 [not present in NP_02 NP_03]

    NP_06 NP_07
    1

      That makes it more interesting. If you discover any more interesting cases, however, you'd better tell us what the application actually is and give us the bigger picture.

      use strict;
      use warnings;

      my %pairs;

      while (<DATA>) {
          chomp;
          next if ! length;

          my @pair = sort split;
          ++$pairs{$pair[0]}{$pair[1]};
      }

      for my $first (sort keys %pairs) {
          my @hits;
          my $count = 0;
          my @implied;

          for my $second (keys %{$pairs{$first}}) {
              next if $pairs{$first}{$second} < 2;

              push @implied, $second;
              ++$count;

              if (exists $pairs{$second}{$first}) {
                  delete $pairs{$second}{$first};
                  ++$count;
              }

              print "$first $second\n";
          }

          next if ! $count;

          @implied = sort @implied;

          if (@implied == 2 && $pairs{$implied[0]}{$implied[1]}) {
              ++$count;
              delete $pairs{$implied[0]}{$implied[1]};
              @implied = ();
          }

          print "$count";
          print " [not present in @implied]" if @implied == 2;
          print "\n\n";
      }

      __DATA__
      AP_01 AP_02
      AP_02 AP_01
      AP_01 AP_03
      AP_03 AP_01
      NP_01 NP_02
      NP_02 NP_01
      NP_01 NP_03
      NP_03 NP_01
      NP_02 NP_03
      NP_03 NP_02
      NP_04 NP_05
      NP_06 NP_07
      NP_07 NP_06

      Prints:

      AP_01 AP_02
      AP_01 AP_03
      2 [not present in AP_02 AP_03]

      NP_01 NP_03
      NP_01 NP_02
      3

      NP_06 NP_07
      1

      I added the AP_dd set to make it clearer which groups were which and to provide data to demonstrate the third case.


      True laziness is hard work

        Actually, the idea behind parsing this data file comes from biology: the goal is to extract "orthologous sequences". Orthologous sequences are sequences that belong to different species and have a common homologue in the common ancestor of both species. For example, a pair with sequence ids NP_01-NP_02 and NP_02-NP_01 is a stable pair of orthologous sequences in the two species 01 and 02.
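        The "stable pair" rule described above (a hit reported in both directions) can be sketched on its own. This is only a minimal illustration, assuming each input line holds two whitespace-separated ids; the sub name reciprocal_pairs is made up for this example. Unlike counting sorted pairs, it does not treat a line repeated twice in the same direction as a reciprocal hit.

```perl
use strict;
use warnings;

# Hypothetical helper: return the sorted list of "stable" pairs, i.e. pairs
# that were reported in both directions (A B and B A).
# Assumption: each input line is "ID ID", separated by whitespace.
sub reciprocal_pairs {
    my @lines = @_;
    my %seen;    # $seen{$from}{$to} = 1 for each directed hit

    for my $line (@lines) {
        my ($from, $to) = split ' ', $line;
        next if !defined $to;    # skip blank or malformed lines
        $seen{$from}{$to} = 1;
    }

    my @stable;
    for my $from (sort keys %seen) {
        for my $to (sort keys %{$seen{$from}}) {
            # Report each pair once, from its lexically smaller member
            next if $from ge $to;
            push @stable, "$from $to"
                if exists $seen{$to} && $seen{$to}{$from};
        }
    }
    return @stable;
}

# NP_04 NP_05 appears twice, but only in one direction, so it is not stable
my @sample = ('NP_01 NP_02', 'NP_02 NP_01', 'NP_04 NP_05', 'NP_04 NP_05');
print "$_\n" for reciprocal_pairs(@sample);
```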

        With pairwise comparisons among three species, say NP_01, NP_02, and NP_03, there are 3 possible stable pairs of orthologous sequences. This is why I needed to know whether all of them are present: if not, the group cannot be defined as an "ortholog set", that is, an orthologous sequence present in all three species.

        I really appreciate your help. But nothing is ever easy: as I was going through my data file, I found that there are pairwise comparisons between more than three species, sometimes as many as 13 (see the sample below), and that makes it even more complicated.

        Added section of the data file:

        NP_08 NP_09
        NP_08 NP_10
        NP_08 NP_11
        NP_08 NP_12
        NP_08 NP_13
        NP_09 NP_10
        NP_09 NP_11
        NP_09 NP_13
        NP_12 NP_13

        I did not show the reciprocals of these pairs (they do exist), just for convenience.

        Thank you again!
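        One possible way to handle groups of more than three ids is sketched here. It is not the code from the replies above, and it rests on two assumptions that may not match the real data: every input line is an already confirmed stable pair listed once, and a "group" is any set of ids connected through such pairs. The sub name group_report is made up for this example. For each group of n ids it compares the pairs present against the n*(n-1)/2 possible pairs and lists the missing ones.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration only.
# Assumption: each input line is one stable (reciprocal) pair, listed once.
# Assumption: a "group" is a connected component of the pair graph.
sub group_report {
    my @lines = @_;
    my %edge;    # $edge{$x}{$y} and $edge{$y}{$x} are set for every pair

    for my $line (@lines) {
        my ($x, $y) = split ' ', $line;
        next if !defined $y;    # skip blank or malformed lines
        $edge{$x}{$y} = $edge{$y}{$x} = 1;
    }

    my (%visited, @report);
    for my $start (sort keys %edge) {
        next if $visited{$start}++;

        # Breadth-first walk to collect every id reachable from $start
        my @queue = ($start);
        my @members;
        while (defined(my $id = shift @queue)) {
            push @members, $id;
            push @queue, grep { !$visited{$_}++ } sort keys %{$edge{$id}};
        }
        @members = sort @members;

        # Check each of the n*(n-1)/2 possible pairs within the group
        my (@present, @missing);
        for my $i (0 .. $#members - 1) {
            for my $j ($i + 1 .. $#members) {
                my ($x, $y) = @members[$i, $j];
                if ($edge{$x}{$y}) { push @present, "$x $y" }
                else               { push @missing, "$x $y" }
            }
        }

        push @report, {
            members => \@members,
            present => \@present,
            missing => \@missing,
        };
    }
    return @report;
}

# The nine pairs from the sample section of the data file
my @sample = (
    'NP_08 NP_09', 'NP_08 NP_10', 'NP_08 NP_11', 'NP_08 NP_12', 'NP_08 NP_13',
    'NP_09 NP_10', 'NP_09 NP_11', 'NP_09 NP_13', 'NP_12 NP_13',
);

for my $group (group_report(@sample)) {
    my $n = @{$group->{members}};
    printf "%d ids, %d of %d possible pairs present\n",
        $n, scalar @{$group->{present}}, $n * ($n - 1) / 2;
    print "  missing: $_\n" for @{$group->{missing}};
}
```

        For the nine sample pairs this finds a single group of six ids with 9 of the 15 possible pairs present, so under these assumptions it would not qualify as a complete ortholog set. Whether a connected component is the right definition of a group for real ortholog data is an open question.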