in reply to Re^2: help needed in modifying the code for counting possible combinations
in thread help needed in modifying the code for counting possible combinations

That makes it more interesting. If you discover any more interesting cases, however, you'd better tell us what the application actually is and give us the bigger picture.

use strict;
use warnings;

my %pairs;

# Count each pair regardless of direction: "A B" and "B A" share one key.
while (<DATA>) {
    chomp;
    next if ! length;
    my @pair = sort split;
    ++$pairs{$pair[0]}{$pair[1]};
}

for my $first (sort keys %pairs) {
    my @hits;
    my $count = 0;
    my @implied;

    for my $second (keys %{$pairs{$first}}) {
        # Only pairs seen in both directions count.
        next if $pairs{$first}{$second} < 2;
        push @implied, $second;
        ++$count;
        if (exists $pairs{$second}{$first}) {
            delete $pairs{$second}{$first};
            ++$count;
        }
        print "$first $second\n";
    }

    next if ! $count;
    @implied = sort @implied;

    # Two partners of $first imply a third pair between those partners.
    if (@implied == 2 && $pairs{$implied[0]}{$implied[1]}) {
        ++$count;
        delete $pairs{$implied[0]}{$implied[1]};
        @implied = ();
    }

    print "$count";
    print " [not present in @implied]" if @implied == 2;
    print "\n\n"
}

__DATA__
AP_01 AP_02
AP_02 AP_01
AP_01 AP_03
AP_03 AP_01
NP_01 NP_02
NP_02 NP_01
NP_01 NP_03
NP_03 NP_01
NP_02 NP_03
NP_03 NP_02
NP_04 NP_05
NP_06 NP_07
NP_07 NP_06

Prints:

AP_01 AP_02
AP_01 AP_03
2 [not present in AP_02 AP_03]

NP_01 NP_03
NP_01 NP_02
3

NP_06 NP_07
1

I added the AP_dd set to make it clearer which groups were which and to provide data to demonstrate the third case.


True laziness is hard work

Re^4: help needed in modifying the code for counting possible combinations
by BhariD (Sexton) on Nov 01, 2009 at 01:36 UTC

    Actually, the idea behind parsing this data file comes from biology: the goal is to extract "orthologous sequences". Orthologous sequences are sequences that belong to different species and share a common homologue in the common ancestor of both species. For example, the pair of sequence IDs NP_01-NP_02 together with its reciprocal NP_02-NP_01 is a stable pair of orthologous sequences in the two species 01 and 02.
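A minimal sketch of the "stable pair" idea described above, assuming each input line holds two whitespace-separated sequence IDs and that a pair counts as stable only when both directions appear. The sub name `stable_pairs` is hypothetical and is not part of the code posted earlier in the thread.

```perl
use strict;
use warnings;

# Return the pairs that occur in both directions (A B and B A).
# Assumption: each element of @$lines looks like "ID1 ID2".
sub stable_pairs {
    my ($lines) = @_;
    my %seen;    # "A B" => 1 for every directed pair in the input
    for my $line (@$lines) {
        my ($from, $to) = split ' ', $line;
        next if !defined $to;    # skip blank or malformed lines
        $seen{"$from $to"} = 1;
    }
    my @stable;
    for my $key (sort keys %seen) {
        my ($from, $to) = split ' ', $key;
        # Report each reciprocal pair once, smaller ID first.
        push @stable, $key if $from lt $to && $seen{"$to $from"};
    }
    return @stable;
}

my @stable = stable_pairs(['NP_01 NP_02', 'NP_02 NP_01', 'NP_04 NP_05']);
print "$_\n" for @stable;    # prints: NP_01 NP_02
```

NP_04 NP_05 is dropped here because its reciprocal never appears in the input.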

    With pairwise comparisons among three species, such as NP_01, NP_02, and NP_03, there can be 3 possible stable pairs of orthologous sequences. This is why I needed to know whether all of them are present: if not, the group cannot be defined as an "ortholog set", that is, an orthologous sequence present in all three species.

    I really appreciate your help. But nothing is ever easy: as I was going through my data file, I found pairwise comparisons between more than three species, sometimes as many as 13 (see the sample below), which makes it even more complicated.

    Added section of the data file:

    NP_08 NP_09
    NP_08 NP_10
    NP_08 NP_11
    NP_08 NP_12
    NP_08 NP_13
    NP_09 NP_10
    NP_09 NP_11
    NP_09 NP_13
    NP_12 NP_13

    I did not show the reciprocals of the pairs (which do exist), just for convenience.
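One possible way to extend the check beyond three species is sketched here, under two assumptions that do not come from the thread: pairs are treated as undirected (since the reciprocals are said to exist), and a "group" means all IDs connected to each other through at least one chain of pairs. The sub name `missing_pairs` is hypothetical. For each group it lists the pairs that would still be needed for every member to be paired with every other member.

```perl
use strict;
use warnings;

# Assumption: a group is a connected component of the undirected pair graph.
sub missing_pairs {
    my ($lines) = @_;
    my %linked;    # $linked{A}{B} and $linked{B}{A} are set for each pair
    for my $line (@$lines) {
        my ($x, $y) = split ' ', $line;
        next if !defined $y;    # skip blank or malformed lines
        $linked{$x}{$y} = $linked{$y}{$x} = 1;
    }

    my (%visited, @report);
    for my $start (sort keys %linked) {
        next if $visited{$start}++;
        # Breadth-first walk to collect everything reachable from $start.
        my @group;
        my @queue = ($start);
        while (defined(my $id = shift @queue)) {
            push @group, $id;
            push @queue, grep { !$visited{$_}++ } sort keys %{ $linked{$id} };
        }
        @group = sort @group;

        # Every unordered pair of members that was never seen is "missing".
        my @missing;
        for my $i (0 .. $#group - 1) {
            for my $j ($i + 1 .. $#group) {
                push @missing, "$group[$i] $group[$j]"
                    if !$linked{ $group[$i] }{ $group[$j] };
            }
        }
        push @report, { members => \@group, missing => \@missing };
    }
    return @report;
}

my @data = (
    'NP_08 NP_09', 'NP_08 NP_10', 'NP_08 NP_11', 'NP_08 NP_12',
    'NP_08 NP_13', 'NP_09 NP_10', 'NP_09 NP_11', 'NP_09 NP_13',
    'NP_12 NP_13',
);
for my $group (missing_pairs(\@data)) {
    printf "%d members, %d missing pairs\n",
        scalar @{ $group->{members} }, scalar @{ $group->{missing} };
    print "  missing: $_\n" for @{ $group->{missing} };
}
```

On the sample section above this reports one group of 6 members with 6 of the 15 possible pairs missing, so under these assumptions that group would not qualify as a complete set. Whether connectivity is the right way to define a group for real ortholog data is a separate question that the sketch does not settle.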

    Thank you again!