jpearl has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone, I'm working on a bacterial neighbor grouping problem. Basically the idea is there are many strains of a particular type of bacteria and we are looking at grouping 'sets' of these together based on relatedness (i.e. Bacteria_1 is more closely related to Bacteria_2 than Bacteria_3 because 1 and 2 have many more genes in common than 3). I have gotten to the point where I have 2d hashes representing all the groups I've found. However some of these groups overlap, for example:

%groupOfBact

looks like:

group1=>()

group2=>()

group3=>()

I would like to take %groupOfBact and merge the groups with common strains, while leaving the ones with no common strains intact. So, after the operation %groupOfBact would look like

group1=>()

group2=>()

I don't really care about what the values are afterwords (i.e. strain6=>10000 for all I care) and I think I have a structure that kind of works... but I feel like its really clunky, and even worse almost impossible for other people to read and understand. hrm, here's what it looks like:
#Handle the case of overlapping groups $groupID = 1; my %groupSeen; foreach my $group1 (keys %groupOfBact){ foreach my $group2 (keys %groupOfBact){ if ($group1 != $group2){ foreach my $strain (keys %{$groupOfBact{$group1}}){ if((exists $groupOfBact{$group2}{$strain})&&(!exists $ +groupSeen{$strain})){ %{$updatedGroupHash{$groupID}} = (%{$groupOfBact{$ +group1}}, %{$groupOfBact{$group2}}); foreach my $seen (keys %{$updatedGroupHash{$groupI +D}} ){ $groupSeen{$seen}=1; } $groupID++; last; } else { %{$updatedGroupHash{$groupID}} = %{$groupOfBact{$g +roup1}}; } } } } }
Any help would be greatly appreciated! Oh, I've also taken a look at Hash::Merge, but it seems to break with some message about odd numbers of elements... I could be using it incorrectly however (I'm still pretty new to perl)

Replies are listed 'Best First'.
Re: Merging Complex Hashes
by ig (Vicar) on Mar 11, 2009 at 19:43 UTC
      Thank you all! Algorithm::EquivalenceSets is perfect.

      Unfortunately I broke it when trying to install with CPAN. Devel::CheckOS is a pre-req and I think I accidentally hit "no" when I should have hit "yes" (it asked if I was running unix directly after asking if I was running linux! how was I supposed to know the correct answer was "yes"??). Anyway, now it won't let me install it. "clean Devel::CheckOS" doesn't appear to work. Oh well, I'll figure it out eventually. Thanks again for all your help.

        You can try cd ~/.cpan/build/Devel-CheckOS* then perl Makefile.PL. This should take you through the set of questions again and let you set it up right. I just accepted all the defaults, despite wondering whether Linux is Unix. Then the usual make, make test and make install.

Re: Merging Complex Hashes
by ikegami (Patriarch) on Mar 11, 2009 at 19:04 UTC
    Here's an efficient method:
    use strict; use warnings; my %groupOfBact = ( group1 => { strain1 => 1, strain2 => 1, strain4 => 1, }, group2 => { strain3 => 1, strain4 => 1, strain5 => 1, }, group3 => { strain6 => 1, strain7 => 1, }, ); my %bact_by_strain; for my $bact ( sort keys %groupOfBact ) { for my $strain ( keys %{ $groupOfBact{$bact} } ) { push @{ $bact_by_strain{$strain} }, $bact; } } my %delete; for my $strain ( keys %bact_by_strain ) { my $bacts = $bact_by_strain{$strain}; next if @$bacts < 2; my $dst = $bacts->[0]; for my $src ( @{$bacts}[ 1..$#$bacts ] ) { my $src_bact = $groupOfBact{$src}; my $dst_bact = $groupOfBact{$dst}; next if $src_bact == $dst_bact; %$dst_bact = ( %$dst_bact, %$src_bact ); $groupOfBact{$src} = $groupOfBact{$dst}; $delete{$src} = 1; } } delete @groupOfBact{ keys %delete }; use Data::Dumper qw( Dumper ); print(Dumper(\%groupOfBact));

    It still does some unnecessary merges, though. ( Actually no, I fixed that before posting )

    Update: This is the approach moritz mentioned.

      Hi Guys, thanks for getting back to me so quickly. I figured I would elucidate some more before I tried any of your suggestions. First of all, the idea is that these are all strains of the same species of bacteria. We have a clustering algorithm that goes through a set of the different strains, finds which genes in all strains are common (i.e. these are "core" genes) and then finds all the genes which are distributed (not unique to any one bacteria, but not ubiquitous throughout all strains). My program counts, for each strain, the number of distributed genes it has in common with any other particular strain. I divide that number by the total number of distributed genes and then subtract by one to find the "distance" between the strains.

      After that the bacteria are then grouped together by the nearest lying other bacteria. Groups consist of bacteria that all share at least one common nearest neighboring bacteria. I have that pretty much under control. However, now that I have a Hash that for each group has a list of these nearest neighboring bacteria, but in some cases they overlap. Say, Bacteria A's nearest neighbor is D so it goes to group one. B is nearest neighbor to C, so it goes into its own group two. But then D is nearest neighbor to B, so D and B go into group one, but B is also part of group 2, so those two first groups should just be one.

      Basically at this point the only decision that needs to be made is a binary "are any strains in group1 also in any other group? if so, merge the groups".
Re: Merging Complex Hashes
by moritz (Cardinal) on Mar 11, 2009 at 18:42 UTC
    One things I didn't quite understand is the following:

    Suppose you have four bacteria, A, B, C and D. A and B have one gene in common, B and C have one gene in common, and C and D have one in common. Should they all be merged into the same group?

    If yes, then your problem represents an undirected graph, and you're looking for connected subsets - which is and a fairly standard problem, and can be solved easily once you turn the hashes around, ie for each gene you store which bacteria they are part of.

Re: Merging Complex Hashes
by MaskedMarauder (Acolyte) on Mar 11, 2009 at 18:50 UTC

    I dunno... It looks more like an AI problem than a perl problem. What's the metric you use to determine 'sameness'? And, are the groups 'real' (ie have defined value/property independent of strains) or just buckets to put things in if they are different enough from others?

    My inclination would be to use numerical methods (eg principal component analysis) first. But I'm speaking as a biologist, not a perlie.