in reply to Should I use a hash for this?

Is your input a given or is %hash something you created?

In either case, I think you will find this problem much easier to solve if you store the groups and their members in an HoH (hash of hashes). Storing the list of genes in a string makes the comparison much harder: to do any comparison you have to parse the string into individual genes and then search one group using the genes from another. If the gene lists were stored in hashes, most of this work would be done for you.

A hash of hashes for your data would look like this:

    my %hGroups = (
        'Group1' => {
            'ATRG7' => 1,
            'ATG2'  => 1,
            'ATG4'  => 1,
            'ATG1'  => 1
        },
        'Group3' => {
            'FYCO1' => 1,
            'LSM2'  => 1
        },
        'Group2' => {
            'ATG9' => 1,
            'ATG1' => 1
        }
    );
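To illustrate why this layout makes comparison easier, here is a minimal sketch that finds the genes two groups have in common. The helper name shared_genes is made up for this example; it only relies on exists, grep, and keys.

```perl
use strict;
use warnings;

my %hGroups = (
    'Group1' => { 'ATRG7' => 1, 'ATG2' => 1, 'ATG4' => 1, 'ATG1' => 1 },
    'Group2' => { 'ATG9' => 1, 'ATG1' => 1 },
    'Group3' => { 'FYCO1' => 1, 'LSM2' => 1 },
);

# shared_genes is a hypothetical helper: it returns the sorted list of
# genes that appear in both of the named groups
sub shared_genes {
    my ($hAllGroups, $sGroupA, $sGroupB) = @_;
    my $hMembersA = $hAllGroups->{$sGroupA};
    my $hMembersB = $hAllGroups->{$sGroupB};

    # each membership test is a single hash lookup, no string parsing
    my @aShared = sort grep { exists $hMembersB->{$_} } keys %$hMembersA;
    return @aShared;
}

print join(',', shared_genes(\%hGroups, 'Group1', 'Group2')), "\n";   # ATG1
```

With the string form, the same check would need a split on every comparison plus a search through the resulting list.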

If your data arrives as comma-delimited gene lists, you will need to convert %hash. You can use map, split, and keys to do the conversion:

    my %hGroups = map {
        my $sGenes = $hash{$_};
        my $hGroupMembers = { map { $_ => 1 } split(',', $sGenes) };
        $_ => $hGroupMembers;
    } keys %hash;
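If the real input has stray spaces around the commas, a plain split(',', ...) will keep them as part of the gene names. The following is a sketch of a more forgiving conversion. The contents of %hash here are an assumption, reconstructed from the hash of hashes shown earlier with some whitespace added on purpose.

```perl
use strict;
use warnings;

# assumed input: group name => comma-delimited gene list
my %hash = (
    'Group1' => 'ATRG7, ATG2,ATG4 ,ATG1',
    'Group2' => ' ATG9,ATG1',
    'Group3' => 'FYCO1,LSM2 ',
);

my %hGroups;
foreach my $sGroup (keys %hash) {
    my $sGenes = $hash{$sGroup};

    # trim leading and trailing whitespace from the whole list
    $sGenes =~ s/^\s+|\s+$//g;

    # split on commas with optional surrounding whitespace and
    # drop any empty entries (e.g. from a doubled comma)
    my %hMembers = map { $_ => 1 }
                   grep { length }
                   split /\s*,\s*/, $sGenes;

    $hGroups{$sGroup} = \%hMembers;
}

print join(',', sort keys %{ $hGroups{'Group1'} }), "\n";   # ATG1,ATG2,ATG4,ATRG7
```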

Once you have your data in hash-of-hashes form, regrouping can be done easily with the help of exists. In this code sample, %hGroups is the hash of hashes above, and %hNewGroups will store the new groupings:

    my %hNewGroups;

    oldgroup:
    foreach my $sGroup (keys %hGroups) {
        my $hGroupMembers = $hGroups{$sGroup};

        # check each new group for genes in common with
        # current old group ($sGroup)
        foreach my $sNewGroup (keys %hNewGroups) {

            # check genes in the old group to see if any are
            # in the new group ($sNewGroup).
            # Note: use exists to prevent auto-vivification
            # (automatic adding) of $sGene to the members hash
            my $hNewGroupMembers = $hNewGroups{$sNewGroup};
            foreach my $sGene (keys %$hGroupMembers) {
                if (exists($hNewGroupMembers->{$sGene})) {
                    $hNewGroups{$sNewGroup}
                        = { %$hNewGroupMembers, %$hGroupMembers };
                    next oldgroup;
                }
            }
        }

        # create a new group, since no gene is in common with
        # other groups found so far
        $hNewGroups{$sGroup} = $hGroupMembers;
    }
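One caveat worth checking against your real data: the loop above merges an old group into the first new group that shares a gene and then moves on. If an old group shares genes with two different new groups, those two can end up separate, depending on the order in which keys returns the groups. The sketch below is one possible way to handle that case by absorbing every overlapping new group. The helper name merge_groups and the GENE_* names are invented for illustration.

```perl
use strict;
use warnings;

# merge_groups is a hypothetical helper: it takes a reference to a hash
# of hashes and returns a reference to a new one in which any groups
# that share a gene, directly or through a chain, have been combined
sub merge_groups {
    my ($hOldGroups) = @_;
    my %hNewGroups;

    foreach my $sGroup (sort keys %$hOldGroups) {
        # work on a copy so the input hashes are never modified
        my %hMerged = %{ $hOldGroups->{$sGroup} };
        my $sTarget = $sGroup;

        foreach my $sNewGroup (sort keys %hNewGroups) {
            my $hNewMembers = $hNewGroups{$sNewGroup};
            next unless grep { exists $hNewMembers->{$_} } keys %hMerged;

            # overlap found: absorb this new group and keep looking,
            # since other new groups may overlap as well
            %hMerged = (%hMerged, %$hNewMembers);
            delete $hNewGroups{$sNewGroup};
            $sTarget = $sNewGroup if $sNewGroup lt $sTarget;
        }

        $hNewGroups{$sTarget} = \%hMerged;
    }
    return \%hNewGroups;
}

# made-up data: Group3 bridges Group1 and Group2
my %hBridged = (
    'Group1' => { 'GENE_A' => 1, 'GENE_B' => 1 },
    'Group2' => { 'GENE_C' => 1, 'GENE_D' => 1 },
    'Group3' => { 'GENE_B' => 1, 'GENE_C' => 1 },
);

my $hResult = merge_groups(\%hBridged);
foreach my $sGroup (sort keys %$hResult) {
    print "$sGroup: ", join(',', sort keys %{ $hResult->{$sGroup} }), "\n";
}
# prints: Group1: GENE_A,GENE_B,GENE_C,GENE_D
```

The merged group keeps the name that sorts first, which is an arbitrary choice made only to keep the output reproducible.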

The contents of %hNewGroups will be something like this:

    %hNewGroups = (
        'Group1' => {
            'ATG9'  => 1,
            'ATG2'  => 1,
            'ATRG7' => 1,
            'ATG4'  => 1,
            'ATG1'  => 1
        },
        'Group3' => {
            'FYCO1' => 1,
            'LSM2'  => 1
        }
    );

You can always get back to comma delimited lists later on by using code like this:

    while (my ($sNewGroup, $hMembers) = each(%hNewGroups)) {
        print "$sNewGroup: " . join(',', keys %$hMembers) . "\n";
    }

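which prints out something like this (hash order is not guaranteed, so the groups and genes may appear in a different order):

    Group1: ATG9,ATG2,ATRG7,ATG4,ATG1
    Group3: FYCO1,LSM2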
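Since Perl does not promise any particular order for keys or each, a sorted version can be handy when you want to compare output between runs. This is a small sketch; format_groups is a made-up helper name.

```perl
use strict;
use warnings;

my %hNewGroups = (
    'Group1' => { 'ATG9' => 1, 'ATG2' => 1, 'ATRG7' => 1, 'ATG4' => 1, 'ATG1' => 1 },
    'Group3' => { 'FYCO1' => 1, 'LSM2' => 1 },
);

# format_groups is a hypothetical helper: it returns one line per group,
# with groups and genes both in sorted order
sub format_groups {
    my ($hGroups) = @_;
    my @aLines;
    foreach my $sGroup (sort keys %$hGroups) {
        my @aGenes = sort keys %{ $hGroups->{$sGroup} };
        push @aLines, "$sGroup: " . join(',', @aGenes);
    }
    return @aLines;
}

print "$_\n" for format_groups(\%hNewGroups);
# prints:
# Group1: ATG1,ATG2,ATG4,ATG9,ATRG7
# Group3: FYCO1,LSM2
```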

Best, beth

Re^2: Should I use a hash for this?
by awos22 (Initiate) on Apr 16, 2009 at 19:29 UTC
    Thanks to all of you for helping me out with this!
    I'm a little embarrassed that I didn't check my thread before now...but I was trapped into doing other things grant related (have to pay the bills somehow).

    Anyway, now I can hopefully get back to the more fun stuff, which is of course writing Perl scripts. :)

    I will implement your various suggestions and see which one works the best and can handle some of the more complicated data that I will need to analyze. I just wanted to post something right away to let you all know how grateful I am for the amazing suggestions - you guys are awesome.

    I'll post something later to let you know how it all goes.

    By the way, nice job calling them genes Beth, I was wondering if anyone would notice. ;)

    -Mat