Combining hashes of hashes?

erio has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks! Here's what my data looks like:

File#1
dna:
species_1  ACCATGATACGATG
species_2  GGTTTCGACGCAGA
species_3  GGACTCAGCGACTA

File#2
morph:
species_1  001001010201001001
species_2  002010200210120201
species_4  001001110000000101
species_5  111001001001000201

If both data types exist for a species, I would like to append the morph data to the dna data. If a species is missing one of the data types, I would like to append as many "?"s as there are elements in the missing data partition:

species_1  ACCATGATACGATG001001010201001001
species_2  GGTTTCGACGCAGA002010200210120201
species_3  GGACTCAGCGACTA??????????????????
species_4  ??????????????001001110000000101
species_5  ??????????????111001001001000201

I am completely new to Perl. What is the best strategy here? I tried constructing a hash of anonymous array refs:

open IN, "dnamorph.txt" or die $!;
my %data;
my $taxon;
while (<IN>) {
    if (/(^\w+)(\t+)(\w+)/) {
        $taxon = $1;
        push @{ $data{$taxon} }, $3;
    }
}

Here, I couldn't figure out how to recognize if one of the partitions is missing and fill it with "?"s. My hash of de-refed arrays looked like this:

species_1  ACCATGATACGATG001001010201001001
species_2  GGTTTCGACGCAGA002010200210120201
species_3  GGACTCAGCGACTA
species_4  001001110000000101
species_5  111001001001000201

I also tried to make a hash of hashes for each data partition, eg:

open DNAA, "dna.nxs" or die $!;
my %ddata;
while (<DNAA>) {
    if (/(^\w+)(\t+)(\w+)/) {
        $ddata{ $1} = {
            dna => $3,
        };
    }
}

My data looks like this:

species_six: dna=TTGGGACAGCCGAGGCACGA 
species_two: dna=AAAATCGGGCGGCGCTTTTC 
species_five: dna=TTCCAGGACATCGGCATACG 
species_three: dna=GGGGCCCCAATATCGATACG 
species_four: dna=GGGGAGGACGTAGATATTAT 
species_one: dna=ACTGTTTCGTAGGGCTAGGA 

species_two: morph=111101011011011 
species_five: morph=111101011011011 
species_three: morph=012111011011011 
species_four: morph=112111011011011 
species_one: morph=110111011011111

Here, I can't figure out how to combine the 2 hashes of hashrefs without clobbering some values. Any help would be much appreciated, erio
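
For reference, one way two such per-partition hashes could be combined without clobbering is to copy the inner keys one at a time into a single hash. This is only a minimal sketch: `%mdata` and `merge_partitions` are hypothetical names (the question only shows `%ddata`), and the data is filled in by hand rather than read from the files.

```perl
use strict;
use warnings;

# Two per-partition hashes of hashrefs, filled by hand for illustration.
# %mdata is a hypothetical name for the morph counterpart of %ddata.
my %ddata = (
    species_1 => { dna => 'ACCATGATACGATG' },
    species_3 => { dna => 'GGACTCAGCGACTA' },
);
my %mdata = (
    species_1 => { morph => '001001010201001001' },
    species_4 => { morph => '001001110000000101' },
);

# Copy inner keys one at a time so that an existing species entry is
# extended rather than replaced.
sub merge_partitions {
    my @sources = @_;
    my %merged;
    for my $source (@sources) {
        for my $species (keys %$source) {
            for my $partition (keys %{ $source->{$species} }) {
                $merged{$species}{$partition} = $source->{$species}{$partition};
            }
        }
    }
    return %merged;
}

my %data = merge_partitions(\%ddata, \%mdata);

# List which partitions each species ended up with.
for my $species (sort keys %data) {
    my @have = sort keys %{ $data{$species} };
    print "$species: @have\n";
}
```

Assigning `$merged{$species} = $source->{$species}` instead would replace the whole inner hash, which is where values get lost.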

Replies are listed 'Best First'.
Re: Combining hashes of hashes? by GrandFather (Saint) on Nov 07, 2007 at 00:37 UTC
An issue is that `push @{ $data{$taxon} }, $3;` does not make sense: you intimate that the data is a hash of hashes, but that statement uses it as a hash of arrays. Without more information about what you want to do with the data, it's not clear whether the data structures you are generating are appropriate. Consider the following sample, however:

use strict;
use warnings;

my $file1 = <<FILE;
dna:
species_1  ACCATGATACGATG
species_2  GGTTTCGACGCAGA
species_3  GGACTCAGCGACTA
FILE

my $file2 = <<FILE;
morph:
species_1  001001010201001001
species_2  002010200210120201
species_4  001001110000000101
species_5  111001001001000201
FILE

my %data;
my $dnaLen = 0;
my $morphLen = 0;

open IN, '<', \$file1 or die "Failed to open file1: $!";
while (<IN>) {
    next unless /(^\w+)\s+(\w+)/;
    $data{$1}{dna} = $2;
    $dnaLen ||= length $2;
}
close IN;

open IN, '<', \$file2 or die "Failed to open file2: $!";
while (<IN>) {
    next unless /(^\w+)\s+(\w+)/;
    $data{$1}{morph} = $2;
    $morphLen ||= length $2;
}
close IN;

die "No dna data found" unless $dnaLen;
die "No morph data found" unless $morphLen;

for my $species (sort keys %data) {
    $data{$species}{dna} ||= '?' x $dnaLen;
    $data{$species}{morph} ||= '?' x $morphLen;
    print "$species: $data{$species}{dna}$data{$species}{morph}\n";
}

Prints:

species_1: ACCATGATACGATG001001010201001001
species_2: GGTTTCGACGCAGA002010200210120201
species_3: GGACTCAGCGACTA??????????????????
species_4: ??????????????001001110000000101
species_5: ??????????????111001001001000201

This uses a hash of hashes where the primary key is the species and the secondary key is morph or dna.

Perl is environmentally friendly - it saves trees
Re^2: Combining hashes of hashes? by erio (Initiate) on Nov 07, 2007 at 18:38 UTC
Thanks GrandFather. Very helpful. Sorry to have been a bit light on context. I am trying to put DNA sequence data and coded morphological data into a file format that is accepted by a number of programs that reconstruct the evolutionary relationships amongst a group of organisms. The file will look something like this:

#nexus
begin data;
dimensions ntax=5 nchar=32;
format datatype=mixed (dna:1-14, standard:15-32) missing=? gap=-;
Matrix
species_1: ACCATGATACGATG001001010201001001
species_2: GGTTTCGACGCAGA002010200210120201
species_3: GGACTCAGCGACTA??????????????????
species_4: ??????????????001001110000000101
species_5: ??????????????111001001001000201
end;

The number of species and characters can be quite large.
Re^3: Combining hashes of hashes? by GrandFather (Saint) on Nov 07, 2007 at 19:53 UTC
In that case the hash of hashes is exactly appropriate, and it looks like my sample code should drop right into your application. Happy to help.

Update: there is enough information to generate the header too: ;)

...
my @species = sort keys %data;
my $nSpecies = @species;

# Print the header
print "begin data;\n";
printf "dimensions ntax=%d nchar=%d;\n", $nSpecies, $dnaLen + $morphLen;
printf "format datatype=mixed (dna:1-%d, standard:%d-%d) missing=? gap=-;\n",
    $dnaLen, $dnaLen + 1, $dnaLen + $morphLen;
print "Matrix\n";

# Print the data
for my $species (sort keys %data) {
    $data{$species}{dna} ||= '?' x $dnaLen;
    $data{$species}{morph} ||= '?' x $morphLen;
    print "$species: $data{$species}{dna}$data{$species}{morph}\n";
}

print "end;\n";

Prints:

begin data;
dimensions ntax=5 nchar=32;
format datatype=mixed (dna:1-14, standard:15-32) missing=? gap=-;
Matrix
species_1: ACCATGATACGATG001001010201001001
species_2: GGTTTCGACGCAGA002010200210120201
species_3: GGACTCAGCGACTA??????????????????
species_4: ??????????????001001110000000101
species_5: ??????????????111001001001000201
end;
Re^4: Combining hashes of hashes? by erio (Initiate) on Nov 08, 2007 at 01:03 UTC
Re^2: Combining hashes of hashes? by convenientstore (Pilgrim) on Nov 07, 2007 at 21:48 UTC
Grandfather, what is the purpose of `$dnaLen ||= length $2;`? At first I was just reading to see if I could make sense of the `||=` notation, and I found out that it is just `$dnaLen = $dnaLen || length $2;`. But in this case, wouldn't $dnaLen never be anything other than 0 (false)? Am I not reading this correctly?

UPDATE -- I guess it's being used here:

die "No dna data found" unless $dnaLen;
die "No morph data found" unless $morphLen;
Re^3: Combining hashes of hashes? by GrandFather (Saint) on Nov 07, 2007 at 22:09 UTC
`$x ||= something;` is commonly used to give $x a value if it hasn't one already (more correctly, if the current value is false). In the case cited, it picks up the first non-zero length of a dna string. There is an implicit assumption that all dna strings are the same length.

Note that Perl returns whichever true value it finds when evaluating `||` (not simply a true or false value), so if $x is false to start with it gets the value of 'something', whatever the nature of 'something' is. In particular, this trick can be used to set a scalar to a default string if the scalar hasn't been set already:

my $error;
...
$error ||= 'No error found';
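
As a side note to the point about "false" rather than "unset", the difference shows up when 0 is a legitimate value. The sketch below contrasts `||=` with the defined-or assignment `//=`, which is available from Perl 5.10 onward; the variable names are made up for illustration and are not from the thread.

```perl
use strict;
use warnings;

my $count = 0;    # defined, but false
my $label;        # undef

# ||= assigns when the left side is false: undef, 0, '0' or ''.
my $count_or = $count;
$count_or ||= 10;     # 0 is false, so this becomes 10

# //= assigns only when the left side is undef.
my $count_dor = $count;
$count_dor //= 10;    # 0 is defined, so this stays 0

$label //= 'unset';   # undef, so this becomes 'unset'

print "$count_or $count_dor $label\n";    # prints "10 0 unset"
```

For the length bookkeeping in this thread `||=` is fine, since a length of 0 means no data was seen yet.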
Re^4: Combining hashes of hashes? by convenientstore (Pilgrim) on Nov 07, 2007 at 22:20 UTC
Re: Combining hashes of hashes? by tuxz0r (Pilgrim) on Nov 07, 2007 at 17:22 UTC
I like Grandfather's solution. Mine reads the records into temporary structures, then later gets the unique species and associates the dna and morph strings. Mine also hardcodes the length of each output part (dna or morph); I did that since I wasn't sure whether they were a fixed length or could be variable. That is easily handled the way Grandfather does in his program.

# Read in DNA file (file.1)
my %dna = ();
open my $file1, "<", "./file.1" or die "Can't open file.1: $!";
while (<$file1>) {
    chomp;
    next if $. == 1;
    my ($key, $val) = split /\s+/;
    $dna{$key} = $val;
}

# Read in MORPH file (file.2)
my %morph = ();
open my $file2, "<", "./file.2" or die "Can't open file.2: $!";
while (<$file2>) {
    chomp;
    next if $. == 1;
    my ($key, $val) = split /\s+/;
    $morph{$key} = $val;
}

# Get sorted, unique keys from above
my %allkeys = map { $_ => 1 } (sort keys %dna, sort keys %morph);
my @uniq_species = sort keys %allkeys;

my %records = ();
foreach (@uniq_species) {
    $records{$_}  = (defined $dna{$_})   ? $dna{$_}   : "?"x14;
    $records{$_} .= (defined $morph{$_}) ? $morph{$_} : "?"x18;
}

foreach (sort keys %records) {
    print "$_ $records{$_}\n";
}

---
`echo S 1 [ Y V U | perl -ane 'print reverse map { $_ = chr(ord($_)-1) } @F;'`
Warning: Any code posted by tuxz0r is untested, unless otherwise stated, and is used at your own risk.
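
The hardcoded widths (14 and 18) could instead be taken from the data, along the lines of what GrandFather's version does. What follows is a sketch under assumptions: `partition_width` is a hypothetical helper, the two hashes are filled by hand in place of the file-reading loops, and the equal-length check is an addition that neither posted script performs.

```perl
use strict;
use warnings;

# Hand-filled stand-ins for the %dna and %morph hashes built above.
my %dna   = (species_1 => 'ACCATGATACGATG',     species_3 => 'GGACTCAGCGACTA');
my %morph = (species_1 => '001001010201001001', species_4 => '001001110000000101');

# Width of a partition, taken from its first value; dies if the
# partition is empty or if its values differ in length.
sub partition_width {
    my ($name, $partition) = @_;
    my @values = values %$partition;
    die "No $name data found\n" unless @values;
    my $width = length($values[0]);
    for my $value (@values) {
        die "Unequal $name lengths\n" unless length($value) == $width;
    }
    return $width;
}

my $dna_width   = partition_width('dna',   \%dna);
my $morph_width = partition_width('morph', \%morph);

# Build one row per species, padding a missing partition with "?".
my %seen = map { $_ => 1 } (keys %dna, keys %morph);
my @rows;
for my $species (sort keys %seen) {
    my $d = exists $dna{$species}   ? $dna{$species}   : '?' x $dna_width;
    my $m = exists $morph{$species} ? $morph{$species} : '?' x $morph_width;
    push @rows, "$species  $d$m";
}
print "$_\n" for @rows;
```

With the sample values, the species_3 row ends in 18 question marks and the species_4 row starts with 14.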
Re^2: Combining hashes of hashes? by erio (Initiate) on Nov 07, 2007 at 18:41 UTC
Thanks tuxz0r. Learned much from both your and GrandFather's scripts. The length of both data partitions can be variable. Cheers!