erio has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks! Here's what my data looks like:
File#1 dna:
species_1 ACCATGATACGATG
species_2 GGTTTCGACGCAGA
species_3 GGACTCAGCGACTA

File#2 morph:
species_1 001001010201001001
species_2 002010200210120201
species_4 001001110000000101
species_5 111001001001000201
If both data types exist for a species, I would like to append the morph data to the dna data. If a species is missing one of the data types, I would like to append as many "?"s as there are elements in the missing data partition:
species_1 ACCATGATACGATG001001010201001001
species_2 GGTTTCGACGCAGA002010200210120201
species_3 GGACTCAGCGACTA??????????????????
species_4 ??????????????001001110000000101
species_5 ??????????????111001001001000201
I am completely new to Perl. What is the best strategy here? I tried constructing a hash of anonymous array refs:
open IN, "dnamorph.txt" or die $!;
my %data;
my $taxon;
while (<IN>) {
    if (/(^\w+)(\t+)(\w+)/) {
        $taxon = $1;
        push @{ $data{$taxon} }, $3;
    }
}
Here, I couldn't figure out how to recognize if one of the partitions is missing and fill it with "?"s. My hash of de-refed arrays looked like this:
species_1 ACCATGATACGATG001001010201001001
species_2 GGTTTCGACGCAGA002010200210120201
species_3 GGACTCAGCGACTA
species_4 001001110000000101
species_5 111001001001000201
I also tried to make a hash of hashes for each data partition, eg:
open DNAA, "dna.nxs" or die $!;
my %ddata;
while (<DNAA>) {
    if (/(^\w+)(\t+)(\w+)/) {
        $ddata{$1} = { dna => $3, };
    }
}
My data looks like this:
species_six: dna=TTGGGACAGCCGAGGCACGA
species_two: dna=AAAATCGGGCGGCGCTTTTC
species_five: dna=TTCCAGGACATCGGCATACG
species_three: dna=GGGGCCCCAATATCGATACG
species_four: dna=GGGGAGGACGTAGATATTAT
species_one: dna=ACTGTTTCGTAGGGCTAGGA

species_two: morph=111101011011011
species_five: morph=111101011011011
species_three: morph=012111011011011
species_four: morph=112111011011011
species_one: morph=110111011011111
Here, I can't figure out how to combine the 2 hashes of hashrefs without clobbering some values. Any help would be much appreciated, erio

Re: Combining hashes of hashes?
by GrandFather (Saint) on Nov 07, 2007 at 00:37 UTC

    An issue is that push @{ $data{$taxon} }, $3; does not make sense - you intimate that data is a hash of hash, but you are using it as a hash of array in that statement.

    Without more information telling us what you want to do with the data it's not clear if the data structures you are generating are appropriate. Consider the following sample however:

    use strict;
    use warnings;

    my $file1 = <<FILE;
    dna:
    species_1 ACCATGATACGATG
    species_2 GGTTTCGACGCAGA
    species_3 GGACTCAGCGACTA
    FILE

    my $file2 = <<FILE;
    morph:
    species_1 001001010201001001
    species_2 002010200210120201
    species_4 001001110000000101
    species_5 111001001001000201
    FILE

    my %data;
    my $dnaLen = 0;
    my $morphLen = 0;

    open IN, '<', \$file1 or die "Failed to open file1: $!";
    while (<IN>) {
        next unless /(^\w+)\s+(\w+)/;
        $data{$1}{dna} = $2;
        $dnaLen ||= length $2;
    }
    close IN;

    open IN, '<', \$file2 or die "Failed to open file2: $!";
    while (<IN>) {
        next unless /(^\w+)\s+(\w+)/;
        $data{$1}{morph} = $2;
        $morphLen ||= length $2;
    }
    close IN;

    die "No dna data found" unless $dnaLen;
    die "No morph data found" unless $morphLen;

    for my $species (sort keys %data) {
        $data{$species}{dna} ||= '?' x $dnaLen;
        $data{$species}{morph} ||= '?' x $morphLen;
        print "$species: $data{$species}{dna}$data{$species}{morph}\n";
    }

    Prints:

    species_1: ACCATGATACGATG001001010201001001
    species_2: GGTTTCGACGCAGA002010200210120201
    species_3: GGACTCAGCGACTA??????????????????
    species_4: ??????????????001001110000000101
    species_5: ??????????????111001001001000201

    which uses a hash of hash where the primary key is the species and the secondary key is morph or dna.
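The same hash-of-hash layout also answers the question of combining two separately built hashes of hashrefs without clobbering: copy the inner keys one at a time instead of assigning a whole inner hashref. A minimal sketch, assuming the data has already been read into two hashes; merge_partitions and %mdata are made-up names for illustration, and %ddata mirrors the hash in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: merge any number of species => { partition => value }
# hashes into one, copying inner keys individually so nothing is clobbered.
sub merge_partitions {
    my @sources = @_;
    my %merged;
    for my $source (@sources) {
        for my $species (keys %$source) {
            for my $partition (keys %{ $source->{$species} }) {
                $merged{$species}{$partition} = $source->{$species}{$partition};
            }
        }
    }
    return \%merged;
}

# Sample data taken from the question (a subset of the species).
my %ddata = (
    species_1 => { dna => 'ACCATGATACGATG' },
    species_3 => { dna => 'GGACTCAGCGACTA' },
);
my %mdata = (
    species_1 => { morph => '001001010201001001' },
    species_4 => { morph => '001001110000000101' },
);

my $merged = merge_partitions(\%ddata, \%mdata);
for my $species (sort keys %$merged) {
    my @parts = map { "$_=$merged->{$species}{$_}" }
                sort keys %{ $merged->{$species} };
    print "$species: @parts\n";
}
```

Assigning $merged{$species} = $source->{$species} instead would replace the whole inner hash, which is the clobbering described in the question.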


    Perl is environmentally friendly - it saves trees
      Thanks GrandFather. Very helpful. Sorry to have been a bit light on context. I am trying to put DNA sequence data and coded morphological data into a file format that is accepted by a number of programs that reconstruct the evolutionary relationships amongst a group of organisms. The file will look something like this:
      #nexus
      begin data;
      dimensions ntax=5 nchar=32;
      format datatype=mixed (dna:1-14, standard:15-32) missing=? gap=-;
      Matrix
      species_1: ACCATGATACGATG001001010201001001
      species_2: GGTTTCGACGCAGA002010200210120201
      species_3: GGACTCAGCGACTA??????????????????
      species_4: ??????????????001001110000000101
      species_5: ??????????????111001001001000201
      end;
      The number of species and characters can be quite large.

        In that case the hash of hash is exactly appropriate and it looks like my sample code should drop right into your application. Happy to help.

        Update: there is enough information to generate the header too: ;)

        ...
        my @species = sort keys %data;
        my $nSpecies = @species;

        # Print the header
        print "begin data;\n";
        printf "dimensions ntax=%d nchar=%d;\n", $nSpecies, $dnaLen + $morphLen;
        printf "format datatype=mixed (dna:1-%d, standard:%d-%d) missing=? gap=-;\n",
            $dnaLen, $dnaLen + 1, $dnaLen + $morphLen;
        print "Matrix\n";

        # Print the data
        for my $species (sort keys %data) {
            $data{$species}{dna} ||= '?' x $dnaLen;
            $data{$species}{morph} ||= '?' x $morphLen;
            print "$species: $data{$species}{dna}$data{$species}{morph}\n";
        }

        print "end;\n";

        Prints:

        begin data;
        dimensions ntax=5 nchar=32;
        format datatype=mixed (dna:1-14, standard:15-32) missing=? gap=-;
        Matrix
        species_1: ACCATGATACGATG001001010201001001
        species_2: GGTTTCGACGCAGA002010200210120201
        species_3: GGACTCAGCGACTA??????????????????
        species_4: ??????????????001001110000000101
        species_5: ??????????????111001001001000201
        end;

      Grandfather, What is the purpose of
      $dnaLen ||= length $2;
      At first I was just trying to make sense of the ||= notation, and I found that it is simply shorthand for $dnaLen = $dnaLen || length $2; But in this case, wouldn't $dnaLen never be anything other than 0 (false)? Am I not reading this correctly?

      UPDATE -- I guess it's being used here
      die "No dna data found" unless $dnaLen;
      die "No morph data found" unless $morphLen;

        $x ||= something; is commonly used to give $x a value if it hasn't one already (more correctly, if the current value is false). In the case cited it is to pick up the first non-zero length of a dna string. There is an implicit assumption that all dna strings are the same length.
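If that equal-length assumption needs to be checked rather than trusted, something along these lines could run after the file has been read. It is only a sketch: common_length is a hypothetical helper, and it assumes the sequences are held in a plain species => string hash rather than the hash of hash used in the sample.

```perl
use strict;
use warnings;

# Hypothetical helper: return the common length of all values in a
# species => sequence hash, or die if any two sequences differ in length.
sub common_length {
    my ($label, $seqs) = @_;
    my $len;
    for my $species (sort keys %$seqs) {
        my $this = length $seqs->{$species};
        $len = $this unless defined $len;    # first sequence sets the expectation
        die "$label: $species has length $this, expected $len\n"
            if $this != $len;
    }
    return $len;    # undef when the hash is empty
}

my %dna = (
    species_1 => 'ACCATGATACGATG',
    species_2 => 'GGTTTCGACGCAGA',
);
print "dna length: ", common_length('dna', \%dna), "\n";
```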

        Note that || returns the value of the operand that settles the result (the left operand if it is true, otherwise the right one), not simply a true or false value. So if $x is false to start with, it gets the value 'something', whatever the nature of 'something' is. In particular, this trick can be used to set a scalar to a default string if the scalar hasn't been set already:

        my $error;
        ...
        $error ||= 'No error found';
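One caveat with this idiom: ||= tests truth, not definedness, so a legitimate value that happens to be false (the string '0' or the empty string) is replaced as well. Perl 5.10 and later provide the defined-or form //=, which only fills in undef. A small sketch of the difference; the one-character morph value is invented for illustration.

```perl
use strict;
use warnings;
use 5.010;    # // and //= (and say) need Perl 5.10 or later

my %row = ( morph => '0' );    # hypothetical one-character morph partition

my $with_or = $row{morph};
$with_or ||= '?';              # '0' is false, so the real value is replaced

my $with_defined_or = $row{morph};
$with_defined_or //= '?';      # '0' is defined, so the real value is kept

my $missing;
$missing //= '?';              # undef, so the default is applied

say "||= gives $with_or";
say "//= gives $with_defined_or";
say "undef becomes $missing";
```

For the sample data in this thread the two behave the same, because no partition is '0' or empty.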

Re: Combining hashes of hashes?
by tuxz0r (Pilgrim) on Nov 07, 2007 at 17:22 UTC
    I like GrandFather's solution. Mine reads the records into temporary structures, then later collects the unique species and associates the dna and morph strings. Mine also hardcodes the length of each output part (dna or morph); I did that because I wasn't sure whether they were a fixed length or could vary. That is easily handled the way GrandFather does in his program.
    # Read in DNA file (file.1)
    my %dna = ();
    open my $file1, "<", "./file.1" or die "Can't open file.1: $!";
    while (<$file1>) {
        chomp;
        next if $. == 1;
        my ($key, $val) = split /\s+/;
        $dna{$key} = $val;
    }

    # Read in MORPH file (file.2)
    my %morph = ();
    open my $file2, "<", "./file.2" or die "Can't open file.2: $!";
    while (<$file2>) {
        chomp;
        next if $. == 1;
        my ($key, $val) = split /\s+/;
        $morph{$key} = $val;
    }

    # Get sorted, unique keys from above
    my %allkeys = map { $_ => 1 } (sort keys %dna, sort keys %morph);
    my @uniq_species = sort keys %allkeys;

    my %records = ();
    foreach (@uniq_species) {
        $records{$_} = (defined $dna{$_}) ? $dna{$_} : "?"x14;
        $records{$_} .= (defined $morph{$_}) ? $morph{$_} : "?"x18;
    }

    foreach (sort keys %records) {
        print "$_ $records{$_}\n";
    }

    ---
    echo S 1 [ Y V U | perl -ane 'print reverse map { $_ = chr(ord($_)-1) } @F;'
    Warning: Any code posted by tuxz0r is untested, unless otherwise stated, and is used at your own risk.

      Thanks tuxz0r. Learned much from both your and GrandFather's scripts. The length of both data partitions can be variable. Cheers!
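Since the partition lengths vary between data sets, the hardcoded 14 and 18 in the two-hash version could be derived from the data instead. A rough sketch, assuming every string within a partition has the same length; width_of and combine are hypothetical helper names, not part of either script above.

```perl
use strict;
use warnings;

# Hypothetical helper: width of a partition, taken from any one of its values.
sub width_of {
    my ($seqs) = @_;
    my ($first) = values %$seqs;
    return defined($first) ? length($first) : 0;
}

# Hypothetical helper: join two species => string hashes, padding a missing
# partition with '?' to that partition's width.
sub combine {
    my ($dna, $morph) = @_;
    my $dna_pad   = '?' x width_of($dna);
    my $morph_pad = '?' x width_of($morph);
    my %seen = map { $_ => 1 } keys %$dna, keys %$morph;
    my %records;
    for my $species (keys %seen) {
        my $d = exists($dna->{$species})   ? $dna->{$species}   : $dna_pad;
        my $m = exists($morph->{$species}) ? $morph->{$species} : $morph_pad;
        $records{$species} = $d . $m;
    }
    return \%records;
}

my %dna   = ( species_1 => 'ACCATGATACGATG', species_3 => 'GGACTCAGCGACTA' );
my %morph = ( species_1 => '001001010201001001', species_4 => '001001110000000101' );

my $records = combine(\%dna, \%morph);
print "$_ $records->{$_}\n" for sort keys %$records;
```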