Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks

I am a Perl beginner and I have to model some fairly complicated domain knowledge. With my limited Perl knowledge, the best I can come up with is lots of nested hashes. Here it is:

A snp is something in biology which can be associated with a specific set of genes. Each gene in turn can be associated with a specific set of transcripts. Each transcript also has a set of associated properties that we care about in relation to the original snp. This was my data model

A 'snp hash'. The key is the snp id, the value is a ref to a 'gene hash'. Each key in this hash is a gene id and the value is a ref to a transcript hash. Each key is a transcript id and the return value is a hash ref to a properties hash. The hash has one key per property of the transcript.
I would like to compare the entire 'data tree' generated from one source of data to another 'data tree' from what should be an equivalent source of data, to make sure they are the same. So any solution should aid comparison. I think this nested hash structure allows me to compare, say, the genes of the same snp, as I can compare the keys of the corresponding gene hash ref.
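
As a minimal sketch of the structure described above: the ids (rs1, geneA, tx1) and property names (consequence, strand) are invented for illustration and do not come from any real data source. It shows the four levels of nesting and one way the gene keys for the same snp might be compared.

```perl
use strict;
use warnings;

# Hypothetical ids and property names, invented for illustration only.
my %snp_a = (
    rs1 => {
        geneA => { tx1 => { consequence => 'missense', strand => '+' } },
        geneB => { tx2 => { consequence => 'intronic', strand => '-' } },
    },
);

my %snp_b = (
    rs1 => {
        geneA => { tx1 => { consequence => 'missense', strand => '+' } },
    },
);

# Return the gene ids that the first tree has for a snp and the second lacks.
sub genes_only_in_first {
    my ($first, $second, $snp_id) = @_;
    my $genes_first  = $first->{$snp_id}  || {};
    my $genes_second = $second->{$snp_id} || {};
    my @only = sort grep { !exists $genes_second->{$_} } keys %$genes_first;
    return @only;
}

my @missing = genes_only_in_first(\%snp_a, \%snp_b, 'rs1');
print "Genes for rs1 only in the first tree: @missing\n";
```

The same pattern would have to be repeated one level down for transcripts and again for properties, which is where the nesting starts to feel heavy.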

Any comments welcome.

Replies are listed 'Best First'.
Re: can i avoid all these nested hashes
by graff (Chancellor) on Dec 13, 2010 at 04:17 UTC
    You said:
    i would like to compare the entire 'data tree' generated from one source of data ... to another 'data tree' from what should be equivalent source of data to make sure they are the same.

    I suppose your two data sources are likely to have some range of "irrelevant" differences (e.g. different amounts or placements of whitespace or "comments" that have no impact on whether or not the data of interest is "identical"). I mean, if that were not an issue, you could just use the unix "diff" command on the two data sources.

    BrowserUK and Grandfather have each given a valid and workable approach; the first is more a matter of expedience, getting directly to a specific result that you want, while the second is more a matter of strategic coding, setting up an infrastructure that can easily be expanded to handle additional tasks for data of this type, without the overall code base getting too messy and difficult to maintain as more functions and conditions are added.

    There are a couple more alternatives that come to mind, one being another expedient, and the other being another strategic plan:

    1. Figure out a relatively simple, minimal process for conditioning your two sources into a consistent format, removing irrelevant differences in data content. Once you convert each of the inputs to a consistent, comparable form, a simple "diff" operation will suffice to say whether they are the same, and will show how they differ if they aren't the same. The kinds of data conversions you're likely to need may be very fast and use very little memory -- you're actually just "stream editing" each input file to create comparable data.

    2. Create a set of relational tables in a SQL-accessible database, load your source data into "snp", "gene" and "transcript" tables as appropriate, and use queries to check for differences. This is potentially the most demanding approach, but it offers lots of flexibility for sustainable elaboration later on; add fields to the tables as needed, come up with a wider assortment of queries to answer questions you haven't thought of yet, etc.
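
    A rough sketch of the first alternative, assuming each source has already been parsed into the nested hash from the original question (the parsing step depends on the input format, which is not shown in this thread). The sub name canonical_lines and the sample ids are hypothetical. It emits one sorted, tab-separated line per property, so that two such dumps could be compared with a plain "diff".

```perl
use strict;
use warnings;

# Flatten a snp -> gene -> transcript -> property tree into sorted,
# tab-separated lines, one line per property value.
sub canonical_lines {
    my ($tree) = @_;
    my @lines;
    for my $snp (sort keys %$tree) {
        for my $gene (sort keys %{ $tree->{$snp} }) {
            for my $tx (sort keys %{ $tree->{$snp}{$gene} }) {
                my $props = $tree->{$snp}{$gene}{$tx};
                for my $prop (sort keys %$props) {
                    push @lines,
                        join("\t", $snp, $gene, $tx, $prop, $props->{$prop});
                }
            }
        }
    }
    return @lines;
}

# Hypothetical sample data, invented for illustration.
my %tree = (
    rs2 => { geneB => { tx2 => { strand => '-' } } },
    rs1 => { geneA => { tx1 => { strand => '+', consequence => 'missense' } } },
);

print "$_\n" for canonical_lines(\%tree);
```

    Writing the lines for each source to its own file and running "diff" on the two files would then show exactly which snp/gene/transcript/property entries differ.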

    Making a choice among all these approaches is a matter of deciding how much you need some kind of infrastructure that will accommodate new tasks/problems that might come up later, vs. how important it is to get a specific task done sooner rather than later.
      very helpful answers from everyone - thanks
Re: can i avoid all these nested hashes
by BrowserUk (Patriarch) on Dec 12, 2010 at 23:54 UTC

    Often, depending upon other usage, it makes sense to avoid deeply nested hashes by combining the keys into a single key.

    Ie: instead of:

    my %snp;
    for my $snpId ( ... ) {
        for my $geneId ( ... ) {
            for my $transcriptId ( ... ) {
                for my $propertyId ( ... ) {
                    $snp{ $snpId }{ $geneId }{ $transcriptId }{ $propertyId } = ...;
                }
            }
        }
    }

    Use:

    my %snp;
    for my $snpId ( ... ) {
        for my $geneId ( ... ) {
            for my $transcriptId ( ... ) {
                for my $propertyId ( ... ) {
                    $snp{ join $;, $snpId, $geneId, $transcriptId, $propertyId } = ...;
                }
            }
        }
    }

    The result is a single level hash with very similar selection properties to the multilevel hash, that uses far less memory, is faster to access, and far, far easier to compare one with another.

    $; is a control character (ASCII 28 by default) that is unlikely to appear in any normal text hash key; it is used here to join the keys together to form a composite key. In the unlikely event that your ids could contain that character, there are other characters that can be used to delimit the join that may be preferable, e.g. the null character (chr(0)) or chr(255). If your ids can be Unicode, you may need to be more selective.
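
    To illustrate why the flat form is easy to compare, here is a minimal sketch. The names $SEP, composite_key and compare_flat, and the sample ids, are hypothetical. The separator is written out as "\x1C", which is the same character as the default value of $;, so that it can be used safely in the split pattern.

```perl
use strict;
use warnings;

my $SEP = "\x1C";    # chr(28), the same character as the default value of $;

# Build a composite key from the id parts.
sub composite_key { return join $SEP, @_ }

# Compare two flat hashes and return a list of human-readable differences.
sub compare_flat {
    my ($x, $y) = @_;
    my @report;
    my %seen = map { $_ => 1 } keys %$x, keys %$y;
    for my $key (sort keys %seen) {
        my $label = join '/', split /\Q$SEP\E/, $key;
        if (!exists $y->{$key}) {
            push @report, "only in first: $label";
        }
        elsif (!exists $x->{$key}) {
            push @report, "only in second: $label";
        }
        elsif ($x->{$key} ne $y->{$key}) {
            push @report, "differs: $label ('$x->{$key}' vs '$y->{$key}')";
        }
    }
    return @report;
}

# Hypothetical sample data: keys are snp, gene, transcript, property.
my %first = (
    composite_key(qw(rs1 geneA tx1 strand)) => '+',
    composite_key(qw(rs1 geneB tx2 strand)) => '-',
);
my %second = (
    composite_key(qw(rs1 geneA tx1 strand)) => '+',
    composite_key(qw(rs1 geneB tx2 strand)) => '+',
    composite_key(qw(rs1 geneC tx3 strand)) => '-',
);

print "$_\n" for compare_flat(\%first, \%second);
```

    A single loop over the union of keys replaces the four nested loops that the multilevel hash would need for the same comparison.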


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      That's interesting. I will consider combining keys: I could certainly combine the snp id, gene id and transcript id into one key, and maybe have that hash entry return a hash ref to the properties.
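
      A possible sketch of that hybrid, assuming the data already sits in the nested form from the question. The sub name collapse_ids, the separator variable $SEP and the sample ids are hypothetical. The three ids become one composite key and the properties hash ref is kept as the value.

```perl
use strict;
use warnings;

my $SEP = "\x1C";    # chr(28), the same character as the default value of $;

# Collapse snp -> gene -> transcript into one composite key and keep the
# existing properties hash ref as the value (the ref is shared, not copied).
sub collapse_ids {
    my ($tree) = @_;
    my %flat;
    for my $snp (keys %$tree) {
        for my $gene (keys %{ $tree->{$snp} }) {
            for my $tx (keys %{ $tree->{$snp}{$gene} }) {
                my $key = join $SEP, $snp, $gene, $tx;
                $flat{$key} = $tree->{$snp}{$gene}{$tx};
            }
        }
    }
    return \%flat;
}

# Hypothetical sample data, invented for illustration.
my %tree = ( rs1 => { geneA => { tx1 => { strand => '+' } } } );

my $flat   = collapse_ids(\%tree);
my $key    = join $SEP, 'rs1', 'geneA', 'tx1';
my $strand = $flat->{$key}{strand};
print "strand for rs1/geneA/tx1: $strand\n";
```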
Re: can i avoid all these nested hashes
by GrandFather (Saint) on Dec 13, 2010 at 00:41 UTC

    It may help to use objects instead of explicit hashes. The hashes are still effectively there and the nested structure is still effectively there, but you can provide methods that give friendly ways of managing the data that hide the underlying complexity. Consider:

    #!/usr/bin/perl
    use strict;
    use warnings;

    package SNP;

    sub new {
        my ($class, %params) = @_;

        die "id parameter required by $class constructor\n" if !exists $params{id};
        $params{genes} ||= {};
        return bless \%params, $class;
    }

    sub addGene {
        my ($self, $geneId, %params) = @_;

        $self->{genes}{$geneId} ||= Gene->new(id => $geneId, %params);
        return $self->{genes}{$geneId};
    }

    sub match {
        my ($self, $other) = @_;
        my @otherGenes = sort keys %{$other->{genes}};
        my @genes = sort keys %{$self->{genes}};
        my $result = '';

        return "$self->{id} and $other->{id} differ in number of genes.\n"
            if @otherGenes != @genes;    # different number of genes

        # Check all genes match
        for my $gene (@genes) {
            my $matchFail = $self->{genes}{$gene}->match($other->{genes}{$gene});

            next if !$matchFail;
            $result .= "- Gene mismatch for $gene:\n";
            $result .= $matchFail;
        }

        if ($result) {
            $result = "$self->{id} does not match $other->{id}:\n" . $result;
        }

        return $result;
    }

    package Gene;

    sub new {
        my ($class, %params) = @_;

        die "id parameter required by $class constructor\n" if !exists $params{id};
        $params{trans} ||= {};
        return bless \%params, $class;
    }

    sub addTranscript {
        my ($self, $transId, %params) = @_;

        $self->{trans}{$transId} ||= Transcript->new(id => $transId, %params);
        return $self->{trans}{$transId};
    }

    sub match {
        my ($self, $other) = @_;
        my @otherTrans = sort keys %{$other->{trans}};
        my @trans = sort keys %{$self->{trans}};
        my $result = '';

        return "$self->{id} and $other->{id} differ in number of transcripts\n"
            if @otherTrans != @trans;    # different number of transcripts

        # Check all transcripts match
        for my $transName (@trans) {
            my $matchFail =
                $self->{trans}{$transName}->match($other->{trans}{$transName});

            next if !$matchFail;
            $result .= "-- Transcript mismatch for $transName:\n";
            $result .= $matchFail;
        }

        return $result;
    }

    package Transcript;

    sub new {
        my ($class, %params) = @_;

        die "id parameter required by $class constructor\n" if !exists $params{id};
        $params{props} ||= {};
        return bless \%params, $class;
    }

    sub setProp {
        my ($self, $prop, $value) = @_;

        $self->{props}{$prop} = $value;
    }

    sub match {
        my ($self, $other) = @_;
        my @otherProps = sort keys %{$other->{props}};
        my @props = sort keys %{$self->{props}};
        my $result = '';

        # Report a differing property count; a bare return here would be
        # read by the callers as "no mismatch".
        return "$self->{id} and $other->{id} differ in number of properties\n"
            if @otherProps != @props;    # different number of properties

        # Check all properties match
        for my $propName (@props) {
            if (!defined $other->{props}{$propName}) {
                $result .= "$self->{id} has $propName but $other->{id} doesn't\n";
                next;
            }

            if ($self->{props}{$propName} ne $other->{props}{$propName}) {
                $result .= "--- $self->{id} and $other->{id} property $propName differs:\n";
                $result .= " '$self->{props}{$propName}' and '$other->{props}{$propName}'\n";
                next;
            }
        }

        return $result;
    }

    package main;

    my $snp1 = SNP->new(id => 'SNP1');
    my $gene = $snp1->addGene('Gene1');
    my $trans = $gene->addTranscript('Trans1');

    $trans->setProp(big => 1);
    $trans->setProp(color => 'blue');
    $gene = $snp1->addGene('Gene2');
    $trans = $gene->addTranscript('Trans2');
    $trans->setProp(big => 1);
    $trans->setProp(color => 'green');

    my $snp2 = SNP->new(id => 'SNP2');

    $gene = $snp2->addGene('Gene1');
    $trans = $gene->addTranscript('Trans1');
    $trans->setProp(big => 1);
    $trans->setProp(color => 'blue');
    $gene = $snp2->addGene('Gene2');
    $trans = $gene->addTranscript('Trans2');
    $trans->setProp(big => 1);
    $trans->setProp(color => 'blue');

    print $snp1->match($snp2);

    Prints:

    SNP1 does not match SNP2:
    - Gene mismatch for Gene2:
    -- Transcript mismatch for Trans2:
    --- Trans2 and Trans2 property color differs:
     'green' and 'blue'
    True laziness is hard work

      That's clumsier, slower, and uses 4 times as much memory.

      You're suggesting replacing one line of built-in code

      $snp{'Gene1'}{'Trans1'}{'color'} = 'blue';

      with

      my $snp1 = SNP->new(id => 'SNP1');
      my $gene = $snp1->addGene('Gene1');
      my $trans = $gene->addTranscript('Trans1');
      $trans->setProp(big => 1);

      Plus 70 lines of code that do nothing more than replicate built-in functionality. More code means more bugs.

      The very definition of assinine.

        At first glance, what Grandfather did above looks a lot like "copy/paste" coding, and maybe even "OOP for OOP's sake". But you don't have to look all that closely at his code, and you don't have to think very long about the domain of the problem, to understand that Grandfather is actually making a smart investment up front, in anticipation of what could end up being a fairly diverse set of problems to be addressed.

        I'm all for using expedients whenever possible, and I would have probably used BrowserUK's approach myself, if I knew that there was just the one question to be answered (and quickly) about the data. But putting in a foundation for future work is not "assinine" [sic].