You said:
i would like to compare the entire 'data tree' generated from one source of data ... to another 'data tree' from what should be equivalent source of data to make sure they are the same.

I suppose your two data sources are likely to have some range of "irrelevant" differences (e.g. different amounts or placements of whitespace, or "comments" that have no impact on whether the data of interest is "identical"). If that were not an issue, you could just use the Unix "diff" command on the two data sources.

BrowserUK and Grandfather have each given a valid and workable approach. The first is more a matter of expedience: it gets you directly to the specific result you want. The second is more a matter of strategic coding: it sets up an infrastructure that can easily be expanded to handle additional tasks for data of this type, without the code base getting too messy and difficult to maintain as more functions and conditions are added.

There are a couple more alternatives that come to mind, one being another expedient, and the other being another strategic plan:

  1. Figure out a relatively simple, minimal process for conditioning your two sources into a consistent format, removing irrelevant differences in data content. Once you convert each of the inputs to a consistent, comparable form, a simple "diff" operation will suffice to say whether they are the same, and will show how they differ if they aren't the same. The kinds of data conversions you're likely to need may be very fast and use very little memory -- you're actually just "stream editing" each input file to create comparable data.

  2. Create a set of relational tables in a SQL-accessible database, load your source data into "snp", "gene" and "transcript" tables as appropriate, and use queries to check for differences. This is potentially the most demanding approach, but it offers lots of flexibility for sustainable elaboration later on; add fields to the tables as needed, come up with a wider assortment of queries to answer questions you haven't thought of yet, etc.
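To make the first alternative concrete, here is a minimal sketch in Python. Since the actual file formats are not shown in this thread, it assumes (purely for illustration) that comments start with "#" and that fields are separated by arbitrary whitespace. The names `normalize` and `compare` are hypothetical. Each input is conditioned line by line, and the results are then handed to a plain diff.

```python
import difflib
import re


def normalize(lines, comment_prefix="#"):
    """Yield a canonical form of each meaningful input line.

    Assumptions (not from the original data): comments begin with
    `comment_prefix` and run to end of line, and runs of whitespace
    carry no meaning.
    """
    for line in lines:
        line = line.split(comment_prefix, 1)[0]   # drop any trailing comment
        line = re.sub(r"\s+", " ", line).strip()  # collapse runs of whitespace
        if line:                                  # skip lines that are now empty
            yield line


def compare(lines_a, lines_b):
    """Return unified-diff lines; an empty list means the inputs match."""
    a = list(normalize(lines_a))
    b = list(normalize(lines_b))
    return list(
        difflib.unified_diff(a, b, "source_a", "source_b", lineterm="")
    )


if __name__ == "__main__":
    first = ["snp1   geneA\ttx1  # curated\n", "\n", "snp2 geneB tx2\n"]
    second = ["snp1 geneA tx1\n", "snp2 geneB tx9\n"]
    for diff_line in compare(first, second):
        print(diff_line)
```

`normalize` is a generator, so it can also be used to write each conditioned input to a temporary file one line at a time and leave the comparison to the system "diff"; that keeps memory use low for large inputs, whereas `compare` as written holds both normalized inputs in memory.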

Making a choice among all these approaches is a matter of deciding how much you need some kind of infrastructure that will accommodate new tasks/problems that might come up later, vs. how important it is to get a specific task done sooner rather than later.
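For the database alternative, a rough sketch using Python's built-in sqlite3 module might look like the following. The schema is an assumption made for illustration only: one flat table per data source with "snp", "gene" and "transcript" columns, rather than the separate related tables a real design would probably use. The helper names `load` and `differences` are hypothetical.

```python
import sqlite3


def load(conn, table, rows):
    """Create `table` and fill it with (snp, gene, transcript) tuples.

    The table name is interpolated directly, so it must come from
    trusted code, never from user input.
    """
    conn.execute(
        f"CREATE TABLE {table} (snp TEXT, gene TEXT, transcript TEXT)"
    )
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", rows)


def differences(conn, left, right):
    """Return rows present in `left` but absent from `right`."""
    query = (
        f"SELECT snp, gene, transcript FROM {left} "
        f"EXCEPT "
        f"SELECT snp, gene, transcript FROM {right} "
        f"ORDER BY 1, 2, 3"
    )
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load(conn, "source_a", [("rs1", "G1", "T1"), ("rs2", "G2", "T2")])
    load(conn, "source_b", [("rs1", "G1", "T1"), ("rs2", "G2", "T9")])
    print("only in a:", differences(conn, "source_a", "source_b"))
    print("only in b:", differences(conn, "source_b", "source_a"))
```

Running the query in both directions is what makes this a full comparison: the two sources hold the same set of rows only if both results are empty. Note that EXCEPT compares distinct rows, so a difference in how many times a row is duplicated would not show up.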

In reply to Re: can i avoid all these nested hashes by graff
in thread can i avoid all these nested hashes by Anonymous Monk
