Re: can i avoid all these nested hashes

You said:

i would like to compare the entire 'data tree' generated from one source of data ... to another 'data tree' from what should be equivalent source of data to make sure they are the same.

I suppose your two data sources are likely to have some range of "irrelevant" differences (e.g. different amounts or placements of whitespace, or "comments", that have no impact on whether the data of interest is "identical"). If that were not an issue, you could just use the unix "diff" command on the two data sources.

BrowserUK and Grandfather have each given a valid and workable approach. The first is more a matter of expedience: it gets you directly to the specific result you want. The second is more a matter of strategic coding: it sets up an infrastructure that can easily be expanded to handle additional tasks for data of this type, without the overall code base getting too messy and difficult to maintain as more functions and conditions are added.

There are a couple more alternatives that come to mind, one being another expedient, and the other being another strategic plan:

1. Figure out a relatively simple, minimal process for conditioning your two sources into a consistent format, removing irrelevant differences in data content. Once you convert each of the inputs to a consistent, comparable form, a simple "diff" operation will suffice to say whether they are the same, and will show how they differ if they aren't the same. The kinds of data conversions you're likely to need may be very fast and use very little memory -- you're actually just "stream editing" each input file to create comparable data.

2. Create a set of relational tables in a SQL-accessible database, load your source data into "snp", "gene" and "transcript" tables as appropriate, and use queries to check for differences. This is potentially the most demanding approach, but it offers lots of flexibility for sustainable elaboration later on; add fields to the tables as needed, come up with a wider assortment of queries to answer questions you haven't thought of yet, etc.
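
To make the "stream editing" alternative concrete, here is a minimal sketch in Python (used here only for illustration; the same idea is a few lines of Perl). I'm assuming, without knowing your actual formats, that "#" starts a comment and that runs of whitespace are not significant. The names normalize and compare are made up for this example. If record order can legitimately differ between the two sources, you would also want to sort the normalized lines before diffing.

```python
import difflib
import re


def normalize(lines):
    """Yield lines with comments and insignificant whitespace removed."""
    for line in lines:
        line = line.split("#", 1)[0]              # assumption: '#' starts a comment
        line = re.sub(r"\s+", " ", line).strip()  # collapse whitespace runs
        if line:                                  # skip lines that end up empty
            yield line


def compare(source_a, source_b):
    """Return unified-diff lines; an empty list means the inputs match."""
    a = list(normalize(source_a))
    b = list(normalize(source_b))
    return list(
        difflib.unified_diff(a, b, fromfile="source_a", tofile="source_b", lineterm="")
    )
```

Each source can be any iterable of lines, such as an open file handle, so the conditioning step itself reads one line at a time; only the diff needs both normalized inputs in memory.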
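
For the database alternative, a rough sketch of the idea using Python's built-in sqlite3 module might look like the following. The schema is purely an assumption on my part: a single "snp" table with hypothetical snp_id and gene_id columns plus a "source" column saying which input each row came from. The real tables would need whatever fields your data actually has, and the "gene" and "transcript" tables would follow the same pattern.

```python
import sqlite3


def build_db(rows_a, rows_b):
    """Load (snp_id, gene_id) rows from both sources into one in-memory table."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: 'source' records which input a row came from.
    conn.execute("CREATE TABLE snp (source TEXT, snp_id TEXT, gene_id TEXT)")
    conn.executemany("INSERT INTO snp VALUES ('a', ?, ?)", rows_a)
    conn.executemany("INSERT INTO snp VALUES ('b', ?, ?)", rows_b)
    return conn


def only_in(conn, source, other):
    """Return rows present in `source` but missing from `other`."""
    query = """
        SELECT snp_id, gene_id FROM snp WHERE source = ?
        EXCEPT
        SELECT snp_id, gene_id FROM snp WHERE source = ?
        ORDER BY snp_id, gene_id
    """
    return conn.execute(query, (source, other)).fetchall()
```

Running only_in(conn, "a", "b") and then only_in(conn, "b", "a") gives the differences in each direction; if both come back empty, the two sources agree on that table.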

Making a choice among all these approaches is a matter of deciding how much you need some kind of infrastructure that will accommodate new tasks/problems that might come up later, vs. how important it is to get a specific task done sooner rather than later.

Re^2: can i avoid all these nested hashes by Anonymous Monk on Dec 16, 2010 at 01:17 UTC
very helpful answers from everyone - thanks