in reply to Formatting clue
As a rule, don't trust data. Or rather, trust but verify.
I recommend that you put in sanity checks to confirm that the first line, the last line, and everything in between are what you expect. For example, what happens if the first column does not match the pattern /Chr\d+/ (e.g. Chr21, for chromosome 21), and you get ChrX, or Chr?, or just ? instead? What happens if, instead of a reference sequence id (e.g. NT_113958) that you can pass on to the genome browser, you get something unique to the research organization providing the data (e.g. adhoc_1234)? Those are just two examples of gotchas that bit me when processing supposedly well-formed genetics data.
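To make the two checks above concrete, here is a rough sketch in Python (the same idea carries over to Perl or anything else). It assumes tab-separated input with the chromosome in the first column and the sequence id in the second; that layout, the NT_ digits pattern for the id, and the names check_line and check_lines are all made up for illustration, so adjust them to whatever your data is actually supposed to look like.

```python
import re

# Assumed formats, for illustration only; replace with your data's real spec.
CHROM_RE = re.compile(r"Chr\d+")    # the pattern from the post, e.g. Chr21
REFSEQ_RE = re.compile(r"NT_\d+")   # ids shaped like NT_113958 (assumption)


def check_line(line, lineno):
    """Return a list of problems found on one tab-separated line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        return [f"line {lineno}: expected at least 2 columns, got {len(fields)}"]

    problems = []
    chrom, seq_id = fields[0], fields[1]
    # fullmatch rejects partial hits such as "Chr21x" or "xChr21".
    if not CHROM_RE.fullmatch(chrom):
        problems.append(f"line {lineno}: unexpected chromosome {chrom!r}")
    if not REFSEQ_RE.fullmatch(seq_id):
        problems.append(f"line {lineno}: unexpected sequence id {seq_id!r}")
    return problems


def check_lines(lines):
    """Check every line (first, last, and everything in between)."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        problems.extend(check_line(line, lineno))
    return problems


if __name__ == "__main__":
    sample = [
        "Chr21\tNT_113958\n",   # what we expect
        "ChrX\tNT_113958\n",    # chromosome that \d+ does not cover
        "Chr1\tadhoc_1234\n",   # provider-specific id
        "?\n",                  # junk line
    ]
    for problem in check_lines(sample):
        print(problem)
```

Whether a bad line should be logged, skipped, or should abort the run depends on your pipeline. The point is that the program notices instead of silently passing junk downstream.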
I raise specific issues from bioinformatics, but the mindset applies to all data processing. If you have assumptions about your source data, have your program verify them.
Scott\b