
I'd say the best place for the [AGCT] regex check is as close as possible to the point where the input file is read. The less unvalidated data inside the perimeter, the better.
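
As a rough illustration of checking at read time, here is a minimal sketch in Python. It assumes a plain FASTA layout (header lines starting with ">", sequence lines containing only A, G, C and T); the names read_fasta and VALID_SEQ are made up for this example, and real input may need to allow lowercase or ambiguity codes such as N.

```python
import re

# Assumption: only the four unambiguous DNA bases are valid,
# mirroring the [AGCT] check discussed in this thread.
VALID_SEQ = re.compile(r"[AGCT]+")


def read_fasta(lines):
    """Yield (header, sequence) pairs, validating each sequence line as it is read."""
    header, chunks = None, []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            if header is None:
                raise ValueError(f"line {lineno}: sequence data before any header")
            # Reject bad data here, before it reaches the rest of the program.
            if not VALID_SEQ.fullmatch(line):
                raise ValueError(f"line {lineno}: invalid characters in sequence")
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

Because read_fasta accepts any iterable of lines, an open file handle can be passed in directly, and everything downstream only ever sees sequences that passed the check.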

Re^3: unique sequences
by kcott (Archbishop) on Dec 12, 2017 at 01:35 UTC

    In general, I'd agree with that. However, biological data can be huge, and the less processing you can get away with, the better.

    I'd probably recommend validating the FASTA file once, then setting it to "read-only". Additional checks to ensure it hasn't changed since validation might also be in order. The file can then be used multiple times with a reasonable degree of confidence in its integrity.
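
    One way the validate-once idea could look, as a hedged sketch in Python: after validation succeeds, record a SHA-256 digest in a sidecar file, make the data file read-only, and compare digests before each later use. The function names (file_digest, seal, is_unchanged) and the ".sha256" sidecar convention are assumptions for illustration, not anything prescribed in this thread.

```python
import hashlib
import os
import stat


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to cope with huge inputs."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def seal(path):
    """Call once, after validation succeeds: store the digest and make the file read-only."""
    digest = file_digest(path)
    # Hypothetical convention: keep the digest next to the data file.
    with open(path + ".sha256", "w") as fh:
        fh.write(digest + "\n")
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return digest


def is_unchanged(path):
    """Return True only if a recorded digest exists and still matches the file."""
    try:
        with open(path + ".sha256") as fh:
            recorded = fh.read().strip()
    except FileNotFoundError:
        return False  # never sealed, so treat as unvalidated
    return recorded == file_digest(path)
```

    Hashing still reads the whole file, so this only pays off when the digest check is cheaper than the full validation it replaces; comparing size and modification time would be a faster but weaker alternative.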

    Obviously, at this point, we don't know the source of the input, or even what it looks like, so validation requirements are purely guesswork.

    — Ken