in reply to Extract subset of sequences from a FASTA file

I therefore wanted to create a file that randomly selects 1000 sequences, and then put these in a new FASTA file.

The reason for all the questions is that some approaches are better than others, depending upon your answers.

For example, if you have all the sequences in a single file, then there is a method of making a random selection of those records, without loading them all into memory first.

If the combined total is very large -- i.e. a few million longish sequences, or a few billion shorter ones -- avoiding having to load them all at once can be very convenient.
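One well-known way to do this (not necessarily the exact method meant above) is reservoir sampling: read the records one at a time and keep a fixed-size pool of k records, replacing pool entries with decreasing probability as more records arrive. What follows is a minimal sketch in Python, assuming a plain-text FASTA file whose records start with a ">" header line; the function names `read_fasta`, `reservoir_sample` and `sample_fasta_file` are made up for illustration.

```python
import random

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def reservoir_sample(records, k, rng=random):
    """Return k records chosen uniformly at random from an iterator.

    Only k records are held in memory at any time. If the input has
    fewer than k records, all of them are returned.
    """
    reservoir = []
    for i, rec in enumerate(records):
        if i < k:
            reservoir.append(rec)      # fill the pool first
        else:
            j = rng.randint(0, i)      # inclusive at both ends
            if j < k:
                reservoir[j] = rec     # replace with probability k/(i+1)
    return reservoir

def sample_fasta_file(in_path, out_path, k=1000):
    """Write k randomly chosen records from in_path to out_path (hypothetical paths)."""
    with open(in_path) as src:
        chosen = reservoir_sample(read_fasta(src), k)
    with open(out_path, "w") as dst:
        for header, seq in chosen:
            dst.write(">%s\n%s\n" % (header, seq))
```

The file is read exactly once, and memory use depends on k and the sequence lengths, not on the total number of records. Note that the output is not in the original file order, and each sequence is written on a single line.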

Another example: the Bio* FASTA file-handling modules aren't very convenient for your purpose, because they only allow you to iterate over the sequences sequentially or access them by ID, not randomly. So at the very least you would have to iterate over the file(s) and copy all the IDs into a secondary data structure -- an array, say -- in order to make your random selection.
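To illustrate that two-step shape (collect the IDs, pick at random, then fetch), here is a rough sketch in plain Python rather than with the Bio* modules themselves. It assumes the ID is the first whitespace-delimited word after ">" and that IDs are unique; `collect_ids`, `pick_ids` and `filter_fasta` are hypothetical helper names.

```python
import random

def collect_ids(lines):
    """First pass: copy every record ID into a list."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:                  # skip headers with no ID at all
                ids.append(parts[0])
    return ids

def pick_ids(ids, k, rng=random):
    """Choose up to k distinct positions from the ID list at random."""
    return set(rng.sample(ids, min(k, len(ids))))

def filter_fasta(lines, wanted):
    """Second pass: yield the raw lines of records whose ID was picked."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            keep = bool(parts) and parts[0] in wanted
        if keep:
            yield line
```

This needs two passes over the file(s) and holds every ID in memory, which is fine for modest inputs but is the overhead the single-pass approach avoids. The records come out in their original file order.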

It's also not clear from your description whether you are selecting a single file of 1000 sequences across all the primer sets; or 1000 from each primer set.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

