in reply to Extract subset of sequences from a FASTA file
I therefore wanted to create a file that randomly selects 1000 sequences, and then put these in a new FASTA file.
(say, one per primer)?
Using standard file handling -- readline & print -- or some specialist module?
The reason for all the questions is that some approaches are better than others, depending upon your answers.
For example, if you have all the sequences in a single file, then there is a method of making a random selection of those records, without loading them all into memory first.
If the combined total is very large -- i.e. a few million longish sequences, or a few billion shorter ones -- not having to load them all at once can be very convenient.
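One technique that fits that description is reservoir sampling, which may or may not be the one I had in mind above. Here is a minimal sketch in Python, assuming plain-text FASTA where each record starts with a ">" line; the function names `read_fasta` and `reservoir_sample` are made up for illustration. It keeps at most k records in memory, however big the file is.

```python
import io
import random

def read_fasta(handle):
    """Yield one FASTA record at a time as a (header, sequence) tuple."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def reservoir_sample(records, k, rng=random):
    """Pick k records uniformly at random from an iterable (Algorithm R).

    Only the k currently selected records are held in memory.
    If there are fewer than k records, all of them are returned.
    """
    sample = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)          # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                sample[j] = rec         # replace with probability k/(i+1)
    return sample

# Tiny in-memory demo; a real run would pass an open file instead.
demo = io.StringIO(">a\nACGT\n>b\nGGCC\nTTAA\n>c\nAAAA\n")
for header, seq in reservoir_sample(read_fasta(demo), 2):
    print(f">{header}\n{seq}")
```

For the case in the question, k would be 1000 and the handle would be the open FASTA file. Note that this sketch writes each sequence back out as a single line, so the original line wrapping is not preserved.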
Another example: the Bio* FASTA file handling modules aren't very convenient for your purpose because they only allow you to iterate over the sequences sequentially, or access them by ID, not at random. So at the very least you would have to iterate over the file(s) and copy all the IDs into a secondary data structure -- an array, say -- in order to make your random selection.
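To make that two-step approach concrete, here is a rough sketch in Python using plain file handling rather than any Bio* module. It assumes the ID is the first whitespace-delimited word after ">" and that IDs are unique within the file; `collect_ids` and `write_selected` are hypothetical helper names, not part of any library.

```python
import io
import random

def first_word(header_line):
    """Return the ID: the first word after '>' (empty string if none)."""
    fields = header_line[1:].split()
    return fields[0] if fields else ""

def collect_ids(handle):
    """First pass: copy every record ID into a list."""
    return [first_word(line) for line in handle if line.startswith(">")]

def write_selected(src, dst, wanted):
    """Second pass: copy records whose ID is in `wanted`, lines unchanged."""
    keep = False
    for line in src:
        if line.startswith(">"):
            keep = first_word(line) in wanted
        if keep:
            dst.write(line)

# Tiny in-memory demo; real use would open the FASTA file twice.
text = ">s1 primerA\nACGT\n>s2 primerA\nGGCC\n>s3 primerB\nTTAA\n"
ids = collect_ids(io.StringIO(text))
wanted = set(random.sample(ids, 2))     # choose 2 IDs without replacement
out = io.StringIO()
write_selected(io.StringIO(text), out, wanted)
print(out.getvalue(), end="")
```

The list of IDs is far smaller than the sequences themselves, but it still grows with the number of records, and the file has to be read twice.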
It's also not clear from your description whether you want a single file of 1000 sequences selected across all the primer sets, or 1000 from each primer set.