Extract subset of sequences from a FASTA file

silver_steve has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Extract subset of sequences from a FASTA file by CountZero (Bishop) on Jan 13, 2012 at 20:20 UTC
Easy. Put the sequences in an array, shuffle the array and then take the first 1000 elements. `use Modern::Perl; use List::Util qw/shuffle/; my @sequences = (shuffle <DATA>)[0 .. 9]; chomp @sequences; say "@sequences" __DATA__ first second third fourth fifth sixth seventh eight ninth tenth eleventh twelveth thirteenth fourteenth fifteenth sixteenth seventeenth eighteenth nineteenth twentieth` [download] Or, taking advantage of the pseudo-random order of the hash-keys: `my %sequences = map {chomp; $_ => 1} <DATA>; my @sequences = (keys %sequences)[0 ..9]; say "@sequences";` [download] CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James	[reply] [d/l] [select]
Re: Extract subset of sequences from a FASTA file by BrowserUk (Patriarch) on Jan 13, 2012 at 20:24 UTC
I therefore wanted to create a file that randomly selects 1000 sequences, and then put these in a new FASTA file. How many sequences are you starting with? What is their total combined size? Are they all in a single file, or in separate files? (say, one per primer)? How are you reading and writing your files? Using standard file handling -- readline & print -- or some specialist module? The reason for all the questions is that some approaches are better than other depending upon your answers. For example, if you have all the sequences in a single file, then there is a method of making a random selection of those records, without loading them all into memory first. Which if the combined total is very large -- ie. a few million longish sequences; or a few billion shorter ones -- avoiding having to load them all at once can be very convenient. Another example. The Bio* FASTA file handling modules aren't very convenient for your purpose because they only allow you to iterate over the sequences sequentially, or access by ID; not randomly. So at the very least you would have to iterate over the file(s) and copy all the ids into a secondary data-structure -- an array say -- in order to make your random selection. It's also not clear from your description whether you are selecting a single file of 1000 sequences across all the primer sets; or 1000 from each primer set. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. The start of some sanity?	[reply]
Re: Extract subset of sequences from a FASTA file by InfiniteSilence (Curate) on Jan 14, 2012 at 18:26 UTC
We basically get questions like this from time to time here on Perlmonks regarding stuff that is probably already being done somewhere in BioPerl. It is not that we don't like answering Perl questions but you might be better served Googling 'bioperl forum.' Celebrate Intellectual Diversity	[reply]
Re^2: Extract subset of sequences from a FASTA file by educated_foo (Vicar) on Jan 14, 2012 at 23:46 UTC
BioPerl is usually an over-complicated mess. You want to generate a random set of indices from a range (with or without replacement?), then get the sequences at those indices. BioPerl won't help you do this.	[reply]
Re^3: Extract subset of sequences from a FASTA file by InfiniteSilence (Curate) on Jan 19, 2012 at 03:34 UTC
I was going to disagree with you but I figured that I should try to download the BioPerl core and try this out. Well, that was something like 1/2 hour ago. Apparently the 'core' is stored on github rather than CPAN and that may be the problem. If I just want a handful of packages to solve a particular problem I probably should not have to download this huge thing. If I had known this before I probably would not have made this suggestion. Celebrate Intellectual Diversity	[reply]