silver_steve has asked for the wisdom of the Perl Monks concerning the following question:

I have sequence data that was created using several different primers. I found that the samples have large variation in sequences, depending on the primer used. I therefore wanted to create a file that randomly selects 1000 sequences, and then put these in a new FASTA file. My strategy is to put the sequences in a hash, split by the '>'. Then, assign each a number, and then randomly select the numbers that are assigned to the sequences, and then put this in a new FASTA file. I haven't gotten this to work yet, but does anyone know a simpler method for doing this?
  • Comment on Extract subset of sequences from a FASTA file

Replies are listed 'Best First'.
Re: Extract subset of sequences from a FASTA file
by CountZero (Bishop) on Jan 13, 2012 at 20:20 UTC
    Easy.

    Put the sequences in an array, shuffle the array and then take the first 1000 elements.

    use Modern::Perl; use List::Util qw/shuffle/; my @sequences = (shuffle <DATA>)[0 .. 9]; chomp @sequences; say "@sequences" __DATA__ first second third fourth fifth sixth seventh eight ninth tenth eleventh twelveth thirteenth fourteenth fifteenth sixteenth seventeenth eighteenth nineteenth twentieth
    Or, taking advantage of the pseudo-random order of the hash-keys:
    my %sequences = map {chomp; $_ => 1} <DATA>; my @sequences = (keys %sequences)[0 ..9]; say "@sequences";

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Extract subset of sequences from a FASTA file
by BrowserUk (Patriarch) on Jan 13, 2012 at 20:24 UTC
    I therefore wanted to create a file that randomly selects 1000 sequences, and then put these in a new FASTA file.
    • How many sequences are you starting with?
    • What is their total combined size?
    • Are they all in a single file, or in separate files?

      (say, one per primer)?

    • How are you reading and writing your files?

      Using standard file handling -- readline & print -- or some specialist module?

    The reason for all the questions is that some approaches are better than other depending upon your answers.

    For example, if you have all the sequences in a single file, then there is a method of making a random selection of those records, without loading them all into memory first.

    Which if the combined total is very large -- ie. a few million longish sequences; or a few billion shorter ones -- avoiding having to load them all at once can be very convenient.

    Another example. The Bio* FASTA file handling modules aren't very convenient for your purpose because they only allow you to iterate over the sequences sequentially, or access by ID; not randomly. So at the very least you would have to iterate over the file(s) and copy all the ids into a secondary data-structure -- an array say -- in order to make your random selection.

    It's also not clear from your description whether you are selecting a single file of 1000 sequences across all the primer sets; or 1000 from each primer set.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

Re: Extract subset of sequences from a FASTA file
by InfiniteSilence (Curate) on Jan 14, 2012 at 18:26 UTC

    We basically get questions like this from time to time here on Perlmonks regarding stuff that is probably already being done somewhere in BioPerl. It is not that we don't like answering Perl questions but you might be better served Googling 'bioperl forum.'

    Celebrate Intellectual Diversity

      BioPerl is usually an over-complicated mess. You want to generate a random set of indices from a range (with or without replacement?), then get the sequences at those indices. BioPerl won't help you do this.

        I was going to disagree with you but I figured that I should try to download the BioPerl core and try this out. Well, that was something like 1/2 hour ago. Apparently the 'core' is stored on github rather than CPAN and that may be the problem. If I just want a handful of packages to solve a particular problem I probably should not have to download this huge thing. If I had known this before I probably would not have made this suggestion.

        Celebrate Intellectual Diversity