Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I have a fasta file with about 400 DNA sequences. I'm trying to come up with a perl script to remove all sequences that have less than 500 nucleotides and write the >500 nucleotides sequences to a new file. Here is an example of what the fasta file looks like:

>C1_A01_R.trimmed.seq (Quality-trimmed) Agencourt Bioscience Corporation ABI

GGCCGCCAGTGTGCTGGAATCCGCCCTTAACCTGGTTGATCCCGCCAGTAGTCATACGCT CGTCTCAAAGATTAAGCCATGCATGTCTAAGTATAACTCTTTTACTTTGAAAACTGCGAA CGGCTCATTATATCAGTTATAGTTTATTTGATAGTCCCTTACTACTTGGATACCCGTAGT AATTCTAGAGCTAATACATGCATCAATACCCAACTGTTCGCGGAAGGGTAGTATTTATTA GGTATAGACCAACCGTCTTCGGACGTGCTTTGGTGATTCATAATAACTTTTCGAATCGCA TGGCTCCATGCCGGCGATGGATCATTCAAGTTTCTGCCCTATCAGCTTTGG

>C1_A03_R.trimmed.seq (Quality-trimmed) Agencourt Bioscience Corporation ABI

CCGAAGTAATTCTAGAGCTAATACATGCA

>C1_A04_R.trimmed.seq (Quality-trimmed) Agencourt Bioscience Corporation ABI

TAGTAACGGCCGCCAGTGTGCTGGAATTCGCCCTTAACCTGGTTGATCCTGCCAGTAGTC ATACGCTCGTCTCAAAGATTAGGCCATGCATGTCTAAGTATAACTCTTTTACTTTGAAAA CTGCGAACGGCTCATTATATCAGTTATAGTTTATTTGATAGTCCCTTACTACTTGGATAC CCGTAGTAATTCTAGAGCTAATACATGCATCAATACCCGACTGTTCGCGGAAGGGTAGTA TTTATTAGGTATAGACCAACCGTCTTCGGACGTGCTTTGGTGATTCATAATAACTTTTCG AATCGCATGGCTCCATGCCGGCGATGGATCATTCAAGTTTCTGCCCTATCAGCTTTGGAT GGTAGTGTATTGGACTACCATGGCTTTAACGGGTAACGAATTGTTAGGGCAAGATTTCGG AGAGGGAGCCTGAGAGACGGCTACCACATCCAAGGAAGGCAGCGGGCGCGTAAATTACCC

Does anyone out there have any scripting suggestions????

Replies are listed 'Best First'.
Re: Parse DNA fasta file
by Anonymous Monk on Oct 29, 2009 at 21:05 UTC
Re: Parse DNA fasta file
by ack (Deacon) on Oct 30, 2009 at 02:46 UTC

    I'm not sure I understand what you're looking for.

    Are you looking for how parse a FASTA file? Or are you asking about how to find the >500 nucleotide sequences?

    The previous responder, Anonymous Monk, provided an ideal strategy for easily reading FASTA files. My very first foray into learning Perl several years back was via the "Perl for Bioinformatics" book. It referenced that package; but I changed to a different learning text before I really did anything with that package. Hence, I don't know what all it does. It may also have utilities for doing things like you finding the kind of nucleotide sequences you're seeking.

    If you're looking for strategies for finding the >500 nucleotide sequences and Anonymous Monk's suggested package doesn't help, then I'd be happy to try to help. But I don't want to invest that effort if that's not what you're after.

    ack Albuquerque, NM
Re: Parse DNA fasta file
by arun_kom (Monk) on Oct 30, 2009 at 17:56 UTC

    Would have been better for you if you showed us what you came up with ... but anyway, this is one way to do it

    use Bio::SeqIO my $in = Bio::SeqIO->new(-file => "test.fa" , '-format' => 'Fasta'); my $out = Bio::SeqIO->new(-file => ">out.fa" , '-format' => 'Fasta'); while ( my $seq = $in->next_seq() ) { $out->write_seq($seq) if($seq->length > 500); }

      "Would have been better for you if you showed us what you came up with ... but anyway, this is one way to do it

      use Bio::SeqIO my $in = Bio::SeqIO->new(-file => "test.fa" , '-format' => 'Fasta'); my $out = Bio::SeqIO->new(-file => ">out.fa" , '-format' => 'Fasta'); while ( my $seq = $in->next_seq() ) { $out->write_seq($seq) if($seq->length > 500); } "

      Thank you very much for suggesting this script and Bioperl. I have downloaded Bioperl and tried to get the script work with my FASTA file but it returns:

      "Can't call method "next_seq" on an undefined value at parse_sequence_length.pl line 6".

      Any suggestions?

      Thanks!

        Check the return value from the constructor invocations and see what that tells you:

        my $in = Bio::SeqIO->new(-file => "test.fa" , '-format' => 'Fasta') or die "Failed to open input file: $!";

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        Probably your input fasta file cannot be found as it is not in the path ... try specifying the entire path to your file instead of the file name alone ... i missed it in my example code but BrowserUk's extension to it will help identify any error and is a good practice to always check so.

Re: Parse DNA fasta file
by BioLion (Curate) on Oct 30, 2009 at 17:27 UTC

    Bio::SeqIO::Fasta is the most commonly used and better supported fasta munging modules, and it returns Bio::Seq object for each FASTA entry, which has methods for determining length amongst a lot of other things.

    It should be fairly simple, but post again with your code if you encounter any other problems.

    Just a something something...