Re: Size of sequences in fastafile

G'day Sofie,

I normally handle fasta files by making the record terminator the '>'; note that the first record is a zero-length string which you'll typically want to discard. In your particular case here, you can split that record on a newline: the sequence will be the second element. Look at the code below and then some explanatory notes at the end.

#!/usr/bin/env perl

use strict;
use warnings;

use constant MIN_LENGTH => 10;

my $long_seqs = 0;

{
    local $/ = '>';

    while (<DATA>) {
        next if $. == 1;
        my $seq = (split /\n/)[1];
        next if MIN_LENGTH > length $seq;
        ++$long_seqs;
        print $seq, "\n";
    }
}

if (not $long_seqs) {
    print 'No sequence is over ', MIN_LENGTH-1, "\n";
}

__DATA__
>NM_001 Homo sapiens ADA2 (CECR1)
GATCCAA
>NM_002 Homo sapiens IKBKG
GGAGGTCTTTAGCTTTAGGGAAACCC
> sequence length = 9 -- should be discarded
ABCDEFGHI
> sequence length = 10 -- should be kept
ABCDEFGHIJ
> sequence length = 11 -- should be kept
ABCDEFGHIJK
[download]

Sample run:

$ ./pm_11113575_fasta_extract.pl
GGAGGTCTTTAGCTTTAGGGAAACCC
ABCDEFGHIJ
ABCDEFGHIJK
[download]

Changing the MIN_LENGTH constant to 100:

$ ./pm_11113575_fasta_extract.pl
No sequence is over 99
[download]

Notes:

See "perlvar: Variables related to filehandles" for an explanation of '$/', '$.', and why I've put the local and while in an anonymous block.
See "local" for more about that function; that page has links to further information.
Purely for demonstration purposes, I've put the data under __DATA__ and read it with the DATA filehandle (see "perldata: Special Literals" for more about those). Your code will have something more like '(<$fh>)'.
See "open" for a better way to open your files: using lexical filehandles and the 3-argument form of that command.
Your error reporting leaves much to be desired: which file couldn't open? why couldn't it open? It might be fine for a tiny test script like this; however, that inevitably gets expanded and built upon — then you can end up with bugs that are hard to track down (exacerbated by the use of package variables for filehandles). Take a look at the "autodie" pragma: with this you don't have to (remember to) write 'or die "..."' (no point doing extra work when Perl can do it for you) and you also get much better reporting of which file had what problem.
Note that I extended your test data — yes, I know those aren't valid sequences but they were handier for quickly checking results than random strings of nucleotides. Off-by-one errors are extremely common and very easy to make: you actually have one in the code you posted ("... No sequence is over ...") — compare with my code to spot the problem. In this case, using test (pseudo-)sequences that are 'MIN_LENGTH-1', 'MIN_LENGTH' and 'MIN_LENGTH+1', you can quickly catch problems like that.
Don't use the '-w' switch: see "perlrun: -w" for the reason. Always use the strict and warnings pragmata, as I've shown in my code: see "perlintro: Safety net" for the reason.

— Ken

Comment on Re: Size of sequences in fastafile Select or Download Code


more useful options
	PerlMonks