Size of sequences in fastafile

Sofie has asked for the wisdom of the Perl Monks concerning the following question:

Hi I have a fasta file containing a number of sequences. I would like to extract only the sequences that are of a specific length (lets's say >10 nt). The sequences are wrapped with new lines, so when trying to read into an array, each line becomes an element. How can I removed these? I have tried join, but it also removes the newline between sequences and puts all of it into single string. How do I separate each sequence into an element or string?

The file looks something like this:


>NM_001 Homo sapiens ADA2 (CECR1)
GATCCAA
>NM_002 Homo sapiens IKBKG
GGAGGTCTTTAGCTTTAGGGAAACCC
[download]

Output should be only the second sequence in this case.

#!/usr/bin/perl -w


#open the fastfile Genes.fasta
open (GENES, "Genes.fasta") or die "Could not open file";
chomp (@seq = <GENES>);
$seq = join ("\n", @seq);

$lengthseq = length $seq;

#min length of the seq
$minlength = 10;

#if length is over a certain size, print
if ($lengthseq > $minlength){
    print $seq;
    }
else {print "No sequence is over $minlength;"
}
[download]

Comment on Size of sequences in fastafile Select or Download Code

Replies are listed 'Best First'.
Re: Size of sequences in fastafile by zubenel0 (Sexton) on Mar 01, 2020 at 07:18 UTC
Tasks like this can be solved by using BioPerl. `use strict; use warnings; use Bio::SeqIO; # Reading the fasta file my $seqio_obj = Bio::SeqIO->new(-file => "Genes.fasta", -format => "fasta" ); # Looping through sequences and printing them unless length <= 10 + while ( my $seq_obj = $seqio_obj->next_seq ) { print $seq_obj->seq,"\n" unless $seq_obj->length <= 10; } # The output should be : "GGAGGTCTTTAGCTTTAGGGAAACCC".` [download]	[reply] [d/l]
Re: Size of sequences in fastafile by hippo (Bishop) on Feb 29, 2020 at 14:08 UTC
It is difficult to know quite what data structure and form you want from the description (See I know what I mean. Why don't you?). I've assumed you want an array formed with the even lines appended to the odd lines with a space between. Here's an SSCCE to demonstrate one technique for this. `use strict; use warnings; use Test::More tests => 1; my @seq; my @want = ( '>NM_001 Homo sapiens ADA2 (CECR1) GATCCAA', '>NM_002 Homo sapiens IKBKG GGAGGTCTTTAGCTTTAGGGAAACCC', ); while (<DATA>) { chomp; if (/^>/) { push @seq, "$_ "; } else { $seq[-1] .= $_; } } is_deeply \@seq, \@want; __DATA__ >NM_001 Homo sapiens ADA2 (CECR1) GATCCAA >NM_002 Homo sapiens IKBKG GGAGGTCTTTAGCTTTAGGGAAACCC` [download] See also How to ask better questions using Test::More and sample data. HTH.	[reply] [d/l]
Re^2: Size of sequences in fastafile by Sofie (Acolyte) on Feb 29, 2020 at 16:17 UTC
Thanks for the reply. But the code was not what I was looking for. Basically I just want to figure out the length of each sequence in a multifasta file. The files contains many sequences, each sequence has new lines inside, and I want to know how many nucleotides each sequence is and then extract only the sequences with more than x number of nt. Not sure if that makes sense? How to figure out the length of a sequence from a multifasta sequence file? Thanks!	[reply]
Re^3: Size of sequences in fastafile by hippo (Bishop) on Feb 29, 2020 at 16:51 UTC
It's still not very clear precisely what you expect but here's my final stab at this. use strict; use warnings; use Test::More tests => 1; # Present this SSCCE as a test my @seq; # Sequences longer than $min are stored here my @want = ( 'GGAGGTCTTTAGCTTTAGGGAAACCC', ); # These are the sequeneces we expect for the given data set my $min = 15; # Minimum length of any sequence to be considered my $this = ''; while (<DATA>) { chomp; if (/^>/) { push @seq, $this if $min <= length $this; $this = ''; } else { $this .= $_; } } push @seq, $this if $min <= length $this; is_deeply \@seq, \@want; # Check that our algorithm has worked __DATA__ >NM_001 Homo sapiens ADA2 (CECR1) GATCCAA >NM_002 Homo sapiens IKBKG GGAGGTCTTTAGCTTTAGGGAAACCC [download]	[reply] [d/l]
Re^4: Size of sequences in fastafile by Sofie (Acolyte) on Mar 01, 2020 at 12:03 UTC
Re^5: Size of sequences in fastafile by zubenel0 (Sexton) on Mar 01, 2020 at 12:39 UTC
Re^4: Size of sequences in fastafile by Sofie (Acolyte) on Mar 01, 2020 at 11:03 UTC
Re: Size of sequences in fastafile by BillKSmith (Monsignor) on Feb 29, 2020 at 20:23 UTC
I searched CPAN for 'fasta' and found several modules for extracting sequences. You probably can find one that meets your needs. Bill	[reply]
Re: Size of sequences in fastafile by kcott (Archbishop) on Mar 02, 2020 at 08:33 UTC
G'day Sofie, I normally handle fasta files by making the record terminator the '`>`'; note that the first record is a zero-length string which you'll typically want to discard. In your particular case here, you can split that record on a newline: the sequence will be the second element. Look at the code below and then some explanatory notes at the end. #!/usr/bin/env perl use strict; use warnings; use constant MIN_LENGTH => 10; my $long_seqs = 0; { local $/ = '>'; while (<DATA>) { next if $. == 1; my $seq = (split /\n/)[1]; next if MIN_LENGTH > length $seq; ++$long_seqs; print $seq, "\n"; } } if (not $long_seqs) { print 'No sequence is over ', MIN_LENGTH-1, "\n"; } __DATA__ >NM_001 Homo sapiens ADA2 (CECR1) GATCCAA >NM_002 Homo sapiens IKBKG GGAGGTCTTTAGCTTTAGGGAAACCC > sequence length = 9 -- should be discarded ABCDEFGHI > sequence length = 10 -- should be kept ABCDEFGHIJ > sequence length = 11 -- should be kept ABCDEFGHIJK [download] Sample run: `$ ./pm_11113575_fasta_extract.pl GGAGGTCTTTAGCTTTAGGGAAACCC ABCDEFGHIJ ABCDEFGHIJK` [download] Changing the `MIN_LENGTH` constant to `100`: `$ ./pm_11113575_fasta_extract.pl No sequence is over 99` [download] Notes: See "perlvar: Variables related to filehandles" for an explanation of '`$/`', '`$.`', and why I've put the `local` and `while` in an anonymous block. See "local" for more about that function; that page has links to further information. Purely for demonstration purposes, I've put the data under `__DATA__` and read it with the `DATA` filehandle (see "perldata: Special Literals" for more about those). Your code will have something more like '`(<$fh>)`'. See "open" for a better way to open your files: using lexical filehandles and the 3-argument form of that command. Your error reporting leaves much to be desired: which file couldn't open? why couldn't it open? It might be fine for a tiny test script like this; however, that inevitably gets expanded and built upon — then you can end up with bugs that are hard to track down (exacerbated by the use of package variables for filehandles). Take a look at the "autodie" pragma: with this you don't have to (remember to) write '`or die "..."`' (no point doing extra work when Perl can do it for you) and you also get much better reporting of which file had what problem. Note that I extended your test data — yes, I know those aren't valid sequences but they were handier for quickly checking results than random strings of nucleotides. Off-by-one errors are extremely common and very easy to make: you actually have one in the code you posted ("`... No sequence is over ...`") — compare with my code to spot the problem. In this case, using test (pseudo-)sequences that are '`MIN_LENGTH-1`', '`MIN_LENGTH`' and '`MIN_LENGTH+1`', you can quickly catch problems like that. Don't use the '`-w`' switch: see "perlrun: -w" for the reason. Always use the strict and warnings pragmata, as I've shown in my code: see "perlintro: Safety net" for the reason. — Ken	[reply] [d/l] [select]

Back to Seekers of Perl Wisdom