Re: How to format fasta file

Hi,

First, your DNA sequences are shorter than 15 characters. I'll assume that is just a mistake in your example, but this may need further clarification.

Assuming that you want to:

- omit lines starting with ">" (headers);

- remove lines longer than 30 or shorter than 15 characters, and

- count the number of occurrences of each individual sequence,

you could do something like this:

use strict;
use warnings;

my %count_seq;
while (<DATA>) { 
    chomp;
    next if /^>/;     # discard headers
    next if length($_) > 30 or length($_) < 15; # discard unwanted sizes
    $count_seq{$_}++;   # count occurrences
}
print "$_\t$count_seq{$_}\n" for keys %count_seq;

__DATA__
>dfbdbgf_356dfbdf
ATGGCTGGATATCGATT
>sdgthhr_478364df
ATGGCTATGGATCAGATT
>dfbdbgf_356dfbdf
ATGGCTATCGATT
>dfbdbgf_356dfbdg
ATGGCTGGATATCGATT
>sdgthhr_478364df
ATGGCTATGGATCGATT
>dfbdbgf_356dfbdg
ATGGCTGGATATCGATT
>sdgthhr_478364df
ATGGCTATGGATCGATT
TGCATGCGCTATTAGCG
ATGGCTATGGATCGATT
TGCATGCGCTATTAGCG
ATGGCTATGGATCGATT
TGCATGCCCTATTAGCG

This gives me the following result:

$ perl dna_seq.pl
TGCATGCGCTATTAGCG       2
ATGGCTATGGATCAGATT      1
ATGGCTGGATATCGATT       3
ATGGCTATGGATCGATT       4
TGCATGCCCTATTAGCG       1
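
Note that `keys %count_seq` returns the keys in no particular order, so the lines may come out in a different order on another run. If a stable order matters, one option is to sort by descending count. This is only a sketch: `sorted_by_count` is a hypothetical helper name, and the hash below is simply filled with the counts from the run above instead of being built by the `while` loop.

```perl
use strict;
use warnings;

# Counts copied from the sample run above; in the real script this hash
# is filled by the while loop.
my %count_seq = (
    TGCATGCGCTATTAGCG  => 2,
    ATGGCTATGGATCAGATT => 1,
    ATGGCTGGATATCGATT  => 3,
    ATGGCTATGGATCGATT  => 4,
    TGCATGCCCTATTAGCG  => 1,
);

# Hypothetical helper: keys ordered by descending count,
# ties broken alphabetically so the output is reproducible.
sub sorted_by_count {
    my ($counts) = @_;
    return sort { $counts->{$b} <=> $counts->{$a} or $a cmp $b } keys %$counts;
}

print "$_\t$count_seq{$_}\n" for sorted_by_count(\%count_seq);
```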

Re^2: How to format fasta file by andyBio (Novice) on Apr 10, 2016 at 20:58 UTC
Thanks for the response. I tried the code but got an error:

Use of uninitialized value $_ in scalar chomp at test.pl line 37.
Use of uninitialized value $_ in pattern match (m//) at test.pl line 38.
Use of uninitialized value $_ in numeric gt (>) at test.pl line 39.
Use of uninitialized value $_ in numeric lt (<) at test.pl line 39.

Lines 37 - 39 are:

chomp;
next if /^>/; # discard headers
next if length($_) > 30 or length($_) < 15; # discard unwanted sizes

Kindly assist. Thanks
Re^3: How to format fasta file by ww (Archbishop) on Apr 10, 2016 at 21:36 UTC
Assisting: both errors appear to refer to a lack of content in `$_`, the default variable set by various operations. `$_` is explicitly mentioned in your line 3 and implicit in your line 2. Inserting a `print "$_\n";` before the `chomp` will probably confirm or rebut my hypothesis. Then, if I'm on target, you can read up to find what you expected to set the default.

Spirit of the Monastery
Re^3: How to format fasta file by FreeBeerReekingMonk (Deacon) on Apr 10, 2016 at 23:24 UTC
Andy, better change that `<DATA>` into `<STDIN>`, then you can run either:

Windows: `type fasta.txt | perl laurent_example.pl`
Unix: `cat fasta.txt | perl laurent_example.pl`

And to get rid of your current error, just add this line after the `chomp;`:

`next if($_=~/^\s*$/);`
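
Put together, the two changes suggested here (read from `STDIN`, skip blank lines) might look like the sketch below. It is an illustration and not the script anyone in this thread actually ran: `count_sequences` is a hypothetical helper name, and the `-t STDIN` check is an extra assumption so that the script prints a hint instead of waiting on an interactive terminal.

```perl
use strict;
use warnings;

# Hypothetical helper: count sequences read from any filehandle.
sub count_sequences {
    my ($fh) = @_;
    my %count_seq;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*$/;          # skip blank lines
        next if $line =~ /^>/;             # discard headers
        my $len = length $line;
        next if $len > 30 or $len < 15;    # discard unwanted sizes
        $count_seq{$line}++;               # count occurrences
    }
    return %count_seq;
}

# Only read when input is piped or redirected, not typed at a terminal.
if (-t STDIN) {
    print STDERR "Usage: perl $0 < fasta.txt\n";
}
else {
    my %count_seq = count_sequences(\*STDIN);
    print "$_\t$count_seq{$_}\n" for sort keys %count_seq;
}
```

Taking the filehandle as a parameter means the same sub works unchanged with a handle returned by `open`.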
Re^3: How to format fasta file by LanX (Saint) on Apr 10, 2016 at 21:34 UTC
Look, Perlmonks is not a code writing service; we expect you to show some effort to learn programming.

Hint: check how to open a file instead of reading appended data like in Laurent's demonstration.

Update: ... and/or avoid empty lines after `__DATA__` if you are REALLY only running the demo code!

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :) Je suis Charlie!)
Re^3: How to format fasta file by Laurent_R (Canon) on Apr 11, 2016 at 06:18 UTC
Hi andyBio, the program I posted did not have any of the errors or warnings that you report. This most probably means that the error is somewhere in the changes you made to the code. Although I have some ideas on what might be wrong in your program (along the lines of the hints supplied by other monks), we can't fix a program that we don't see. The only way we can help you is if you show the code you're now using, with the changes you made.
Re^4: How to format fasta file by andyBio (Novice) on Apr 11, 2016 at 18:17 UTC
Thanks a lot, Laurent R. Here is my complete code:

#!/usr/bin/perl -w
use strict;
use warnings;

my $num_args = @ARGV;
if ($num_args != 3) {
    print "\nUsage: $0 (-q|-a) <species> <input.(fq|fa)>";
    print "\nUse -q for fastq and -a for fasta files";
    exit;
}

my $key=$ARGV[0];
my $species=$ARGV[1];
my $input=$ARGV[2];

if ($key ne "-q" && $key ne "-a"){
    print "Unexpected option: $key, use -q or -a.\n";
    exit;
}

if($key eq "-q") {
    print "Option q was selected.\n";
    exit;
}
elsif( $key eq "-a" ) {
    print "Option a was selected.\n";
    my %count_seq;
    open(my $fh, '<:encoding(UTF-8)', $input);
    while ($fh) {
        chomp;
        next if /^>/; # discard headers
        next if length($_) > 30 or length($_) < 15; # discard unwanted sizes
        $count_seq{$_}++; # count occurrences
    }
    print "$_\t$count_seq{$_}\n" for keys %count_seq;
    exit;
}

And here is how I run it:

perl test.pl -a spe try.txt > result

It accepts either -q or -a as options. The error I get is this:

Use of uninitialized value $_ in scalar chomp at test.pl line 39.
Use of uninitialized value $_ in pattern match (m//) at test.pl line 40.
Use of uninitialized value $_ in numeric gt (>) at test.pl line 41.
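
One thing that stands out in the script above is `while ($fh)`. That tests whether the filehandle itself is true; it never reads a line, so `$_` is never set, which would explain the "uninitialized value" warnings. Reading needs the angle brackets: `while (<$fh>)`. Below is a minimal sketch of the `-a` branch with that change, assuming nothing else needs to differ. `count_fasta_file` is a hypothetical helper name, and the `or die` check on `open` is an addition that is not in the original script.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the -a branch of the script.
sub count_fasta_file {
    my ($path) = @_;
    # Added check (not in the original script): stop if the file cannot be opened.
    open(my $fh, '<:encoding(UTF-8)', $path)
        or die "Cannot open '$path': $!";
    my %count_seq;
    local $_;                # keep the caller's $_ untouched
    while (<$fh>) {          # <$fh> reads one line into $_; a bare $fh does not
        chomp;
        next if /^>/;                                 # discard headers
        next if length($_) > 30 or length($_) < 15;   # discard unwanted sizes
        $count_seq{$_}++;                             # count occurrences
    }
    close $fh;
    return %count_seq;
}

# Simplified usage for illustration: takes just the file name as the only argument.
if (@ARGV) {
    my %count_seq = count_fasta_file($ARGV[0]);
    print "$_\t$count_seq{$_}\n" for keys %count_seq;
}
```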
Re^5: How to format fasta file by Laurent_R (Canon) on Apr 11, 2016 at 18:52 UTC
Re^6: How to format fasta file by andyBio (Novice) on Apr 12, 2016 at 00:22 UTC

update