in reply to How to format fasta file

Hi,

First, your DNA sequences are fewer than 15 characters long. I'll assume this is just a mistake in your example, but it may need further clarification.

Assuming that you want to:

- omit lines starting with ">" (headers);

- remove lines longer than 30 or shorter than 15 characters, and

- count the number of occurrences of each individual sequence,

you could do something like this:

use strict;
use warnings;

my %count_seq;
while (<DATA>) {
    chomp;
    next if /^>/;                                  # discard headers
    next if length($_) > 30 or length($_) < 15;    # discard unwanted sizes
    $count_seq{$_}++;                              # count occurrences
}
print "$_\t$count_seq{$_}\n" for keys %count_seq;

__DATA__
>dfbdbgf_356dfbdf
ATGGCTGGATATCGATT
>sdgthhr_478364df
ATGGCTATGGATCAGATT
>dfbdbgf_356dfbdf
ATGGCTATCGATT
>dfbdbgf_356dfbdg
ATGGCTGGATATCGATT
>sdgthhr_478364df
ATGGCTATGGATCGATT
>dfbdbgf_356dfbdg
ATGGCTGGATATCGATT
>sdgthhr_478364df
ATGGCTATGGATCGATT
TGCATGCGCTATTAGCG
ATGGCTATGGATCGATT
TGCATGCGCTATTAGCG
ATGGCTATGGATCGATT
TGCATGCCCTATTAGCG

This gives me the following result:

$ perl dna_seq.pl
TGCATGCGCTATTAGCG     2
ATGGCTATGGATCAGATT    1
ATGGCTGGATATCGATT     3
ATGGCTATGGATCGATT     4
TGCATGCCCTATTAGCG     1
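
Note that `keys %count_seq` returns the sequences in no particular order, so the result lines may come out in a different order from one run to the next. If a stable order matters, one possibility is to sort by descending count and then alphabetically. This is only a sketch, not part of the program above; the `sorted_counts` name is made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a reference to a hash of sequence => count
# and returns "sequence<TAB>count" strings, most frequent first,
# with ties broken alphabetically.
sub sorted_counts {
    my ($count_ref) = @_;
    return map  { join("\t", $_, $count_ref->{$_}) }
           sort { $count_ref->{$b} <=> $count_ref->{$a} || $a cmp $b }
           keys %$count_ref;
}

# The counts reported above, hard-coded for the demonstration.
my %count_seq = (
    TGCATGCGCTATTAGCG  => 2,
    ATGGCTATGGATCAGATT => 1,
    ATGGCTGGATATCGATT  => 3,
    ATGGCTATGGATCGATT  => 4,
    TGCATGCCCTATTAGCG  => 1,
);
print "$_\n" for sorted_counts(\%count_seq);
```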

Re^2: How to format fasta file
by andyBio (Novice) on Apr 10, 2016 at 20:58 UTC
    Thanks for the response. I tried the code but got an error:
    Use of uninitialized value $_ in scalar chomp at test.pl line 37.
    Use of uninitialized value $_ in pattern match (m//) at test.pl line 38.
    Use of uninitialized value $_ in numeric gt (>) at test.pl line 39.
    Use of uninitialized value $_ in numeric lt (<) at test.pl line 39.
    Lines 37 - 39 are:
    chomp;
    next if /^>/;                                  # discard headers
    next if length($_) > 30 or length($_) < 15;    # discard unwanted sizes
    Kindly assist. Thanks

      Assisting: the warnings all appear to point to a lack of content in $_, the default variable set by various operations. $_ is explicitly mentioned in the third of the lines you quoted and is implicit in the first two.

      Inserting a print "$_\n"; before the chomp will probably confirm or rebut my hypothesis. Then, if I'm on target, you can read up to find out what you expected to set the default variable.

      Andy, you had better change that '<DATA>' into '<STDIN>'; then you can pipe your file into the script:

      Windows:

      type fasta.txt | perl laurent_example.pl

      Unix:

      cat fasta.txt | perl laurent_example.pl

      And to get rid of your current error, just add this line after the chomp:

      next if($_=~/^\s*$/);

      Look, PerlMonks is not a code-writing service; we expect you to show some effort to learn programming.

      Hint: check on how to open a file instead of reading appended data like in Laurent's demonstration

      update

      ... and/or avoid empty lines after __DATA__ if you are REALLY only running the demo code!
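
As a rough illustration of the hint about opening a file, here is a minimal sketch. It assumes the file name is passed as the first command-line argument; the `open_input` helper is a made-up name, not something from the code posted in this thread. It checks the return value of `open` and reads with the `<$fh>` readline operator.

```perl
use strict;
use warnings;

# Hypothetical helper: opens a file for reading, or dies with the reason.
sub open_input {
    my ($path) = @_;
    open(my $fh, '<', $path)
        or die "Cannot open '$path': $!\n";
    return $fh;
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    my $fh = open_input($ARGV[0]);
    while (my $line = <$fh>) {    # <$fh> reads one line per iteration
        chomp $line;
        print "read: $line\n";
    }
    close $fh;
}
```

Checking `open` with `or die` matters here: if the file cannot be opened, the script stops with a clear message instead of failing later with warnings about undefined values.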

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Je suis Charlie!

      Hi andyBio,

      the program I posted did not have any of the errors or warnings that you report. This most probably means that the error is somewhere in the changes you made to the code.

      Although I have some ideas about what might be wrong in your program (along the lines of the hints supplied by other monks), we can't fix a program that we don't see. The only way we can help you is if you show the code you're now using, with the changes you made.

        Thanks a lot, Laurent R. Here is my complete code:
        #!/usr/bin/perl -w
        use strict;
        use warnings;

        my $num_args = @ARGV;
        if ($num_args != 3) {
            print "\nUsage: $0 (-q|-a) <species> <input.(fq|fa)>";
            print "\nUse -q for fastq and -a for fasta files";
            exit;
        }

        my $key=$ARGV[0];
        my $species=$ARGV[1];
        my $input=$ARGV[2];

        if ($key ne "-q" && $key ne "-a"){
            print "Unexpected option: $key, use -q or -a.\n";
            exit;
        }

        if($key eq "-q") {
            print "Option q was selected.\n";
            exit;
        }
        elsif( $key eq "-a" ) {
            print "Option a was selected.\n";
            my %count_seq;
            open(my $fh, '<:encoding(UTF-8)', $input);
            while ($fh) {
                chomp;
                next if /^>/;                                  # discard headers
                next if length($_) > 30 or length($_) < 15;    # discard unwanted sizes
                $count_seq{$_}++;                              # count occurrences
            }
            print "$_\t$count_seq{$_}\n" for keys %count_seq;
            exit;
        }
        And here is how I run it:
        perl test.pl -a spe try.txt > result
        It accepts either -q or -a as options. The error I get is this:
        Use of uninitialized value $_ in scalar chomp at test.pl line 39.
        Use of uninitialized value $_ in pattern match (m//) at test.pl line 40.
        Use of uninitialized value $_ in numeric gt (>) at test.pl line 41.
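
Judging only from the code as posted, one likely culprit is the loop condition. `while ($fh)` merely tests that the filehandle is a true value; it never reads a line, so `$_` is never set and the loop never ends. Reading requires the angle brackets: `while (<$fh>)`. Below is a minimal sketch of the corrected loop. To keep it runnable without an input file, it reads from an in-memory filehandle over made-up sample data; in the real script the handle would come from opening `$input`, ideally with `or die` to catch a failed open.

```perl
use strict;
use warnings;

# Made-up sample data standing in for the contents of the input file.
my $sample = ">dfbdbgf_356dfbdf\nATGGCTGGATATCGATT\n"
           . ">sdgthhr_478364df\nATGGCTATGGATCGATT\n";

# Open a read handle on the string (assumption: stands in for the real file).
open(my $fh, '<', \$sample) or die "Cannot open in-memory file: $!";

my %count_seq;
while (<$fh>) {                                    # <$fh> reads one line into $_
    chomp;
    next if /^>/;                                  # discard headers
    next if length($_) > 30 or length($_) < 15;    # discard unwanted sizes
    $count_seq{$_}++;                              # count occurrences
}
close $fh;

print "$_\t$count_seq{$_}\n" for sort keys %count_seq;
```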