GSperlbio has asked for the wisdom of the Perl Monks concerning the following question:

I have a file named seq.txt and i have hundreds of nucleotide sequences in fasta format each starting with symbol ">" i have to count no of > symbols to find how many seqences does that file has. can anyone write a perl code for that???

Replies are listed 'Best First'.
Re: Counting no of '>' symbol in a file
by shmem (Chancellor) on Aug 31, 2015 at 10:51 UTC
    can anyone write a perl code for that???

    Lots of people here can do that. Hopefully you can do in the near future! Read the perl docs to find out what the following does:

    perl -n0777 -e 'print s/>/>/g,"\n"' fastafile
    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
      Substitution? Matching should be enough:
      perl -lne '$c++ while />/g }{ print $c' fasta-file # or perl -lne '$c += () = />/g }{ print $c' fasta-file

      Moreover, in a fasta file, the > appears only once per line, and only as the very first character, so you can simplify that to

      perl -lne '$c++ if /^>/ }{ print $c' fasta-file
      لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
        Moreover, in a fasta file, the > appears only once per line, and only as the very first character

        well, then

        grep -c '^>' fasta-file

        should do.

        perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
        I have never seen '}{...' used instead of the END block suggested by the -n paragraph in perlrun. Is this an idiom that we can (or should) be using?
        Bill

        I was curious so I ran some benchmarks. The substitution approach holds its own against the matching approach and I find it easier to code from memory. But the grep() and tr/// options are very good too.

        This wasn't the most scientific comparison. I threw in a bunch of things I was interested in. Some approaches are faster if you slurp the file some if you do line by line and another if you make an array. Compared to reading the file off the disc any of these should be good enough for an average application.

        use warnings; use strict; use Benchmark qw( timethese cmpthese ); my $sample = ">gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus + maximus] LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX IENY >SEQUENCE_1 MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL "; foreach(1..10){ $sample .= $sample; } my @sample_array = split "\n", $sample; sub tr1_count_ar { return $sample =~ tr/>//; } sub tr2_count_ar { my $c; foreach(@sample_array) { $c += $_ =~ tr/>//; } return $c; } sub s1_count_ar { my $c; foreach(@sample_array) { $c += $_ =~ s/^>/>/g; } return $c; } sub s2_count_ar { return $sample =~ s/>/>/g; } sub m1_count_ar { my $c; $c += () = $sample =~ m/>/g; return $c; } sub m2_count_ar { my $c; foreach(@sample_array) { $c += () = $_ =~ m/^>/; } return $c; } sub m3_count_ar { my $c; foreach(@sample_array) { $c++ if $_ =~ m/^>/; } return $c; } sub g_count_ar { my $c; $c = scalar grep /^>/, @sample_array; return $c; } print "tr1: ", tr1_count_ar( ) ,"\n"; print "tr2: ", tr2_count_ar( ) ,"\n"; print "s1: ", s1_count_ar( ) ,"\n"; print "s2: ", s2_count_ar( ) ,"\n"; print "m1: ", m1_count_ar( ) ,"\n"; print "m2: ", m2_count_ar( ) ,"\n"; print "m3: ", m3_count_ar( ) ,"\n"; print "g: ", g_count_ar( ) ,"\n"; print '-'x80,"\n"; print join "\n", @sample_array[0..6]; print "\n",'-'x80,"\n"; my $results = timethese ( -10, { 'tr1' => 'tr1_count_ar', 'tr2' => 'tr2_count_ar', 's1' => 's1_count_ar', 's2' => 's2_count_ar', 'm1' => 'm1_count_ar', 'm2' => 'm2_count_ar', 'm3' => 'm3_count_ar', 'grep' => 'g_count_ar' } ); cmpthese($results);

        The results:

        tr1: 2048 tr2: 2048 s1: 2048 s2: 2048 m1: 2048 m2: 2048 m3: 2048 g: 2048 ---------------------------------------------------------------------- +---------- >gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus] LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX IENY >SEQUENCE_1 ---------------------------------------------------------------------- +---------- Benchmark: running grep, m1, m2, m3, s1, s2, tr1, tr2 for at least 10 +CPU seconds... grep: 11 wallclock secs (10.25 usr + 0.00 sys = 10.25 CPU) @ 89 +4.04/s (n=9163) m1: 11 wallclock secs (10.81 usr + 0.00 sys = 10.81 CPU) @ 72 +8.89/s (n=7880) m2: 10 wallclock secs (10.58 usr + 0.00 sys = 10.58 CPU) @ 38 +5.18/s (n=4074) m3: 11 wallclock secs (10.47 usr + 0.00 sys = 10.47 CPU) @ 59 +9.25/s (n=6273) s1: 11 wallclock secs (10.44 usr + 0.00 sys = 10.44 CPU) @ 43 +2.16/s (n=4510) s2: 11 wallclock secs (10.39 usr + 0.00 sys = 10.39 CPU) @ 92 +6.56/s (n=9626) tr1: 10 wallclock secs (10.52 usr + 0.00 sys = 10.52 CPU) @ 24 +73.89/s (n=26013) tr2: 11 wallclock secs (10.64 usr + 0.00 sys = 10.64 CPU) @ 58 +1.26/s (n=6184) Rate m2 s1 tr2 m3 m1 grep s2 tr1 m2 385/s -- -11% -34% -36% -47% -57% -58% -84% s1 432/s 12% -- -26% -28% -41% -52% -53% -83% tr2 581/s 51% 35% -- -3% -20% -35% -37% -77% m3 599/s 56% 39% 3% -- -18% -33% -35% -76% m1 729/s 89% 69% 25% 22% -- -18% -21% -71% grep 894/s 132% 107% 54% 49% 23% -- -4% -64% s2 927/s 141% 114% 59% 55% 27% 4% -- -63% tr1 2474/s 542% 472% 326% 313% 239% 177% 167% --
Re: Counting no of '>' symbol in a file
by johngg (Canon) on Aug 31, 2015 at 14:24 UTC

    Instead of substitution or matching you could use tr.

    $ perl -Mstrict -Mwarnings -E ' open my $fastaFH, q{<}, \ <<EOF or die $!; > Seq 1 CTGTCA GTCCTC > Seq 2 GGCTA > Seq 3 CATGAG ATTCG EOF my $seqCt = 0; $seqCt += tr{>}{} while <$fastaFH>; say $seqCt;' 3 $

    I hope this is of interest.

    Cheers,

    JohnGG

Re: Counting no of '>' symbol in a file
by stevieb (Canon) on Aug 31, 2015 at 14:39 UTC

    Because you're counting the number of lines and not the character count per line (per choroba stating that a fasta file will only have one '>' per line), grep will work.

    perl -E 'say scalar grep />/, <>' file
Re: Counting no of '>' symbol in a file
by ravi45722 (Pilgrim) on Aug 31, 2015 at 11:25 UTC

    As shmem guided read the document of s///g. Hope you are trying for this

    #!/usr/bin/perl use strict; use warnings; open (FILE,"test.txt"); my $total=0; while (<FILE>) { $total=$total+s/>/>/g; } print $total;

    GIVE A MAN A FISH.

Re: Counting no of '>' symbol in a file
by locked_user sundialsvc4 (Abbot) on Aug 31, 2015 at 12:40 UTC

    To elaborate a little as to what is going on here:

    (1)   The g modifier (the /g at the end of the regex ...) refer to global matching as described in this perldoc:   perlre.   These cause the regex to be applied repeatedly to the same string.   (Also see perlretut, an excellent tutorial.)

    (2)   The while loops, as written, are using an implied variable ($_ ... see perlvar) which contains the most recent line read from the file.   The regular expression, also, implicitly matches to that variable since no other variable was referred-to in the statement.   By design, Perl allows you to say things with an economy of characters . . .

    Although “TMTOWTDI = There’s More Than One Way To Do It,™” I also share the opinion that substitution isn’t necessary ... merely matching, and counting the number of matches found, as demonstrated above by choroba’s post.

    HTH ...

Re: Counting no of '>' symbol in a file
by ambrus (Abbot) on Sep 01, 2015 at 07:30 UTC