Peter Keystrokes has asked for the wisdom of the Perl Monks concerning the following question:

I have a file which contains several 100s of fasta sequences in the form as such:
>hsa_circ_0000001|chr1:1080738-1080845-|None|None ATGGGGTTGGGTCAGCCGTGCGGTCAGGTCAGGTCGGCCATGAGGTCAGGTGGGGTCGGCCATGAAGGTG +GTGGGGGTCATGAGGTCACAAGGGGGTCGGCCATGTG >hsa_circ_0000002|chr1:1158623-1159348-|NM_016176|SDF4 GGTGGATGTGAACACTGACCGGAAGATCAGTGCCAAGGAGATGCAGCGCTGGATCATGGAGAAGACGGCC +GAGCACTTCCAGGAGGCCATGGAGGAGAGCAAGACACACTTCCGCGCCGTGGACCCTGACGGGGACGGT +CACGTGTCTTGGGACGAGTATAAGGTGAAGTTTTTGGCGAGTAAAGGCCATAGCGAGAAGGAGGTTGCC +GACGCCATCAGGCTCAACGAGGAACTCAAAGTGGATGAGGAAA
I wrote a script that counts the sequence length (bases) in each sequence and creates a file containing them all, but the problem is that it also adds the new line character into the count making it count+1. Is there a way to -1 from the count in order to obtain the true count of the sequence length? My script is as follows:
my $filename = 'counts.txt'; open (my $fh, '>', $filename) or die "Could not open '$filename' $!"; my $count = ""; while (my $line = <>){ if ($line =~ /^>hsa/){ $line = <>; $count .= length$line; $count .= " "; } } print $fh $count; close $fh;

Replies are listed 'Best First'.
Re: How to amend character count.
by Anonymous Monk on May 06, 2017 at 09:57 UTC
    chomp $line;
      Thank you, sir.
Re: How to amend character count.
by Anonymous Monk on May 06, 2017 at 12:34 UTC
    In addition to chomp, you might enjoy:
    $count = $line =~ tr/ACGT//; # count only the ACGT characters in $lin +e $line =~ tr/ACGT//cd; # remove all non-ACGT characters from $l +ine
Re: How to amend character count.
by AnomalousMonk (Archbishop) on May 07, 2017 at 07:31 UTC

    Ok guys, listen up. The implicit specification of the data given in the OP is that the base sequence to be counted is given in a single line of arbitrary length. The next specification bump will be to an arbitrary number of lines of arbitrary length for the base sequence. Be prepared.


    Give a man a fish:  <%-{-{-{-<

Re: How to amend character count.
by thanos1983 (Parson) on May 06, 2017 at 16:17 UTC

    Hello Peter Keystrokes,

    I notices a few things on your code.

    You have open (my $fh, '>', $filename) or die "Could not open '$filename' $!";. From the open documentation:

    Simple examples to open a file for reading: open(my $fh, "<", "input.txt") or die "Can't open < input.txt: $!"; and for writing: open(my $fh, ">", "output.txt") or die "Can't open > output.txt: $!";

    Also on your while loop you have while (my $line = <>){ without the file handler. From How can I read in an entire file all at once?:

    while (<$input>) { chomp; # do something with $_ }

    At the end of your code you are trying to write to a file print $fh $count;. You have opened the file in writing mode from the beginning but I am wondering how you are reading the file.

    So I assume you meant to write something like that?

    #!/usr/bin/perl use strict; use warnings; my $filename = 'counts.txt'; open (my $fh, '+<', $filename) or die "Could not open '$filename' $!"; my $count = ""; while (<$fh>) { chomp; if ($_ =~ /^>hsa/) { $count .= length $_; $count .= " "; } } print $fh $count . "\n"; close $fh or die "Could not close '$filename' $!"; __END__ >hsa_circ_0000001|chr1:1080738-1080845-|None|None ATGGGGTTGGGTCAGCCGTGCGGTCAGGTCAGGTCGGCCATGAGGTCAGGTGGGGTCGGCCATGAAGGTG +GTGGGGGTCATGAGGTCACAAGGGGGTCGGCCATGTG >hsa_circ_0000002|chr1:1158623-1159348-|NM_016176|SDF4 GGTGGATGTGAACACTGACCGGAAGATCAGTGCCAAGGAGATGCAGCGCTGGATCATGGAGAAGACGGCC +GAGCACTTCCAGGAGGCCATGGAGGAGAGCAAGACACACTTCCGCGCCGTGGACCCTGACGGGGACGGT +CACGTGTCTTGGGACGAGTATAAGGTGAAGTTTTTGGCGAGTAAAGGCCATAGCGAGAAGGAGGTTGCC +GACGCCATCAGGCTCAACGAGGAACTCAAAGTGGATGAGGAAA 49 54

    Notice the open (my $fh, '+<', $filename) or die "Could not open '$filename' $!";. On this mode you can read and write to the file.

    I do not know exactly why you want to read and write to your file, for me it would made more sense to read and print the output based on characters that they where counted but if this is what you desire.

    I would prefer to do it like this, just personal preference nothing special.

    #!/usr/bin/perl use strict; use warnings; my $filename = 'counts.txt'; open (my $fh, '+<', $filename) or die "Could not open '$filename' $!"; chomp(my @lines = <$fh>); for (@lines) { print $fh length($_) . " " if ($_ =~ /^>hsa/); } close $fh or die "Could not close '$filename' $!"; __END__ >hsa_circ_0000001|chr1:1080738-1080845-|None|None ATGGGGTTGGGTCAGCCGTGCGGTCAGGTCAGGTCGGCCATGAGGTCAGGTGGGGTCGGCCATGAAGGTG +GTGGGGGTCATGAGGTCACAAGGGGGTCGGCCATGTG >hsa_circ_0000002|chr1:1158623-1159348-|NM_016176|SDF4 GGTGGATGTGAACACTGACCGGAAGATCAGTGCCAAGGAGATGCAGCGCTGGATCATGGAGAAGACGGCC +GAGCACTTCCAGGAGGCCATGGAGGAGAGCAAGACACACTTCCGCGCCGTGGACCCTGACGGGGACGGT +CACGTGTCTTGGGACGAGTATAAGGTGAAGTTTTTGGCGAGTAAAGGCCATAGCGAGAAGGAGGTTGCC +GACGCCATCAGGCTCAACGAGGAACTCAAAGTGGATGAGGAAA 49 54

    Update 3: Thanks to Anonymous Monk, AnomalousMonk, and Laurent_R I finally understand my mistakes and this is the updated recommended version.

    #!/usr/bin/perl use strict; use warnings; my $count; while (<>) { if ($_ =~ /^>hsa/){ chomp (my $line = <>); $count .= length($line); $count .= " "; print "Count final: " . $count . "\n"; } } continue { close ARGV if eof; # reset $. each file } my $filename = 'counts.txt'; open (my $fh, '>', $filename) or die "Could not open '$filename' $!"; print $fh $count; close $fh or die "Could not close '$filename' $!"; __END__ $ perl test.pl test.txt Count final: 107 Count final: 107 251

    And the way that I would try to resolve it.

    #!/usr/bin/perl use strict; use warnings; my $filename = 'counts.txt'; open (my $fh, '>', $filename) or die "Could not open '$filename' $!"; chomp(my @lines = <>); for (@lines) { next if ($_ =~ /^>hsa/); print $fh length($_) . " "; } print $fh "\n"; close $fh or die "Could not close '$filename' $!"; __END__ $ cat counts.txt 107 251

    Update: 4 Even better as Anonymous Monk point it out:

    #!/usr/bin/perl use strict; use warnings; my $filename = 'counts.txt'; open (my $fh, '>', $filename) or die "Could not open '$filename' $!"; chomp(my @lines = <>); for (@lines) { print $fh length($_) . " " if ($_ =~ tr/ACGT//); } print $fh "\n"; close $fh or die "Could not close '$filename' $!"; __END__ $ cat counts.txt 107 251

    Hope this helps.

    Seeking for Perl wisdom...on the process of learning...not there...yet!

      Hi thanos1983. Some comments on your post which I hope may avoid future confusion.

      You have open (my $fh, '>', $filename) or die "Could not open '$filename' $!";. From the open documentation ...

      Except for the use of  $filename in the OPed code instead of repeated literal strings (which I think is good practice in the interest of DRYness) and the inclusion of the mode in the error message, the OPed code is essentially the same as the documented examples. I don't understand the point you want to make.

      Also on your while loop you have while (my $line = <>){ without the file handler.

      I think that Peter Keystrokes wants to specify the input file on the command line, e.g.
          perl script.pl input.file
      IOW, to read and write to separate files rather than appending to the input file as your final code example does | both your code examples do.

      while (<$fh>) { chomp; if ($_ =~ /^>hsa/) { $count .= length $_; $count .= " "; } }
      Update:
      for (@lines) { print $fh length($_) . " " if ($_ =~ /^>hsa/); }

      This loop counts | These loops count the characters in the lines beginning  '>hsa' rather than in the single string of the  'ACTG...' line that follows. The counts of 49 and 54 characters, respectively, for the two example data records reflect this. (And again, these counts are appended to the end of the input file rather than written to a separate file.)

      Please believe that these comments are not meant to "pile on," but are only offered in the hope of smoothing the way a bit for future Seekers of Perl Wisdom.


      Give a man a fish:  <%-{-{-{-<

        Hello AnomalousMonk,

        No worries no problems, I am always open to read comments from monks which have greater knowledge.

        I am just commending to avoid confusion, OP on the open function has write mode > instead of read mode <.

        Regarding:

        I think that Peter Keystrokes wants to specify the input file on the command line, e.g.

        OP defined a file my $filename = 'counts.txt';, so I assumed that he is defining the file through the script not through the command line. But never the less I was not aware of Using the magic ARGV handle - <> in perlop. as anonymous point out. Thanks for the tip. :)

        Regarding:

        This loop counts the characters in the lines beginning '>hsa' rather than in the single string of the 'ACTG...' line that follows. The counts of 49 and 54 characters, respectively, for the two example data records reflect this. (And again, these counts are appended to the end of the input file rather than written to a separate file.)

        I know some how that my results is wrong but I still can not understand why I can not link the lines that start with >hsa and why it only captures the ATG.... Can you provide me a bit more information? (Thanks in advance for your time and effort)

        Seeking for Perl wisdom...on the process of learning...not there...yet!
        Thank you guys I am trying to absorb all your feedback (back and forth) as I'm pretty much a novice at this. I have been trawling the web taking bits of code and trying to make something work albeit not perfectly. At least here I can try to refine what little I have learnt. I appreciate this.
      I am wondering how you are reading the file.

      Using the magic ARGV handle - <> in perlop.

      I do not know exactly why you want to read and write to your file

      OP never said they wanted to.

      Your code produces different answers from the OP's, the correct answer is "107 251 ".

      for(@lines) has no advantage over while (<>) here, it's slower and takes more memory.

        Hello Anonymous Monk,

        I had no clue about that, the only reason that I would use like this is to save some coding lines.

        But if while (<>) is faster then the obvious solution would be this.

        Thanks for providing this information.

        Seeking for Perl wisdom...on the process of learning...not there...yet!