JediGorf has asked for the wisdom of the Perl Monks concerning the following question:

My code below is resulting in missing (letter) characters from the lines of my input file when I output them to $seq. Any ideas how to fix this? My input file looks like:

>symbB.v1.2.017277.t1|scaffold1325.1|size176917|3

acggaccgcggcatttgccaatttgcgcgtcgtcgggggtcgccatgatgtttcgcttggcaggcttttttgctttggcactgctggtcgcgggaaagcc caagggtggcaaaggtgcaaaaggagaacaagaccccttctctgagcttagccgcctcgcagacaatttgaaagatgctaaagaacagccggagaaggcc aagaatgctctgaacatgatggatccagaaagtttaggcgattctatggccaacatgatggtgatggcaatggataaggaccaggatggtgtgttgtcag

open ($input, $ARGV[0]) or die ("Could not open input file $ARGV[0].\n");
while ($line = <$input>) {
    chomp($line);
    unless ($line =~ m/>/) {
        $line = uc($line);
        $seq .= $line;
    }
}

Re: missing character when reading input file
by Eily (Monsignor) on Sep 11, 2018 at 15:38 UTC

    At first it looked like there were several lines of data (ignoring the first two lines), but there is only one, and the pieces of sequence are actually on the same line, separated by spaces. So if what you call "missing characters" are actually spaces in the output, that's because they are already there in the input.

    If that's not the issue, you'll have to show the expected output and the one you actually get.

      As an example:

      Input of first line: acggaccgcggcatttgccaatttgcgcgtcgtcgggggtcgccatgatgtttcgcttggcaggcttttttgctttggcactgctggtcgcgggaaagcc

      Output of first line: ACGGACCGCGGCATTTGCCAATTTGCGCGTCGTCGGGGGTCGCCATGATGTTTCGCTTGGCAGGCTTTTTTGCTTTGGCA

      So the sequence: ctgctggtcgcgggaaagcc is completely gone from my output. Interestingly, the output is 80 chars while the input is 100, so the missing chunk is exactly 20. Is Perl somehow limited to only 80 chars in a line of text?

        Still not a lot of info to go on, but let me suggest that you have "invisible" characters lurking in your input:

        c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
        "my $seq = qq{what is wrong with this\x0dline};
         print $seq;
         dd $seq;
        "
        line is wrong with this
        "what is wrong with this\rline"
        Some kind of data dump (Data::Dumper, Data::Dump) might be informative. (Data::Dumper is core and so should already be installed.)


        Give a man a fish:  <%-{-{-{-<
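The data-dump suggestion above can be sketched with the core Data::Dumper module. This is only a minimal illustration: the helper name `show_hidden` and the sample string are made up for the example, and it assumes the stray character is a carriage return.

```perl
use strict;
use warnings;
use Data::Dumper;

# Useqq makes Data::Dumper print control characters as escapes such as \r.
$Data::Dumper::Useqq = 1;
# Terse drops the "$VAR1 = " prefix so only the quoted string is shown.
$Data::Dumper::Terse = 1;

# Hypothetical helper: return a printable dump of one string.
sub show_hidden {
    my ($text) = @_;
    my $dump = Dumper($text);
    chomp $dump;
    return $dump;
}

# Hypothetical sample: a sequence line that still carries a carriage return.
my $line = "acggaccgcg\r";
print show_hidden($line), "\n";
```

Dumping each `$line` right after `chomp` in the original loop would show whether a `\r` is still attached to it.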

        Perl has no such limitation, so it comes from the way you display your output. I'm guessing this is on a terminal? If so, try redirecting it to a file instead.

        Further to Eily's post:

        80 chars really sounds like a display limit ...

        JediGorf:   Is the truncation always at 80 characters?



      With a space between the "lines" instead of a carriage return, is Perl not getting each "line" from: while ($line = <$input>) { }? Never mind. When I look at the original input file that I'm using, those are carriage returns at the ends of the lines.

        Is that the complete input file? If not, how long is the last line?

        poj
        When I look at the original input file ... those are carriage returns at the ends of the lines.

        Is it literally true that you are seeing "carriage return" (ASCII 0x0d) characters at the ends of your lines? If so, that may be a big part of your problem. On what sort of system do the files originate? Windows? On what sort of system are you processing the files? *nix? Line-ending delimiters are different on these two systems (and on others).
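One way to answer the line-ending question is to count the endings in the raw bytes of the file. The sketch below is an assumption-laden illustration, not something posted in the thread: `count_line_endings` is a hypothetical helper, and the sample string stands in for the real file contents.

```perl
use strict;
use warnings;

# Hypothetical helper: count CRLF and bare-LF line endings in a string of raw bytes.
sub count_line_endings {
    my ($bytes) = @_;
    my $crlf = () = $bytes =~ /\r\n/g;         # Windows-style endings
    my $lf   = () = $bytes =~ /(?<!\r)\n/g;    # LF not preceded by CR
    return ( crlf => $crlf, lf => $lf );
}

# In-memory sample standing in for the contents of a real file.
my $sample = ">header\r\nacgt\r\nttga\r\n";
my %count  = count_line_endings($sample);
print "CRLF: $count{crlf}, LF only: $count{lf}\n";
```

For a real file, the contents would need to be read through a handle opened with the `:raw` layer (for example `open my $fh, '<:raw', $path`), so that no layer translates the line endings before they are counted.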



      Input:

      >symbB.v1.2.017277.t1|scaffold1325.1|size176917|3
      acggaccgcggcatttgccaatttgcgcgtcgtcgggggtcgccatgatgtttcgcttggcaggcttttttgctttggcactgctggtcgcgggaaagcc caagggtggcaaaggtgcaaaaggagaacaagaccccttctctgagcttagccgcctcgcagacaatttgaaagatgctaaagaacagccggagaaggcc aagaatgctctgaacatgatggatccagaaagtttaggcgattctatggccaacatgatggtgatggcaatggataaggaccaggatggtgtgttgtcag aggaggagattgccaccatggtacaatcgggagagacggagaacaaaggcaaagcagaggagatgtttgaacagatggatgaagatggcgatggagaggt aaccagagacgaagcgaaggtgtacttttcaaaactaggaaacaccttgcaaggcctttcaaaaatgatgggtggttcaaaatcagagttgtgattggga gaatcgctctgtcaaccgcctgcggtggtgcggttggaaggttgagcttgaaaggtgcgaagcgtctctccattggtgtccgagatagcctgagatagcc tgagatatttaggtgatactgtatcttcttgggttttcggatgcaaatttttgacaacagatcagaaatccatccgaatatcccggccgccagggcaaaa atcatagagtttccctgtgcagaagctgcaagttttgagtttttctcattccattggggcctgaatttcaagaaaatcgtatagtctta

      Output:

      ACGGACCGCGGCATTTGCCAATTTGCGCGTCGTCGGGGGTCGCCATGATGTTTCGCTTGGCAGGCTTTTTTGCTTTGGCACTGCTGGTCGCGGGAAAGCC CAAGGGTGGCAAAGGTGCAAAAGGAGAACAAGACCCCTTCTCTGAGCTTAGCCGCCTCGCAGACAATTTGAAAGATGCTAAAGAACAGCCGGAGAAGGCC AAGAATGCTCTGAACATGATGGATCCAGAAAGTTTAGGCGATTCTATGGCCAACATGATGGTGATGGCAATGGATAAGGACCAGGATGGTGTGTTGTCAG AGGAGGAGATTGCCACCATGGTACAATCGGGAGAGACGGAGAACAAAGGCAAAGCAGAGGAGATGTTTGAACAGATGGATGAAGATGGCGATGGAGAGGT AACCAGAGACGAAGCGAAGGTGTACTTTTCAAAACTAGGAAACACCTTGCAAGGCCTTTCAAAAATGATGGGTGGTTCAAAATCAGAGTTGTGATTGGGA GAATCGCTCTGTCAACCGCCTGCGGTGGTGCGGTTGGAAGGTTGAGCTTGAAAGGTGCGAAGCGTCTCTCCATTGGTGTCCGAGATAGCCTGAGATAGCC TGAGATATTTAGGTGATACTGTATCTTCTTGGGTTTTCGGATGCAAATTTTTGACAACAGATCAGAAATCCATCCGAATATCCCGGCCGCCAGGGCAAAA ATCATAGAGTTTCCCTGTGCAGAAGCTGCAAGTTTTGAGTTTTTCTCATTCCATTGGGGCCTGAATTTCAAGAAAATCGTATAGTCTTA CGGACCGCGGCATTTGCCAATTTGCGCGTCGTCGGGGGTCGCCATGATGTTTCGCTTGGCAGGCTTTTTTGCTTTGGCACTGCTGGTCGCGGGAAAGCC CAAGGGTGGCAAAGGTGCAAAAGGAGAACAAGACCCCTTCTCTGAGCTTAGCCGCCTCGCAGACAATTTGAAAGATGCTAAAGAACAGCCGGAGAAGGCC AAGAATGCTCTGAACATGATGGATCCAGAAAGTTTAGGCGATTCTATGGCCAACATGATGGTGATGGCAATGGATAAGGACCAGGATGGTGTGTTGTCAG AGGAGGAGATTGCCACCATGGTACAATCGGGAGAGACGGAGAACAAAGGCAAAGCAGAGGAGATGTTTGAACAGATGGATGAAGATGGCGATGGAGAGGT AACCAGAGACGAAGCGAAGGTGTACTTTTCAAAACTAGGAAACACCTTGCAAGGCCTTTCAAAAATGATGGGTGGTTCAAAATCAGAGTTGTGATTGGGA GAATCGCTCTGTCAACCGCCTGCGGTGGTGCGGTTGGAAGGTTGAGCTTGAAAGGTGCGAAGCGTCTCTCCATTGGTGTCCGAGATAGCCTGAGATAGCC TGAGATATTTAGGTGATACTGTATCTTCTTGGGTTTTCGGATGCAAATTTTTGACAACAGATCAGAAATCCATCCGAATATCCCGGCCGCCAGGGCAAAA ATCATAGAGTTTCCCTGTGCAGAAGCTGCAAGTTTTGAGTTTTTCTCATTCCATTGGGGCCTGAATTTCAAGAAAATCGTATAGTCTTA GGACCGCGGCATTTGCCAATTTGCGCGTCGTCGGGGGTCGCCATGATGTTTCGCTTGGCAGGCTTTTTTGCTTTGGCACTGCTGGTCGCGGGAAAGCC CAAGGGTGGCAAAGGTGCAAAAGGAGAACAAGACCCCTTCTCTGAGCTTAGCCGCCTCGCAGACAATTTGAAAGATGCTAAAGAACAGCCGGAGAAGGCC AAGAATGCTCTGAACATGATGGATCCAGAAAGTTTAGGCGATTCTATGGCCAACATGATGGTGATGGCAATGGATAAGGACCAGGATGGTGTGTTGTCAG 
AGGAGGAGATTGCCACCATGGTACAATCGGGAGAGACGGAGAACAAAGGCAAAGCAGAGGAGATGTTTGAACAGATGGATGAAGATGGCGATGGAGAGGT AACCAGAGACGAAGCGAAGGTGTACTTTTCAAAACTAGGAAACACCTTGCAAGGCCTTTCAAAAATGATGGGTGGTTCAAAATCAGAGTTGTGATTGGGA GAATCGCTCTGTCAACCGCCTGCGGTGGTGCGGTTGGAAGGTTGAGCTTGAAAGGTGCGAAGCGTCTCTCCATTGGTGTCCGAGATAGCCTGAGATAGCC TGAGATATTTAGGTGATACTGTATCTTCTTGGGTTTTCGGATGCAAATTTTTGACAACAGATCAGAAATCCATCCGAATATCCCGGCCGCCAGGGCAAAA ATCATAGAGTTTCCCTGTGCAGAAGCTGCAAGTTTTGAGTTTTTCTCATTCCATTGGGGCCTGAATTTCAAGAAAATCGTATAGTCTTA

      open ($input, $ARGV[0]) or die ("Could not open input file $ARGV[0].\n");
      open ($output, '>output.txt') or die ("Could not open input file output.txt.\n");
      while ($line = <$input>) {
          chomp($line);
          unless ($line =~ m/>/) {
              $line = uc($line);
              $seq .= $line;
          }
      }
      $orf1 = $seq;
      $orf2 = substr($seq, 1);
      $orf3 = substr($seq, 2);
      OK, so the problem is a little different than what I thought earlier. What I'm trying to get is for $seq to be a continuous string of letters, but as you can see, when I use substr it deletes characters on the first line like it should but doesn't shift the other lines (i.e. the GCC on the first line should connect right to the CAA that begins the second line of the output without any spaces or newlines). I'm guessing this is some sort of newline issue? But I used chomp, so I'm confused.
        Added this $line =~ s/\r//; and it fixed the issue. Newlines/carriage returns can be a huge pain.
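A slightly stricter variant of that fix removes the whole line ending in one step, so that only a carriage return at the end of the line is touched. This is a sketch under assumptions: `strip_eol` is a hypothetical helper, the sample data is invented, and header lines are assumed to start with `>`.

```perl
use strict;
use warnings;

# Hypothetical helper: remove one trailing line ending, whether LF or CRLF.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    return $line;
}

# In-memory sample with Windows line endings, standing in for the input file.
my $data = ">header\r\nacgg\r\ncaag\r\n";
open( my $in, '<', \$data ) or die "Could not open in-memory file: $!\n";

my $seq = '';
while ( my $line = <$in> ) {
    $line = strip_eol($line);
    next if $line =~ /^>/;    # skip header lines (assumed to start with '>')
    $seq .= uc $line;
}
close $in;

print "$seq\n";    # prints ACGGCAAG
```

Because the pattern makes the `\r` optional, the same loop also works unchanged on files that have plain LF line endings.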
Re: missing character when reading input file
by BillKSmith (Monsignor) on Sep 12, 2018 at 04:03 UTC
    It now appears that your data file was created on a Windows system and you are reading it on a Unix-like system. Perl can handle this, but you have to tell it so by using IO layers in the open statement. (Refer to perlio.)
    use strict;
    use warnings;

    BEGIN {
        my $win_file    # memory file simulates windows file.
            = ">symbB.v1.2.017277.t1|scaffold1325.1|size176917|3\r\n"
            . "acggaccgcggcatttgccaatttgcgcgt"
            . "cgtcgggggtcgccatgatgtttcgcttgg"
            . "caggcttttttgctttggcactgctggtcg"
            . "cgggaaagcc\r\n"
            . "caagggtggcaaaggtgcaaaaggagaaca"
            . "agaccccttctctgagcttagccgcctcgc"
            . "agacaatttgaaagatgctaaagaacagcc"
            . "ggagaaggcc\r\n"
            . "aagaatgctctgaacatgatggatccagaa"
            . "agtttaggcgattctatggccaacatgatg"
            . "gtgatggcaatggataaggaccaggatggt"
            . "gtgttgtcag\r\n"
            ;
        $ARGV[0] = \do{$win_file};
    }

    open( my $input, '<:crlf', $ARGV[0] )
        or die( "Could not open input file $ARGV[0].\n" );

    my $seq;
    while ( my $line = <$input> ) {
        chomp($line);
        unless ( $line =~ m/>/ ) {
            $line = uc($line);
            $seq .= $line;
        }
    }
    print "Length of \$seq is ", length($seq), " characters\n";
    Bill

      I agree with your example of using open to handle a CRLFish file on a *nix system. The rest of this post is just to satisfy my curiosity.

      $ARGV[0] = \do{$win_file};

      I don't understand the purpose of munging @ARGV in this way. Any scalar can be opened by reference as a RAM file. If the reason for initializing the scalar in a BEGIN block was to create a lexically private scalar, then assigning a reference to it to an element of the global @ARGV array defeats this purpose.



        The array @ARGV was used to preserve as much of the original OP as possible. The 'BEGIN' is unnecessary, but I feel that it serves to separate the file simulation from the relevant code. No excuse for the 'do' block. It is a leftover from an earlier attempt at the file simulation.
        Bill
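The point raised in this exchange, that a scalar can be opened by reference without going through @ARGV, a BEGIN block, or a do block, might look like the following. It is a minimal sketch: the sample data is invented and much shorter than the simulated file in the post above.

```perl
use strict;
use warnings;

# Hypothetical sample data with CRLF line endings.
my $win_file = ">header\r\nacgg\r\ncaag\r\n";

# Open a reference to the scalar directly as an in-memory file.
open( my $input, '<:crlf', \$win_file )
    or die "Could not open in-memory file: $!\n";

my @lines;
while ( my $line = <$input> ) {
    chomp $line;    # the :crlf layer has already turned "\r\n" into "\n"
    push @lines, $line;
}
close $input;

print scalar(@lines), " lines, last one is '$lines[-1]'\n";
```

Keeping `$win_file` as an ordinary lexical means nothing global is modified, at the cost of no longer mirroring the original `$ARGV[0]` interface.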

      hey that's a cool new trick I learned today, setting $ARGV[0] = \do{$win_file};

      I like your method as an easy way to produce a CRLF file on a non-Windows machine (e.g. for tests).