in reply to missing character when reading input file

At first it looked like there were several lines of data (ignoring the first two lines), but there is only one, and the pieces of sequence are actually on the same line, separated by spaces. So if what you call "missing characters" are actually spaces in the output, that's because they are already there in the input.

If that's not the issue, you'll have to show the expected output and the one you actually get.

Re^2: missing character when reading input file
by JediGorf (Initiate) on Sep 11, 2018 at 15:47 UTC
    As an example:

    Input of first line: acggaccgcggcatttgccaatttgcgcgtcgtcgggggtcgccatgatgtttcgcttggcaggcttttttgctttggcactgctggtcgcgggaaagcc

    Output of first line: ACGGACCGCGGCATTTGCCAATTTGCGCGTCGTCGGGGGTCGCCATGATGTTTCGCTTGGCAGGCTTTTTTGCTTTGGCA

    So the sequence: ctgctggtcgcgggaaagcc is completely gone from my output. Interestingly, the output is 80 chars while the input is 100, so the missing chunk is exactly 20. Is Perl somehow limited to only 80 chars in a line of text?

      Still not a lot of info to go on, but let me suggest you have "invisible" characters lurking in your input:

      c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
        "my $seq = qq{what is wrong with this\x0dline};
         print $seq;
         dd $seq;
        "
      line is wrong with this
      "what is wrong with this\rline"

      Some kind of data dump (Data::Dumper, Data::Dump) might be informative. (Data::Dumper is core and so should already be installed.)


      Give a man a fish:  <%-{-{-{-<

        For this to work with Data::Dumper you need to set $Data::Dumper::Useqq = 1; Otherwise the \r isn't replaced by visible chars.

        perl -MData::Dumper -E "print Dumper qq<a\rb>"
        b';R1 = 'a
        perl -MData::Dumper -E "$Data::Dumper::Useqq = 1; print Dumper qq<a\rb>"
        $VAR1 = "a\rb";

        One of the main reasons to prefer Data::Dump over Data::Dumper for debugging :)

        80 chars really sounds like a display limit though, rather than an invisible char.

      Perl has no such limitation, so the truncation must come from the way you display your output. I'm guessing this is on a terminal? If so, try redirecting the output to a file instead.

      Further to Eily's post:

      80 chars really sounds like a display limit ...

      JediGorf:   Is the truncation always at 80 characters?



Re^2: missing character when reading input file
by JediGorf (Initiate) on Sep 11, 2018 at 15:53 UTC
    With a space between the "lines" instead of a carriage return, is Perl not getting each "line" from: while ($line = <$input>) { }? Never mind. When I look at the original input file that I'm using, those are carriage returns at the ends of the lines.

      Is that the complete input file? If not, how long is the last line?

      poj
      When I look at the original input file ... those are carriage returns at the ends of the lines.

      Is it literally true that you are seeing "carriage return" (ASCII 0x0d) characters at the ends of your lines? If so, that may be a big part of your problem. On what sort of system do the files originate? Windows? On what sort of system are you processing the files? *nix? Line-ending delimiters are different on these two systems (and on others).
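One way to check what is really at the ends of the lines is to count each line-ending style in the raw bytes. The following is only a minimal sketch: `count_line_endings` is a hypothetical helper name, and the sample string stands in for file content that would normally be slurped from a handle opened with the `:raw` layer (so that no CRLF translation hides the carriage returns).

```perl
use strict;
use warnings;

# Count each line-ending style in a string of raw file content.
# (Hypothetical helper, not part of the original thread.)
sub count_line_endings {
    my ($raw) = @_;
    my %count;
    $count{crlf} = () = $raw =~ /\r\n/g;       # Windows-style pairs
    $count{cr}   = () = $raw =~ /\r(?!\n)/g;   # bare carriage returns
    $count{lf}   = () = $raw =~ /(?<!\r)\n/g;  # bare line feeds
    return \%count;
}

# Assumed sample data; for a real file, read it via
#   open my $fh, '<:raw', $path or die "Cannot open $path: $!";
my $sample = "acgt\r\nttga\r\ncc\n";
my $c = count_line_endings($sample);
printf "CRLF=%d CR=%d LF=%d\n", @{$c}{qw(crlf cr lf)};
```

A non-zero CRLF or CR count on a *nix system would suggest that chomp alone leaves a "\r" behind on every line.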



Re^2: missing character when reading input file
by JediGorf (Initiate) on Sep 12, 2018 at 00:33 UTC
    Input:

    >symbB.v1.2.017277.t1|scaffold1325.1|size176917|3
    acggaccgcggcatttgccaatttgcgcgtcgtcgggggtcgccatgatgtttcgcttggcaggcttttttgctttggcactgctggtcgcgggaaagcc caagggtggcaaaggtgcaaaaggagaacaagaccccttctctgagcttagccgcctcgcagacaatttgaaagatgctaaagaacagccggagaaggcc aagaatgctctgaacatgatggatccagaaagtttaggcgattctatggccaacatgatggtgatggcaatggataaggaccaggatggtgtgttgtcag aggaggagattgccaccatggtacaatcgggagagacggagaacaaaggcaaagcagaggagatgtttgaacagatggatgaagatggcgatggagaggt aaccagagacgaagcgaaggtgtacttttcaaaactaggaaacaccttgcaaggcctttcaaaaatgatgggtggttcaaaatcagagttgtgattggga gaatcgctctgtcaaccgcctgcggtggtgcggttggaaggttgagcttgaaaggtgcgaagcgtctctccattggtgtccgagatagcctgagatagcc tgagatatttaggtgatactgtatcttcttgggttttcggatgcaaatttttgacaacagatcagaaatccatccgaatatcccggccgccagggcaaaa atcatagagtttccctgtgcagaagctgcaagttttgagtttttctcattccattggggcctgaatttcaagaaaatcgtatagtctta

    Output:

    ACGGACCGCGGCATTTGCCAATTTGCGCGTCGTCGGGGGTCGCCATGATGTTTCGCTTGGCAGGCTTTTTTGCTTTGGCACTGCTGGTCGCGGGAAAGCC CAAGGGTGGCAAAGGTGCAAAAGGAGAACAAGACCCCTTCTCTGAGCTTAGCCGCCTCGCAGACAATTTGAAAGATGCTAAAGAACAGCCGGAGAAGGCC AAGAATGCTCTGAACATGATGGATCCAGAAAGTTTAGGCGATTCTATGGCCAACATGATGGTGATGGCAATGGATAAGGACCAGGATGGTGTGTTGTCAG AGGAGGAGATTGCCACCATGGTACAATCGGGAGAGACGGAGAACAAAGGCAAAGCAGAGGAGATGTTTGAACAGATGGATGAAGATGGCGATGGAGAGGT AACCAGAGACGAAGCGAAGGTGTACTTTTCAAAACTAGGAAACACCTTGCAAGGCCTTTCAAAAATGATGGGTGGTTCAAAATCAGAGTTGTGATTGGGA GAATCGCTCTGTCAACCGCCTGCGGTGGTGCGGTTGGAAGGTTGAGCTTGAAAGGTGCGAAGCGTCTCTCCATTGGTGTCCGAGATAGCCTGAGATAGCC TGAGATATTTAGGTGATACTGTATCTTCTTGGGTTTTCGGATGCAAATTTTTGACAACAGATCAGAAATCCATCCGAATATCCCGGCCGCCAGGGCAAAA ATCATAGAGTTTCCCTGTGCAGAAGCTGCAAGTTTTGAGTTTTTCTCATTCCATTGGGGCCTGAATTTCAAGAAAATCGTATAGTCTTA CGGACCGCGGCATTTGCCAATTTGCGCGTCGTCGGGGGTCGCCATGATGTTTCGCTTGGCAGGCTTTTTTGCTTTGGCACTGCTGGTCGCGGGAAAGCC CAAGGGTGGCAAAGGTGCAAAAGGAGAACAAGACCCCTTCTCTGAGCTTAGCCGCCTCGCAGACAATTTGAAAGATGCTAAAGAACAGCCGGAGAAGGCC AAGAATGCTCTGAACATGATGGATCCAGAAAGTTTAGGCGATTCTATGGCCAACATGATGGTGATGGCAATGGATAAGGACCAGGATGGTGTGTTGTCAG AGGAGGAGATTGCCACCATGGTACAATCGGGAGAGACGGAGAACAAAGGCAAAGCAGAGGAGATGTTTGAACAGATGGATGAAGATGGCGATGGAGAGGT AACCAGAGACGAAGCGAAGGTGTACTTTTCAAAACTAGGAAACACCTTGCAAGGCCTTTCAAAAATGATGGGTGGTTCAAAATCAGAGTTGTGATTGGGA GAATCGCTCTGTCAACCGCCTGCGGTGGTGCGGTTGGAAGGTTGAGCTTGAAAGGTGCGAAGCGTCTCTCCATTGGTGTCCGAGATAGCCTGAGATAGCC TGAGATATTTAGGTGATACTGTATCTTCTTGGGTTTTCGGATGCAAATTTTTGACAACAGATCAGAAATCCATCCGAATATCCCGGCCGCCAGGGCAAAA ATCATAGAGTTTCCCTGTGCAGAAGCTGCAAGTTTTGAGTTTTTCTCATTCCATTGGGGCCTGAATTTCAAGAAAATCGTATAGTCTTA GGACCGCGGCATTTGCCAATTTGCGCGTCGTCGGGGGTCGCCATGATGTTTCGCTTGGCAGGCTTTTTTGCTTTGGCACTGCTGGTCGCGGGAAAGCC CAAGGGTGGCAAAGGTGCAAAAGGAGAACAAGACCCCTTCTCTGAGCTTAGCCGCCTCGCAGACAATTTGAAAGATGCTAAAGAACAGCCGGAGAAGGCC AAGAATGCTCTGAACATGATGGATCCAGAAAGTTTAGGCGATTCTATGGCCAACATGATGGTGATGGCAATGGATAAGGACCAGGATGGTGTGTTGTCAG AGGAGGAGATTGCCACCATGGTACAATCGGGAGAGACGGAGAACAAAGGCAAAGCAGAGGAGATGTTTGAACAGATGGATGAAGATGGCGATGGAGAGGT 
AACCAGAGACGAAGCGAAGGTGTACTTTTCAAAACTAGGAAACACCTTGCAAGGCCTTTCAAAAATGATGGGTGGTTCAAAATCAGAGTTGTGATTGGGA GAATCGCTCTGTCAACCGCCTGCGGTGGTGCGGTTGGAAGGTTGAGCTTGAAAGGTGCGAAGCGTCTCTCCATTGGTGTCCGAGATAGCCTGAGATAGCC TGAGATATTTAGGTGATACTGTATCTTCTTGGGTTTTCGGATGCAAATTTTTGACAACAGATCAGAAATCCATCCGAATATCCCGGCCGCCAGGGCAAAA ATCATAGAGTTTCCCTGTGCAGAAGCTGCAAGTTTTGAGTTTTTCTCATTCCATTGGGGCCTGAATTTCAAGAAAATCGTATAGTCTTA

    open ($input, $ARGV[0]) or die ("Could not open input file $ARGV[0].\n");
    open ($output, '>output.txt') or die ("Could not open input file output.txt.\n");
    while ($line = <$input>) {
        chomp($line);
        unless ($line =~ m/>/) {
            $line = uc($line);
            $seq .= $line;
        }
    }
    $orf1 = $seq;
    $orf2 = substr($seq, 1);
    $orf3 = substr($seq, 2);

    OK, so the problem is a little different than what I thought earlier. What I'm trying to get is for $seq to be a continuous string of letters, but as you can see, when I use substr it deletes characters on the first line like it should but doesn't shift the other lines (i.e. the GCC on the first line should connect right to the CAA that begins the second line of the output without any spaces or newlines). I'm guessing this is some sort of newline issue? But I used chomp, so I'm confused.
      Added this $line =~ s/\r//; and it fixed the issue. Newlines/carriage returns can be a huge pain.
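For anyone landing here later, a slightly more defensive variant of that fix might look like the sketch below. It is not the code from the thread: `read_sequence` is a hypothetical name, the input is an assumed in-memory sample with Windows line endings, and header lines are matched with `/^>/` rather than the original `m/>/`. Instead of chomp plus a single `s/\r//`, it removes all whitespace from each line, so CR, LF, and stray spaces are handled the same way.

```perl
use strict;
use warnings;

# Join FASTA-style lines into one uppercase sequence, skipping header
# lines and dropping all whitespace (CR, LF, spaces, tabs).
# (Hypothetical helper, not the original poster's code.)
sub read_sequence {
    my ($fh) = @_;
    my $seq = '';
    while (my $line = <$fh>) {
        next if $line =~ /^>/;   # skip ">header" lines
        $line =~ s/\s+//g;       # strip \r, \n and any other whitespace
        $seq .= uc $line;
    }
    return $seq;
}

# Assumed sample input with CRLF line endings, read via an in-memory handle.
my $data = ">header\r\nacgt\r\nttga\r\n";
open my $in, '<', \$data or die "Cannot open in-memory file: $!";
my $seq = read_sequence($in);
close $in;

# The three forward reading frames, as in the original $orf1..$orf3.
my @frames = map { substr($seq, $_) } 0 .. 2;
print "$_\n" for @frames;
```

With the sample above this prints ACGTTTGA, CGTTTGA and GTTTGA, one per line, with no gaps between the joined lines.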