missing character when reading input file

JediGorf has asked for the wisdom of the Perl Monks concerning the following question:

Re: missing character when reading input file by Eily (Monsignor) on Sep 11, 2018 at 15:38 UTC
At first it looked like there were several lines of data (ignoring the first two lines), but there is only one: the pieces of sequence are actually on the same line, separated by spaces. So if what you call "missing characters" are actually spaces in the output, that's because they are already there in the input. If that's not the issue, you'll have to show the expected output and the output you actually get.
Re^2: missing character when reading input file by JediGorf (Initiate) on Sep 11, 2018 at 15:47 UTC
As an example: Input of first line: acggaccgcggcatttgccaatttgcgcgtcgtcgggggtcgccatgatgtttcgcttggcaggcttttttgctttggcactgctggtcgcgggaaagcc Output of first line: ACGGACCGCGGCATTTGCCAATTTGCGCGTCGTCGGGGGTCGCCATGATGTTTCGCTTGGCAGGCTTTTTTGCTTTGGCA So the sequence ctgctggtcgcgggaaagcc is completely gone from my output. Interestingly, the output is 80 chars while the input is 100, so the missing chunk is exactly 20. Is Perl somehow limited to only 80 chars in a line of text?
Re^3: missing character when reading input file by AnomalousMonk (Archbishop) on Sep 11, 2018 at 16:10 UTC
Still not a lot of info to go on, but let me suggest you have "invisible" characters lurking in your input: `c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le "my $seq = qq{what is wrong with this\x0dline}; print $seq; dd $seq; "` prints `line is wrong with this` (the carriage return sends the cursor back to the start of the line, so "line" overwrites "what") followed by `"what is wrong with this\rline"`. Some kind of data dump (Data::Dumper, Data::Dump) might be informative. (`Data::Dumper` is core and so should already be installed.) Give a man a fish: `<%-{-{-{-<`
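A minimal sketch of the data-dump suggestion using only core `Data::Dumper`. The sample line is made up for illustration, and setting `$Data::Dumper::Useqq` is just one way to make control characters show up as escapes:

```perl
use strict;
use warnings;
use Data::Dumper;

# Print control characters as escapes (\r, \n, \t) instead of raw bytes.
$Data::Dumper::Useqq = 1;
$Data::Dumper::Terse = 1;    # omit the leading "$VAR1 = "

# Hypothetical sample: a line that still carries a Windows-style CR.
my $line = "acggaccgcg\r\n";
chomp $line;                 # removes only the trailing "\n" (the default $/)

my $dump = Dumper($line);
print $dump;                 # the leftover \r is now visible as an escape
```

If the dump of a line read from the real file shows a `\r` before the closing quote, the file has CRLF line endings that `chomp` alone did not remove.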
Re^4: missing character when reading input file by Eily (Monsignor) on Sep 11, 2018 at 16:19 UTC
Re^3: missing character when reading input file by Eily (Monsignor) on Sep 11, 2018 at 15:59 UTC
Perl has no such limitation, so it must come from the way you display your output. I'm guessing this is on a terminal? If so, try redirecting the output to a file instead.
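One way to tell a Perl problem from a display problem is to print the length of what was read rather than the string itself. A small sketch, assuming a made-up 100-character line held in memory instead of the real input file:

```perl
use strict;
use warnings;

# Hypothetical 100-character line, built in memory instead of read from disk.
my $data = ('acgt' x 25) . "\n";
open(my $in, '<', \$data) or die "Could not open in-memory file: $!";

my $line = <$in>;
chomp $line;
my $len = length $line;
print "Read $len characters\n";    # Read 100 characters
close $in;
```

If the reported length matches the input but the terminal shows fewer characters, the truncation is happening in the display, not in Perl.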
Re^3: missing character when reading input file by AnomalousMonk (Archbishop) on Sep 11, 2018 at 19:28 UTC
Further to Eily's post: 80 chars really sounds like a display limit ... JediGorf: Is the truncation always at 80 characters? Give a man a fish: `<%-{-{-{-<`
Re^2: missing character when reading input file by JediGorf (Initiate) on Sep 11, 2018 at 15:53 UTC
With a space between the "lines" instead of a carriage return, is Perl not getting each "line" from: `while ($line = <$input>) { }`? Never mind. When I look at the original input file that I'm using, those are carriage returns at the ends of the lines.
Re^3: missing character when reading input file by poj (Abbot) on Sep 11, 2018 at 16:03 UTC
Is that the complete input file? If not, how long is the last line? poj
Re^3: missing character when reading input file by AnomalousMonk (Archbishop) on Sep 12, 2018 at 00:05 UTC
"When I look at the original input file ... those are carriage returns at the ends of the lines." Is it literally true that you are seeing "carriage return" (ASCII 0x0d) characters at the ends of your lines? If so, that may be a big part of your problem. On what sort of system do the files originate? Windows? On what sort of system are you processing the files? *nix? Line-ending delimiters are different on these two systems (and on others). Give a man a fish: `<%-{-{-{-<`
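To answer the line-ending question without guessing, the endings can be counted directly. This is only a sketch: the two-line sample is invented, and the `:raw` layer is used so that no translation happens before the check:

```perl
use strict;
use warnings;

# Hypothetical sample data: two lines with Windows (CRLF) endings.
my $data = "acgt\r\nttga\r\n";
open(my $in, '<:raw', \$data) or die "Could not open in-memory file: $!";

my ($crlf, $lf_only) = (0, 0);
while (my $line = <$in>) {
    if    ($line =~ /\r\n\z/) { $crlf++ }       # CR followed by LF: Windows style
    elsif ($line =~ /\n\z/)   { $lf_only++ }    # bare LF: Unix style
}
close $in;
print "CRLF lines: $crlf, LF-only lines: $lf_only\n";
```

Replacing `\$data` with a real file name would run the same check against the actual input.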
Re^2: missing character when reading input file by JediGorf (Initiate) on Sep 12, 2018 at 00:33 UTC
Input: >symbB.v1.2.017277.t1\|scaffold1325.1\|size176917\|3 acggaccgcggcatttgccaatttgcgcgtcgtcgggggtcgccatgatgtttcgcttggcaggcttttttgctttggcactgctggtcgcgggaaagcc caagggtggcaaaggtgcaaaaggagaacaagaccccttctctgagcttagccgcctcgcagacaatttgaaagatgctaaagaacagccggagaaggcc aagaatgctctgaacatgatggatccagaaagtttaggcgattctatggccaacatgatggtgatggcaatggataaggaccaggatggtgtgttgtcag aggaggagattgccaccatggtacaatcgggagagacggagaacaaaggcaaagcagaggagatgtttgaacagatggatgaagatggcgatggagaggt aaccagagacgaagcgaaggtgtacttttcaaaactaggaaacaccttgcaaggcctttcaaaaatgatgggtggttcaaaatcagagttgtgattggga gaatcgctctgtcaaccgcctgcggtggtgcggttggaaggttgagcttgaaaggtgcgaagcgtctctccattggtgtccgagatagcctgagatagcc tgagatatttaggtgatactgtatcttcttgggttttcggatgcaaatttttgacaacagatcagaaatccatccgaatatcccggccgccagggcaaaa atcatagagtttccctgtgcagaagctgcaagttttgagtttttctcattccattggggcctgaatttcaagaaaatcgtatagtctta Output: ACGGACCGCGGCATTTGCCAATTTGCGCGTCGTCGGGGGTCGCCATGATGTTTCGCTTGGCAGGCTTTTTTGCTTTGGCACTGCTGGTCGCGGGAAAGCC CAAGGGTGGCAAAGGTGCAAAAGGAGAACAAGACCCCTTCTCTGAGCTTAGCCGCCTCGCAGACAATTTGAAAGATGCTAAAGAACAGCCGGAGAAGGCC AAGAATGCTCTGAACATGATGGATCCAGAAAGTTTAGGCGATTCTATGGCCAACATGATGGTGATGGCAATGGATAAGGACCAGGATGGTGTGTTGTCAG AGGAGGAGATTGCCACCATGGTACAATCGGGAGAGACGGAGAACAAAGGCAAAGCAGAGGAGATGTTTGAACAGATGGATGAAGATGGCGATGGAGAGGT AACCAGAGACGAAGCGAAGGTGTACTTTTCAAAACTAGGAAACACCTTGCAAGGCCTTTCAAAAATGATGGGTGGTTCAAAATCAGAGTTGTGATTGGGA GAATCGCTCTGTCAACCGCCTGCGGTGGTGCGGTTGGAAGGTTGAGCTTGAAAGGTGCGAAGCGTCTCTCCATTGGTGTCCGAGATAGCCTGAGATAGCC TGAGATATTTAGGTGATACTGTATCTTCTTGGGTTTTCGGATGCAAATTTTTGACAACAGATCAGAAATCCATCCGAATATCCCGGCCGCCAGGGCAAAA ATCATAGAGTTTCCCTGTGCAGAAGCTGCAAGTTTTGAGTTTTTCTCATTCCATTGGGGCCTGAATTTCAAGAAAATCGTATAGTCTTA CGGACCGCGGCATTTGCCAATTTGCGCGTCGTCGGGGGTCGCCATGATGTTTCGCTTGGCAGGCTTTTTTGCTTTGGCACTGCTGGTCGCGGGAAAGCC CAAGGGTGGCAAAGGTGCAAAAGGAGAACAAGACCCCTTCTCTGAGCTTAGCCGCCTCGCAGACAATTTGAAAGATGCTAAAGAACAGCCGGAGAAGGCC AAGAATGCTCTGAACATGATGGATCCAGAAAGTTTAGGCGATTCTATGGCCAACATGATGGTGATGGCAATGGATAAGGACCAGGATGGTGTGTTGTCAG 
AGGAGGAGATTGCCACCATGGTACAATCGGGAGAGACGGAGAACAAAGGCAAAGCAGAGGAGATGTTTGAACAGATGGATGAAGATGGCGATGGAGAGGT AACCAGAGACGAAGCGAAGGTGTACTTTTCAAAACTAGGAAACACCTTGCAAGGCCTTTCAAAAATGATGGGTGGTTCAAAATCAGAGTTGTGATTGGGA GAATCGCTCTGTCAACCGCCTGCGGTGGTGCGGTTGGAAGGTTGAGCTTGAAAGGTGCGAAGCGTCTCTCCATTGGTGTCCGAGATAGCCTGAGATAGCC TGAGATATTTAGGTGATACTGTATCTTCTTGGGTTTTCGGATGCAAATTTTTGACAACAGATCAGAAATCCATCCGAATATCCCGGCCGCCAGGGCAAAA ATCATAGAGTTTCCCTGTGCAGAAGCTGCAAGTTTTGAGTTTTTCTCATTCCATTGGGGCCTGAATTTCAAGAAAATCGTATAGTCTTA GGACCGCGGCATTTGCCAATTTGCGCGTCGTCGGGGGTCGCCATGATGTTTCGCTTGGCAGGCTTTTTTGCTTTGGCACTGCTGGTCGCGGGAAAGCC CAAGGGTGGCAAAGGTGCAAAAGGAGAACAAGACCCCTTCTCTGAGCTTAGCCGCCTCGCAGACAATTTGAAAGATGCTAAAGAACAGCCGGAGAAGGCC AAGAATGCTCTGAACATGATGGATCCAGAAAGTTTAGGCGATTCTATGGCCAACATGATGGTGATGGCAATGGATAAGGACCAGGATGGTGTGTTGTCAG AGGAGGAGATTGCCACCATGGTACAATCGGGAGAGACGGAGAACAAAGGCAAAGCAGAGGAGATGTTTGAACAGATGGATGAAGATGGCGATGGAGAGGT AACCAGAGACGAAGCGAAGGTGTACTTTTCAAAACTAGGAAACACCTTGCAAGGCCTTTCAAAAATGATGGGTGGTTCAAAATCAGAGTTGTGATTGGGA GAATCGCTCTGTCAACCGCCTGCGGTGGTGCGGTTGGAAGGTTGAGCTTGAAAGGTGCGAAGCGTCTCTCCATTGGTGTCCGAGATAGCCTGAGATAGCC TGAGATATTTAGGTGATACTGTATCTTCTTGGGTTTTCGGATGCAAATTTTTGACAACAGATCAGAAATCCATCCGAATATCCCGGCCGCCAGGGCAAAA ATCATAGAGTTTCCCTGTGCAGAAGCTGCAAGTTTTGAGTTTTTCTCATTCCATTGGGGCCTGAATTTCAAGAAAATCGTATAGTCTTA `open ($input, $ARGV[0]) or die ("Could not open input file $ARGV[0].\n"); open ($output, '>output.txt') or die ("Could not open output file output.txt.\n"); while ($line = <$input>) { chomp($line); unless ($line =~ m/>/) { $line = uc($line); $seq .= $line; } } $orf1 = $seq; $orf2 = substr($seq, 1); $orf3 = substr($seq, 2);` OK, so the problem is a little different from what I thought earlier. What I'm trying to get is for $seq to be a continuous string of letters, but as you can see, when I use substr it deletes characters on the first line like it should but doesn't shift the other lines (i.e. the GCC on the first line should connect right to the CAA that begins the second line of the output without any spaces or newlines). I'm guessing this is some sort of newline issue? But I used chomp, so I'm confused.
Re^3: missing character when reading input file by JediGorf (Initiate) on Sep 12, 2018 at 00:47 UTC
Added this `$line =~ s/\r//;` and it fixed the issue. Newlines/carriage returns can be a huge pain.
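For reference, a sketch of the same fix folded into the read loop. The sample data and the `/^>/` header test are assumptions for illustration (the original code used `m/>/`), and the single substitution `s/\r?\n\z//` stands in for `chomp` plus `s/\r//` so that only an end-of-line CR is touched:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the input file: FASTA-style header plus CRLF lines.
my $data = ">header\r\nacgg\r\nccaa\r\n";
open(my $input, '<', \$data) or die "Could not open in-memory file: $!";

my $seq = '';
while (my $line = <$input>) {
    $line =~ s/\r?\n\z//;     # strip LF or CRLF at the end of the line
    next if $line =~ /^>/;    # skip header lines
    $seq .= uc $line;
}
close $input;

# The three reading frames now come from one continuous string.
my $orf1 = $seq;
my $orf2 = substr($seq, 1);
my $orf3 = substr($seq, 2);
print "$orf1\n$orf2\n$orf3\n";
```

With the CRs gone, `substr` shifts the whole sequence rather than just the first line.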
Re: missing character when reading input file by BillKSmith (Monsignor) on Sep 12, 2018 at 04:03 UTC
It now appears that your data file was created on a Windows system and you are reading it on a Unix-like system. Perl can handle this, but you have to tell it so by using I/O layers in the open statement (refer to perlio).

    use strict;
    use warnings;

    BEGIN {
        my $win_file    # memory file simulates windows file.
            = ">symbB.v1.2.017277.t1|scaffold1325.1|size176917|3\r\n"
            . "acggaccgcggcatttgccaatttgcgcgt"
            . "cgtcgggggtcgccatgatgtttcgcttgg"
            . "caggcttttttgctttggcactgctggtcg"
            . "cgggaaagcc\r\n"
            . "caagggtggcaaaggtgcaaaaggagaaca"
            . "agaccccttctctgagcttagccgcctcgc"
            . "agacaatttgaaagatgctaaagaacagcc"
            . "ggagaaggcc\r\n"
            . "aagaatgctctgaacatgatggatccagaa"
            . "agtttaggcgattctatggccaacatgatg"
            . "gtgatggcaatggataaggaccaggatggt"
            . "gtgttgtcag\r\n"
            ;
        $ARGV[0] = \do{$win_file};
    }

    open( my $input, '<:crlf', $ARGV[0] )
        or die( "Could not open input file $ARGV[0].\n" );
    my $seq;
    while ( my $line = <$input> ) {
        chomp($line);
        unless ( $line =~ m/>/ ) {
            $line = uc($line);
            $seq .= $line;
        }
    }
    print "Length of \$seq is ", length($seq), " characters\n";

Bill
Re^2: [OT]: missing character when reading input file by AnomalousMonk (Archbishop) on Sep 12, 2018 at 17:41 UTC
I agree with your example of using open to handle a CRLFish file on a *nix system. The rest of this post is just to satisfy my curiosity. `$ARGV[0] = \do{$win_file};` I don't understand the purpose of munging `@ARGV` in this way. Any scalar can be opened by reference as a RAM file. If the reason for initializing the scalar in a `BEGIN` block was to create a lexically private scalar, then assigning a reference to it to an element of the global `@ARGV` array defeats this purpose. Give a man a fish: `<%-{-{-{-<`
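A sketch of what opening the scalar directly might look like, without going through `@ARGV` or a `BEGIN` block. The short sample string and the `/^>/` header test are illustrative assumptions, not taken from the code under discussion:

```perl
use strict;
use warnings;

# Simulated Windows file held in an ordinary scalar (hypothetical sample data).
my $win_file = ">header\r\nacgt\r\nttga\r\n";

# Open the scalar by reference; the :crlf layer translates CRLF to LF on read.
open(my $input, '<:crlf', \$win_file)
    or die "Could not open in-memory file: $!";

my $seq = '';
while (my $line = <$input>) {
    chomp $line;
    next if $line =~ /^>/;    # skip the header line
    $seq .= uc $line;
}
close $input;
print "Length of \$seq is ", length($seq), " characters\n";
```

The scalar stays lexically scoped to the file, and the `open` call reads the same way it would for a real path.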
Re^3: [OT]: missing character when reading input file by BillKSmith (Monsignor) on Sep 12, 2018 at 18:36 UTC
The array @ARGV was used to preserve as much of the original post as possible. The `BEGIN` is unnecessary, but I feel that it serves to separate the file simulation from the relevant code. There is no excuse for the `do` block; it is a leftover from an earlier attempt at the file simulation. Bill
Re^4: [OT]: missing character when reading input file by AnomalousMonk (Archbishop) on Sep 12, 2018 at 18:56 UTC
Re^2: missing character when reading input file by bliako (Abbot) on Sep 12, 2018 at 09:03 UTC
Hey, that's a cool new trick I learned today: setting `$ARGV[0] = \do{$win_file};`. I like your method as an easy way to produce a CRLF file on a non-Windows machine (e.g. for tests).