Re: missing character when reading input file
by Eily (Monsignor) on Sep 11, 2018 at 15:38 UTC
|
At first it looked like there were several lines of data (ignoring the first two lines), but there is only one: the pieces of sequence are actually on the same line, separated by spaces. So if what you call "missing characters" are actually spaces in the output, that's because they are already there in the input.
If that's not the issue, you'll have to show the expected output and the one you actually get.
| [reply] |
|
|
As an example:
Input of first line: acggaccgcggcatttgccaatttgcgcgtcgtcgggggtcgccatgatgtttcgcttggcaggcttttttgctttggcactgctggtcgcgggaaagcc
Output of first line: ACGGACCGCGGCATTTGCCAATTTGCGCGTCGTCGGGGGTCGCCATGATGTTTCGCTTGGCAGGCTTTTTTGCTTTGGCA
So the sequence: ctgctggtcgcgggaaagcc is completely gone from my output. Interestingly, the output is 80 chars while the input is 100, so the missing chunk is exactly 20. Is Perl somehow limited to only 80 chars in a line of text?
| [reply] |
|
|
c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
"my $seq = qq{what is wrong with this\x0dline};
print $seq;
dd $seq;
"
line is wrong with this
"what is wrong with this\rline"
Some kind of data dump (Data::Dumper, Data::Dump) might be informative. (Data::Dumper is core and so should already be installed.)
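As a rough sketch of the Data::Dumper route, the following prints a string with control characters shown as escapes. The `visible` helper and the sample line are made up for illustration; `Useqq`, `Terse` and `Indent` are standard Data::Dumper configuration variables.

```perl
use strict;
use warnings;
use Data::Dumper;

# Show control characters as escapes (\r, \n, \t) instead of raw bytes.
$Data::Dumper::Useqq  = 1;
$Data::Dumper::Terse  = 1;   # omit the "$VAR1 = " prefix
$Data::Dumper::Indent = 0;   # keep the dump on one line

# Hypothetical helper: return a printable, escaped form of a string.
sub visible {
    my ($text) = @_;
    return Dumper($text);
}

# A line as it might look after chomp when the file has CRLF line endings
# and is read on a Unix-like system (assumed sample data).
my $line = "acggaccgcg\r";
print visible($line), "\n";
```

A stray carriage return then shows up as a literal `\r` inside the quoted string, rather than silently moving the cursor back to the start of the terminal line.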
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
With a space between the "lines" instead of a carriage return, is Perl not getting each "line" from:
while ($line = <$input>) { }?
Never mind. When I look at the original input file that I'm using, those are carriage returns at the ends of the lines.
| [reply] [d/l] |
When I look at the original input file ... those are carriage returns at the ends of the lines.
Is it literally true that you are seeing "carriage return" (ASCII 0x0d) characters at the ends of your lines? If so, that may be a big part of your problem. On what sort of system do the files originate? Windows? On what sort of system are you processing the files? *nix? Line-ending delimiters are different on these two systems (and on others).
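One way to answer that question without guessing is to read the file as raw bytes and count the endings. This is only a sketch: `count_line_endings` and `slurp_raw` are hypothetical helper names, and the file name is taken from the command line if one is given.

```perl
use strict;
use warnings;

# Hypothetical helper: count CRLF, bare LF, and bare CR endings in raw data.
sub count_line_endings {
    my ($data) = @_;
    my %count = (crlf => 0, lf => 0, cr => 0);
    # "\r\n" is listed first so a CRLF pair is counted once, as a unit.
    while ($data =~ /(\r\n|\n|\r)/g) {
        my $eol = $1;
        if    ($eol eq "\r\n") { $count{crlf}++ }
        elsif ($eol eq "\n")   { $count{lf}++ }
        else                   { $count{cr}++ }
    }
    return \%count;
}

# Read a file as raw bytes so that no I/O layer translates the line endings.
sub slurp_raw {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Could not open $path: $!\n";
    local $/;                      # slurp mode: read the whole file at once
    my $data = <$fh> // '';
    close $fh;
    return $data;
}

if (@ARGV) {
    my $count = count_line_endings(slurp_raw($ARGV[0]));
    printf "CRLF: %d  LF: %d  CR: %d\n", @{$count}{qw(crlf lf cr)};
}
```

A file written on Windows would typically report only CRLF endings, while a native *nix file would report only LF.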
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] |
|
|
Input:
>symbB.v1.2.017277.t1|scaffold1325.1|size176917|3
acggaccgcggcatttgccaatttgcgcgtcgtcgggggtcgccatgatgtttcgcttggcaggcttttttgctttggcactgctggtcgcgggaaagcc
caagggtggcaaaggtgcaaaaggagaacaagaccccttctctgagcttagccgcctcgcagacaatttgaaagatgctaaagaacagccggagaaggcc
aagaatgctctgaacatgatggatccagaaagtttaggcgattctatggccaacatgatggtgatggcaatggataaggaccaggatggtgtgttgtcag
aggaggagattgccaccatggtacaatcgggagagacggagaacaaaggcaaagcagaggagatgtttgaacagatggatgaagatggcgatggagaggt
aaccagagacgaagcgaaggtgtacttttcaaaactaggaaacaccttgcaaggcctttcaaaaatgatgggtggttcaaaatcagagttgtgattggga
gaatcgctctgtcaaccgcctgcggtggtgcggttggaaggttgagcttgaaaggtgcgaagcgtctctccattggtgtccgagatagcctgagatagcc
tgagatatttaggtgatactgtatcttcttgggttttcggatgcaaatttttgacaacagatcagaaatccatccgaatatcccggccgccagggcaaaa
atcatagagtttccctgtgcagaagctgcaagttttgagtttttctcattccattggggcctgaatttcaagaaaatcgtatagtctta
Output:
ACGGACCGCGGCATTTGCCAATTTGCGCGTCGTCGGGGGTCGCCATGATGTTTCGCTTGGCAGGCTTTTTTGCTTTGGCACTGCTGGTCGCGGGAAAGCC
CAAGGGTGGCAAAGGTGCAAAAGGAGAACAAGACCCCTTCTCTGAGCTTAGCCGCCTCGCAGACAATTTGAAAGATGCTAAAGAACAGCCGGAGAAGGCC
AAGAATGCTCTGAACATGATGGATCCAGAAAGTTTAGGCGATTCTATGGCCAACATGATGGTGATGGCAATGGATAAGGACCAGGATGGTGTGTTGTCAG
AGGAGGAGATTGCCACCATGGTACAATCGGGAGAGACGGAGAACAAAGGCAAAGCAGAGGAGATGTTTGAACAGATGGATGAAGATGGCGATGGAGAGGT
AACCAGAGACGAAGCGAAGGTGTACTTTTCAAAACTAGGAAACACCTTGCAAGGCCTTTCAAAAATGATGGGTGGTTCAAAATCAGAGTTGTGATTGGGA
GAATCGCTCTGTCAACCGCCTGCGGTGGTGCGGTTGGAAGGTTGAGCTTGAAAGGTGCGAAGCGTCTCTCCATTGGTGTCCGAGATAGCCTGAGATAGCC
TGAGATATTTAGGTGATACTGTATCTTCTTGGGTTTTCGGATGCAAATTTTTGACAACAGATCAGAAATCCATCCGAATATCCCGGCCGCCAGGGCAAAA
ATCATAGAGTTTCCCTGTGCAGAAGCTGCAAGTTTTGAGTTTTTCTCATTCCATTGGGGCCTGAATTTCAAGAAAATCGTATAGTCTTA
CGGACCGCGGCATTTGCCAATTTGCGCGTCGTCGGGGGTCGCCATGATGTTTCGCTTGGCAGGCTTTTTTGCTTTGGCACTGCTGGTCGCGGGAAAGCC
CAAGGGTGGCAAAGGTGCAAAAGGAGAACAAGACCCCTTCTCTGAGCTTAGCCGCCTCGCAGACAATTTGAAAGATGCTAAAGAACAGCCGGAGAAGGCC
AAGAATGCTCTGAACATGATGGATCCAGAAAGTTTAGGCGATTCTATGGCCAACATGATGGTGATGGCAATGGATAAGGACCAGGATGGTGTGTTGTCAG
AGGAGGAGATTGCCACCATGGTACAATCGGGAGAGACGGAGAACAAAGGCAAAGCAGAGGAGATGTTTGAACAGATGGATGAAGATGGCGATGGAGAGGT
AACCAGAGACGAAGCGAAGGTGTACTTTTCAAAACTAGGAAACACCTTGCAAGGCCTTTCAAAAATGATGGGTGGTTCAAAATCAGAGTTGTGATTGGGA
GAATCGCTCTGTCAACCGCCTGCGGTGGTGCGGTTGGAAGGTTGAGCTTGAAAGGTGCGAAGCGTCTCTCCATTGGTGTCCGAGATAGCCTGAGATAGCC
TGAGATATTTAGGTGATACTGTATCTTCTTGGGTTTTCGGATGCAAATTTTTGACAACAGATCAGAAATCCATCCGAATATCCCGGCCGCCAGGGCAAAA
ATCATAGAGTTTCCCTGTGCAGAAGCTGCAAGTTTTGAGTTTTTCTCATTCCATTGGGGCCTGAATTTCAAGAAAATCGTATAGTCTTA
GGACCGCGGCATTTGCCAATTTGCGCGTCGTCGGGGGTCGCCATGATGTTTCGCTTGGCAGGCTTTTTTGCTTTGGCACTGCTGGTCGCGGGAAAGCC
CAAGGGTGGCAAAGGTGCAAAAGGAGAACAAGACCCCTTCTCTGAGCTTAGCCGCCTCGCAGACAATTTGAAAGATGCTAAAGAACAGCCGGAGAAGGCC
AAGAATGCTCTGAACATGATGGATCCAGAAAGTTTAGGCGATTCTATGGCCAACATGATGGTGATGGCAATGGATAAGGACCAGGATGGTGTGTTGTCAG
AGGAGGAGATTGCCACCATGGTACAATCGGGAGAGACGGAGAACAAAGGCAAAGCAGAGGAGATGTTTGAACAGATGGATGAAGATGGCGATGGAGAGGT
AACCAGAGACGAAGCGAAGGTGTACTTTTCAAAACTAGGAAACACCTTGCAAGGCCTTTCAAAAATGATGGGTGGTTCAAAATCAGAGTTGTGATTGGGA
GAATCGCTCTGTCAACCGCCTGCGGTGGTGCGGTTGGAAGGTTGAGCTTGAAAGGTGCGAAGCGTCTCTCCATTGGTGTCCGAGATAGCCTGAGATAGCC
TGAGATATTTAGGTGATACTGTATCTTCTTGGGTTTTCGGATGCAAATTTTTGACAACAGATCAGAAATCCATCCGAATATCCCGGCCGCCAGGGCAAAA
ATCATAGAGTTTCCCTGTGCAGAAGCTGCAAGTTTTGAGTTTTTCTCATTCCATTGGGGCCTGAATTTCAAGAAAATCGTATAGTCTTA
open ($input, $ARGV[0]) or die ("Could not open input file $ARGV[0].\n");
open ($output, '>output.txt') or die ("Could not open output file output.txt.\n");
while ($line = <$input>) {
    chomp($line);
    unless ($line =~ m/>/) {
        $line = uc($line);
        $seq .= $line;
    }
}
$orf1 = $seq;
$orf2 = substr($seq, 1);
$orf3 = substr($seq, 2);
OK, so the problem is a little different from what I thought earlier. What I'm trying to get is for $seq to be one continuous string of letters. As you can see, though, when I use substr it removes characters from the start of the first line as it should, but the other lines don't shift (i.e. the GCC at the end of the first line should connect directly to the CAA that begins the second line of the output, without any spaces or newlines). I'm guessing this is some sort of newline issue? But I used chomp, so I'm confused.
| [reply] [d/l] |
|
|
Added this $line =~ s/\r//; and it fixed the issue. Newlines/carriage returns can be a huge pain.
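The reason chomp alone was not enough: it removes only the current value of $/, which defaults to "\n", so the "\r" of each CRLF pair stays in the string. A minimal sketch of the difference, using made-up four-letter lines and a hypothetical `strip_eol` helper that anchors the substitution at the end of the line:

```perl
use strict;
use warnings;

# Hypothetical helper: strip one trailing line ending, whether LF or CRLF.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    return $line;
}

# Two lines as they would be read from a CRLF file on a Unix-like system
# (assumed sample data).
my @lines = ("acgg\r\n", "ccaa\r\n");

# chomp removes only "\n" (the default $/), so each "\r" stays in the string.
my $chomped = join '', map { my $copy = $_; chomp $copy; $copy } @lines;

# Stripping the optional "\r" as well gives one continuous sequence.
my $clean = join '', map { strip_eol($_) } @lines;

printf "chomp only: %d characters\n", length($chomped);
printf "strip_eol:  %d characters\n", length($clean);
print  "second reading frame: ", substr(uc($clean), 1), "\n";
```

With the carriage returns gone, substr shifts the whole sequence rather than only the part before the first "\r".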
| [reply] [d/l] |
Re: missing character when reading input file
by BillKSmith (Monsignor) on Sep 12, 2018 at 04:03 UTC
|
It now appears that your data file was created on a Windows system and you are reading it on a Unix-like system. Perl can handle this, but you have to tell it so by using I/O layers in the open statement.
(Refer to perlio.)
use strict;
use warnings;
BEGIN {
    my $win_file    # memory file simulates windows file.
        = ">symbB.v1.2.017277.t1|scaffold1325.1|size176917|3\r\n"
        . "acggaccgcggcatttgccaatttgcgcgt"
        . "cgtcgggggtcgccatgatgtttcgcttgg"
        . "caggcttttttgctttggcactgctggtcg"
        . "cgggaaagcc\r\n"
        . "caagggtggcaaaggtgcaaaaggagaaca"
        . "agaccccttctctgagcttagccgcctcgc"
        . "agacaatttgaaagatgctaaagaacagcc"
        . "ggagaaggcc\r\n"
        . "aagaatgctctgaacatgatggatccagaa"
        . "agtttaggcgattctatggccaacatgatg"
        . "gtgatggcaatggataaggaccaggatggt"
        . "gtgttgtcag\r\n"
        ;
    $ARGV[0] = \do{$win_file};
}
open( my $input, '<:crlf', $ARGV[0] )
    or die( "Could not open input file $ARGV[0].\n" );
my $seq;
while ( my $line = <$input> ) {
    chomp($line);
    unless ( $line =~ m/>/ ) {
        $line = uc($line);
        $seq .= $line;
    }
}
print "Length of \$seq is ", length($seq), " characters\n";
| [reply] [d/l] |
|
|
I agree with your example of using open to handle a CRLFish file on a *nix system. The rest of this post is just to satisfy my curiosity.
$ARGV[0] = \do{$win_file};
I don't understand the purpose of munging @ARGV in this way. Any scalar can be opened by reference as a RAM file. If the reason for initializing the scalar in a BEGIN block was to create a lexically private scalar, then assigning a reference to it to an element of the global @ARGV array defeats this purpose.
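For comparison, a minimal sketch of opening the scalar directly by reference, with no detour through @ARGV. The shortened header and sequence lines are assumed sample data, not the original file.

```perl
use strict;
use warnings;

# Simulated Windows file held in a lexical scalar (assumed sample data).
my $win_file = ">header\r\nacgg\r\nccaa\r\n";

# Open the scalar by reference; the :crlf layer turns "\r\n" into "\n" on read.
open(my $input, '<:crlf', \$win_file)
    or die "Could not open in-memory file: $!\n";

my $seq = '';
while (my $line = <$input>) {
    chomp $line;
    next if $line =~ /^>/;    # skip the FASTA header line
    $seq .= uc $line;
}
close $input;

print "Sequence: $seq (", length($seq), " characters)\n";
```

The scalar stays lexically private here, and the read loop is the same as it would be for a file on disk.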
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
|
|
The array @ARGV was used to preserve as much of the original post as possible. The 'BEGIN' is unnecessary, but I feel that it serves to separate the file simulation from the relevant code. There is no excuse for the 'do' block; it is a leftover from an earlier attempt at the file simulation.
| [reply] |
Hey, that's a cool new trick I learned today: setting $ARGV[0] = \do{$win_file};
I like your method as an easy way to produce a CRLF file on a non-Windows machine (e.g. for tests).
| [reply] [d/l] |