rjbioinf has asked for the wisdom of the Perl Monks concerning the following question:
Dear Monks,
I am having difficulty reading in lines from a txt file that contains text that was copied from a pdf file.
I shortened the text file to 3 lines to take up less space. It looks as follows:
AA 12
BB 34
CC 56
I want to read each line, one line at a time. However, I cannot find a way to do this. I checked the txt file in a hex editor, and it shows that there is a carriage return at the end of each line. I tried to deal with this as shown below, but it only prints the final line, and something strange happens: bits of another line end up in the output and the preceding '>' is not printed.
open( FH, $f ) or die;
while( my $str = <FH> ){
    $str =~ s/\r\n//g;
    print ">$str<\n";
}
close(FH);
# Output:
CCA 56<
If I change s/\r\n//g; to s/\r//g; then it prints everything:
# Output:
>AA 12BB 34CC 56<
I also tried s/[^[:ascii:]]//g; and tr/\x80-\xFF//d; but they do not solve the problem.
Some strange invisible or non-ascii characters from the pdf file are likely the cause of this, but I am now stumped as to how to solve this problem.
Obviously, an answer is "Do not copy text from pdf files!", but I hope someone can help me out with a Perl solution. My workaround at the moment is to read the contents of the file into a matrix in R (the language) and then export that matrix to a file, which Perl then has no trouble reading one line at a time.
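A note on the symptoms described above: because s/\r//g produces a single ">...<" pair containing all three records, the while loop appears to run only once. That is consistent with a file whose lines end in a bare carriage return (no line feed), so that <FH> with the default record separator "\n" returns the whole file as one "line". Printing the embedded "\r" characters to a terminal would then make later text overwrite earlier text, which could explain the garbled output. The following is a minimal sketch under that assumption, not a confirmed diagnosis of the original file; the helper name read_lines and the file name sample.txt are hypothetical and chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: slurp a file and split it on any common line ending
# (CRLF, bare LF, or bare CR), returning the lines without terminators.
sub read_lines {
    my ($path) = @_;
    open( my $fh, '<:raw', $path ) or die "Cannot open $path: $!";
    local $/;                      # undef record separator: read the whole file
    my $content = <$fh>;
    close($fh);
    return () unless defined $content && length $content;
    # Try CRLF first so that "\r\n" counts as one terminator, not two.
    return split /\r\n|\n|\r/, $content;
}

# Demo: write a sample file with bare-CR line endings, then read it back.
my $file = 'sample.txt';           # hypothetical file name
open( my $out, '>:raw', $file ) or die "Cannot write $file: $!";
print {$out} "AA 12\rBB 34\rCC 56\r";
close($out);

print ">$_<\n" for read_lines($file);
# Output:
# >AA 12<
# >BB 34<
# >CC 56<
```

Note that split drops trailing empty fields, so blank lines at the very end of the file are discarded, while blank lines in the middle are kept as empty strings. If the file is known to use bare carriage returns throughout, an alternative is to keep the original while loop and set local $/ = "\r"; before reading, then chomp each line.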
Re: Dealing with non-ascii characters when reading file.
by AnomalousMonk (Archbishop) on Sep 25, 2014 at 09:38 UTC

Re: Dealing with non-ascii characters when reading file.
by graff (Chancellor) on Sep 26, 2014 at 02:46 UTC

Re: Dealing with non-ascii characters when reading file.
by kzwix (Sexton) on Sep 25, 2014 at 11:42 UTC

Re: Dealing with non-ascii characters when reading file.
by Anonymous Monk on Sep 25, 2014 at 09:02 UTC