Dealing with non-ascii characters when reading file.

rjbioinf has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I am having difficulty reading in lines from a txt file that contains text that was copied from a pdf file.

I shortened the text file to 3 lines to take up less space. It looks as follows:

AA    12
BB    34
CC    56

I want to read each line, one line at a time. However, I cannot find a way to do this. I checked the txt file in a hex editor, and it shows that there is a carriage return at the end of each line. I tried to deal with this as shown below, but it only prints the final line, with bits of another line inserted into it, and it fails to print the preceding '>'.

open( FH, $f ) or die;

while( my $str = <FH> ){
    
    $str =~ s/\r\n//g;
    print ">$str<\n";
}
close(FH)

# Output:

CCA 56<

If I change s/\r\n//g; to s/\r//g; then it prints everything:

# Output:

>AA 12BB 34CC 56<

I also tried s/[^[:ascii:]]//g; and tr/\x80-\xFF//d; but they do not solve the problem.

Some strange invisible or non-ASCII characters from the pdf file are likely the cause of this, but I am now stumped as to how to solve this problem.

Obviously, an answer is "Do not copy text from pdf files!", but I hope someone can help me out with a Perl solution. My workaround at the moment is to read the contents of the file into a matrix in R (the language) and then export that matrix to a file, which Perl then has no trouble reading one line at a time.


Re: Dealing with non-ascii characters when reading file. by AnomalousMonk (Archbishop) on Sep 25, 2014 at 09:38 UTC
"If I change s/\r\n//g; to s/\r//g; then it prints everything ..."

The substitution `s/\r\n//g` replaces every exact `\r\n` sequence with the empty string, and apparently no such sequence is present, since you say no change occurs. This might have worked better had you used the `[\r\n]` character class instead.

The substitution `s/\r//g` removes all `\r` (carriage-return) characters, which are, I think, the root of your problem.

The substitution `s/[^[:ascii:]]//g` removes all non-ASCII characters, but `\r` and `\n` are ASCII characters! The transliteration `tr/\x80-\xFF//d` removes all 8-bit characters outside the ASCII range, but `\r` and `\n` are still ASCII characters.

Update: Adding code example:

c:\@Work\Perl\monks>perl -wMstrict -le
"my $s = qq{ab-cd\tef\n\x80\x81\xaa\xab\xacfoo\rbar};
 print qq{[[$s]] \n};
 ;;
 $s =~ s/\r//g;
 print qq{[[$s]] \n};
 ;;
 $s =~ s/[^[:ascii:]]//g;
 print qq{[[$s]] \n};
"
[[ab-cd   ef
bar]] oo
[[ab-cd   ef
И狠審oobar]]
[[ab-cd   ef
foobar]]
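
As a rough illustration of the point about `\r` and `\n`, here is a minimal sketch that splits a chunk of text into lines whatever the terminator is. It assumes the file content is already in a single string (which is what happens when the lines are separated by bare carriage returns), and it uses an alternation rather than the plain `[\r\n]` class so that a `\r\n` pair counts as one line break. The helper name `split_lines` and the sample data are hypothetical, not taken from the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: split text into lines, accepting "\r\n",
# bare "\r", or bare "\n" as the line terminator.
sub split_lines {
    my ($text) = @_;
    # "\r\n" is tried first so a CRLF pair is not treated as two breaks.
    return split /\r\n|\r|\n/, $text;
}

# Sample data imitating the three-line file with bare CR line endings.
my $data = "AA\t12\rBB\t34\rCC\t56\r";

for my $line ( split_lines($data) ) {
    print ">$line<\n";
}
```

Because `split` discards trailing empty fields, the final terminator does not produce an extra empty line.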
Re: Dealing with non-ascii characters when reading file. by graff (Chancellor) on Sep 26, 2014 at 02:46 UTC
If I think there's something hinky about a file because it contains "unexpected" byte values, I would check its inventory of byte values, with something like this:

#!/usr/bin/perl
use strict;
use warnings;

die "Usage: $0 file.name\n" unless ( @ARGV == 1 and -f $ARGV[0] );

# Read the whole file as raw bytes
open( my $fh, '<', $ARGV[0] ) or die "Cannot open $ARGV[0]: $!\n";
binmode $fh;
local $/ = undef;    # slurp mode
$_ = <$fh>;
close $fh;

# Count each byte value, keyed by its two-digit hex code
my %char_hist;
for my $c ( split // ) {
    $char_hist{ sprintf( "%02x", ord( $c )) }++;
}
for my $c ( sort keys %char_hist ) {
    printf "%s\t%d\n", $c, $char_hist{$c};
}

(That's just a toy version to try it out on files that aren't seriously large. I'd do it differently for general use.)

It's sometimes surprising what you can learn about a file just by looking at a histogram of its byte values: seeing which values occur, and which ones don't. (If you happen to know that a file contains utf8-encoded text, you can learn a lot by looking at a histogram of its Unicode characters. I posted a script for that too: unichist -- count/summarize characters in data.)
Re: Dealing with non-ascii characters when reading file. by kzwix (Sexton) on Sep 25, 2014 at 11:42 UTC
If you are reading a file where line end sequences are 'CR' followed by 'LF' (that is, "\r\n"), and you are sure that each line is ended in this fashion, then you should tell Perl so: just set the special variable $/ to the end-of-line sequence. In your case, that would be:

$/ = "\r\n";

Then, <FILE> will work as expected, as will chomp, and all other functions or operators expecting line endings to match the sequence in $/.

Also noteworthy is the opposite variable, $\, which gets appended after all print operations, so you may save some time and have more readable lines if you use that.

If your problem is that the lines are not recognized, or you're not sure whether they end in "\n" or "\r\n", then just use $/ = "\n"; and read your file. Then, for each line, just s/\r$//; if you care only for readability and don't need to keep "control" characters in the flow.
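
The effect of $/ on reading and on chomp can be tried out without a file on disk. The following is only a sketch: the helper name `read_records` and the sample strings are hypothetical, and it reads from an in-memory filehandle instead of the poster's file. The second call uses a bare "\r" separator, which is the setting that would apply if the hex editor shows only a carriage return at the end of each line.

```perl
use strict;
use warnings;

# Hypothetical helper: read every record from a string, using $sep
# as the input record separator.
sub read_records {
    my ( $content, $sep ) = @_;
    # In-memory filehandle, so the sketch needs no file on disk.
    open( my $fh, '<', \$content ) or die "Cannot open in-memory file: $!";
    local $/ = $sep;    # the change is limited to this sub
    my @records;
    while ( my $line = <$fh> ) {
        chomp $line;    # chomp removes whatever $/ currently holds
        push @records, $line;
    }
    close $fh;
    return @records;
}

# CRLF-terminated sample data.
print ">$_<\n" for read_records( "AA\t12\r\nBB\t34\r\nCC\t56\r\n", "\r\n" );

# The same data with bare CR line endings.
print ">$_<\n" for read_records( "AA\t12\rBB\t34\rCC\t56\r", "\r" );
```

Using `local` keeps the new separator from leaking into other code that still expects "\n".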
Re: Dealing with non-ascii characters when reading file. by Anonymous Monk on Sep 25, 2014 at 09:02 UTC
Try providing sample data with ddumper:

use Path::Tiny qw/ path /;
dd( path( "infile" )->lines_raw( { count => 3 } ) );
sub dd {
    use Data::Dumper;
    print Data::Dumper->new([@_])->Sortkeys(1)->Indent(1)->Useqq(1)->Dump . "\n";
}