http://qs1969.pair.com?node_id=870638

jimhenry has asked for the wisdom of the Perl Monks concerning the following question:

I have a script which needs to be able to deal with a wide variety of text files -- potentially, any text file on the filesystem. It's not practical to require the user to specify the encoding for every file. A first draft of it handles ASCII and UTF-8 correctly, but mangles the high-bit characters in Latin-1 or UTF-16 text files.

I've been studying on how to handle this, and it seems I need to guess the encoding of various input files, explicitly decode them to Perl's internal format for processing, then explicitly encode them to utf8 when outputting. I wrote a separate, smaller test script to try out Encode::Guess. Reading its man page and various older threads here, it looks as though Encode::Guess ought to be reliable at distinguishing ASCII vs Latin-1 vs. utf-8, but in fact, it's not even doing that for me:

#! /usr/bin/perl -w use strict; use Encode; # cp437 should be the old IBM PC character set, which is used in a # lot of Gutenberg etexts from the 1990s #use Encode::Guess qw( iso-8859-1 cp437 ); use Encode::Guess qw( iso-8859-1 ); #use Encode::Guess; foreach ( @ARGV ) { open FH, $_ or die qq(can't open "$_" for reading\n); my @lines = <FH>; close FH; my $content = join '\n', @lines; my $decoder = eval { Encode::Guess->guess( $content ); }; if ( $@ ) { my $eval_err = chomp $@; print qq($_: Encode::Guess->guess() failed horribly: "$eval_err"\n +); next; } if ( ref $decoder ) { print "$_: appears to be " . $decoder->name . "\n"; } else { print "$_: bad decoder returned by Encode::Guess->guess() "; print ( ( defined $decoder ) ? $decoder : "(undefined)" ); print "\n"; } }

On a directory containing a mix of ASCII, Latin-1, utf-8, and UTF-16 files, this prints accurate and confident identification of everything except for the utf-8 files, which get messages like this:

/home/jim/Documents/homepage/gzb/drafts/universal-violations.txt: bad +decoder returned by Encode::Guess->guess() iso-8859-1 or utf8
I suppose I could plug code like that into my main script, and assume that any file whose contents get a bad decoder is utf8, but that seems unwise. Or I could take any file that gets a "bad decoder" message and see if it has a pervasive pattern of always having two high-bit characters in a row, but that might be wasteful if there's an existing module that could do that. Does anyone have a better idea about distinguishing utf8 from Latin-1 or other 8-bit encodings?

In the script above, there's a line commented out giving cp437 as one of the defaults to initialize Encode::Guess; if I have that line in, every file with high-bit characters gets a bad decoder. This is not surprising, given the man page's warning that Encode::Guess is bad at distinguishing different 8-bit encodings from each other. I have a lot of cp437 etexts lying around, but I'm pretty sure I can write an ad-hoc routine to distinguish them from the Latin-1 text files -- in theory both code pages use all the characters from 0x80 to 0xFF, but in practice, only accented Latin letters characters in the 80 to A5 range are common in cp437 text files and only characters in the C0 to FF range are common in Latin-1 files.

Replies are listed 'Best First'.
Re: Problem with Encode::Guess
by ikegami (Patriarch) on Nov 10, 2010 at 18:59 UTC

    Newer versions should return

    Encodings too ambiguous: iso-8859-1 or utf8

    instead of

    iso-8859-1 or utf8

    Keep in mind that valid UTF-8 is also valid iso-latin-1. It favours UTF-8 if the document starts with a BOM encoded using UTF-8, Otherwise, I think valid UTF-8 will be considered possible iso-latin-1.

    but in practice, only accented Latin letters characters in the 80 to A5 range are common in cp437 text files and only characters in the C0 to FF range are common in Latin-1 files.

    Encode::Guess just isn't that fuzzy. It's actually very simplistic. It does not appear to be suitable for your task.

      OK, I've written a wrapper for Encode::Guess to handle the ambiguous encodings and unknown encodings errors, and to distinguish between cp437 and Latin-1. It seems to be pretty accurate, and decoder objects it produces handle those encodings plus utf8 just fine. However, my main script, using the decoder object returned by that function, is failing when it tries to use a UTF-16 decoder to decode a line of UTF-16 text; it dies with an error like this:
      UTF-16:Unrecognised BOM 6e at /home/jim/Documents/scripts/textual-slid +eshow.pl line 549, <FILE> line 10.

      Is it expecting every line of a UTF-16 file to have a byte-order mark, or for me to decode the entire file in one call to decode()? neither of those seems reasonable.

      Here is a stripped-down version of my script that reproduces the error without a lot of irrelevant code. The full script this is taken from is at http://jimhenry.conlang.org/scripts/textual-slideshow.zip.

      #! /usr/bin/perl -w # stripped-down version of textual-slideshow.pl # for reproducing bad byte-order-mark defect use Term::ReadKey; use strict; use warnings; use Encode; use Encode::Guess qw( iso-8859-1 ); #use open qw( :std :encoding(UTF-8) ); # proportion between file size and probability that a given # paragraph will be saved for later display. the smaller this number, # the fewer paragraphs from each file are saved. use constant SAVE_RATE => 2000; my $debug = 0; my $max_paras = 100000; my $err_count = 0; my $max_errs = 3; my @paras; my @indices; my $n_paras; srand; while ( my $filename = shift ) { &add_paras($filename); } #================================================== sub add_paras { my $legalese = 0; my $filename = shift; if ( not open FILE, $filename ) { print STDERR qq(can't open "$filename" for reading\n); if ( ++$err_count >= $max_errs ) { die "too many file open errors\n"; } return; } binmode( FILE, ':raw' ); my ( $dev, $inode, $mode, $nlink, $uid, $gid, $rdev, $size, $atime, $mtime, $ctime, $blocksize, $blocks ) = stat FILE; if ( $size == 0 ) { return; } print "getting paragraphs from $filename\n" if $debug; # we can't know how to decode the file until we examine it. # slurp the file first: my @lines = <FILE>; close FILE; my $decoder = &figure_out_encoding( \@lines ); if ( not $decoder ) { # figure_out_encoding should already have logged err msg if necess +ary return; } # we want the number of pararaphs taken from a file to be roughly # the same, whether it's a short one or a long one. my $prange = $size / SAVE_RATE; my $paras_added = 0; my $saved_indices = 0; my $in_para = 0; my $saving = 0; my $current_para = ""; for ( my $i = 0; $i < scalar @lines; $i++ ) { my $line = $decoder->decode( $lines[$i] ); my $linenum = $i+1; $line =~ s/\r//g; # leave newlines alone but strip CR if ( $line !~ m/^\s*$/ ) { # non-blank if ( $line =~ m/(the\s+)*Project\s+Gutenberg.*e-?(text|book)/i or $line =~ m/START.*SMALL\s+PRINT/i or $line =~ m/END OF.*PROJECT\s+GUTENBERG E-?(TEXT|BOOK)/ or $line =~ m/START.*FULL LICENSE/) { $legalese = 1; print "start skipping legalese on line $linenum \n" if $debu +g >= 3; next; } if ( $in_para == 0 ) { if ( (not $legalese) and (rand $prange <= 1) ) { push @indices, "lines $linenum+ of $filename\n\n"; $saved_indices++; $saving = 1; } } $in_para = 1; if ( $line =~ m/\END.*SMALL\s+PRINT/ or $line =~ m/START OF THE PROJECT\s+GUTENBERG.*E-?(BOOK|TEXT) +/ or $line =~ m/END.*FULL LICENSE/) { print "stop skipping legalese on line $. \n" if $debug >= 3; $legalese = 0; } } else { # blank line if ( $saving ) { push @paras, $current_para . "\n\n"; $paras_added++; } $current_para = ""; $in_para = 0; $saving = 0; } my $a = 1; $a++; if ( $saving ) { $a--; $current_para .= $line; } } if ( $saving ) { $current_para .= "\n\n"; push @paras, $current_para; $current_para = ""; $paras_added++; } } =pod Use Encode::Guess plus some heuristics to figure out the probable encoding of a file, based on an array of lines from the file (passed by reference because it could be very big and we don't need to modify it). Return a decoder object. =cut sub figure_out_encoding { my $arr_ref = shift; if ( not ref $arr_ref eq "ARRAY" or scalar @$arr_ref == 0 ) { die "arg to figure_out_encoding() should be ref to non-empty arra +y"; } # do an Encode::Guess on the join of all the lines # then if its results are ambiguous, do further heuristics my $content = join '', @$arr_ref; my $cp437_letters = 0; my $latin1_letters = 0; my $decoder = eval { Encode::Guess->guess( $content ); }; if ( $@ ) { my $eval_err = chomp $@; print qq($_: Encode::Guess->guess() failed horribly: "$eval_err"\n +); return undef; } if ( ref $decoder ) { print "appears to be " . $decoder->name . "\n" if $debug; if ( $decoder->name eq "iso-8859-1" ) { my $lines_to_check = ( scalar @$arr_ref > 1000 ) ? 1000 : sca +lar @$arr_ref; for ( my $i = 0; $i < $lines_to_check; $i++ ) { while ( $$arr_ref[$i] =~ m/[\x80-\xA5]/g ) { $cp437_letters++; } while ( $$arr_ref[$i] =~ m/[\xC0-\xFF]/g ) { $latin1_letters++; } } if ( $cp437_letters > $latin1_letters ) { $decoder = find_encoding( "cp437" ); if ( $debug ) { if ( ref $decoder ) { print "more chars in cp437 range ($cp437_letters) than + in Latin-1 letters range ($latin1_letters) so prob cp437\n"; } else { print "can't find encoding for 'cp437'\n"; } } return ( ref $decoder ? $decoder : undef ); } else { print "more chars in Latin-1 letters range ($latin1_letters) +than in cp437 range ($cp437_letters) so prob Latin-1\n" if $debug; return $decoder; } } return $decoder; } else { print qq(bad decoder returned by Encode::Guess->guess() ); print ( ( defined $decoder ) ? qq("$decoder") : "(undefined)" ); print qq(\n); if ( defined $decoder && $decoder =~ m/utf8/ ) { my $lines_to_check = ( scalar @$arr_ref > 1000 ) ? 1000 : sca +lar @$arr_ref; for ( my $i = 0; $i < $lines_to_check; $i++ ) { if ( $$arr_ref[$i] =~ m/[\x00-\x7F][\x80-\xFF][\x00-\x7F]/ or $$arr_ref[$i] =~ m/[\x00-\x7F][\x80-\xFF]$/ or $$arr_ref[$i] =~ m/^[\x80-\xFF][\x00-\x7F]$/ ) { # this can't be utf8 if it has a high-bit char by its lone +some # with low-bit chars or begin/end of line on either side $decoder = find_encoding( "iso-8859-1" ); if ( $debug ) { if ( ref $decoder ) { print "looks like iso-8859-1, using that encoding\n"; } else { print "can't find encoding for 'iso-8859-1'\n"; } } return ( ref $decoder ? $decoder : undef ); } } # end for each line of array $decoder = find_encoding( "utf8" ); if ( $debug ) { if ( ref $decoder ) { print "looks like utf8, using that encoding\n"; } else { print "can't find encoding for 'utf8'\n"; } } return ( ref $decoder ? $decoder : undef ); } else { print "bad decoder and not possible match for utf8\n" if $d +ebug; return undef; }# end if possible utf8 } # end if/else we have no good decoder ref yet }
        Specifying "UTF-16" doesn't indicate the byte order, so a BOM is required. You need to specify "UTF-16le" or "UTF-16be" if there's no BOM to indicate the byte order.
Re: Problem with Encode::Guess
by Anonymous Monk on Nov 10, 2010 at 18:38 UTC
      What layer string should I pass to binmode in a context like this, where I don't (by definition) know what encoding the file I'm opening uses? Doing this:
      open FH, $_ or die qq(can't open "$_" for reading\n); binmode FH, ':raw'; my @lines = <FH>; close FH;
      has no effect on the program's behavior, and anything more specific than ":raw" would seem to entail a premature assumption about what encoding an unknown file uses.