PerlMonks  

I have a script which needs to be able to deal with a wide variety of text files -- potentially, any text file on the filesystem. It's not practical to require the user to specify the encoding for every file. A first draft of it handles ASCII and UTF-8 correctly, but mangles the high-bit characters in Latin-1 or UTF-16 text files.

I've been studying how to handle this, and it seems I need to guess the encoding of each input file, explicitly decode it to Perl's internal format for processing, then explicitly encode it to utf8 when outputting. I wrote a separate, smaller test script to try out Encode::Guess. From its man page and various older threads here, it looks as though Encode::Guess ought to be reliable at distinguishing ASCII vs. Latin-1 vs. utf-8, but in fact it's not even doing that for me:

    #! /usr/bin/perl -w
    use strict;
    use Encode;

    # cp437 should be the old IBM PC character set, which is used in a
    # lot of Gutenberg etexts from the 1990s
    #use Encode::Guess qw( iso-8859-1 cp437 );
    use Encode::Guess qw( iso-8859-1 );
    #use Encode::Guess;

    foreach my $file ( @ARGV ) {
        # Read raw bytes; guess() expects undecoded octets.
        open my $fh, '<:raw', $file
            or die qq(can't open "$file" for reading: $!\n);
        my $content = do { local $/; <$fh> };    # slurp the whole file
        close $fh;

        my $decoder = eval { Encode::Guess->guess( $content ); };
        if ( $@ ) {
            # chomp returns a count, so chomp a copy of the message
            chomp( my $eval_err = $@ );
            print qq($file: Encode::Guess->guess() failed horribly: "$eval_err"\n);
            next;
        }

        if ( ref $decoder ) {
            print "$file: appears to be " . $decoder->name . "\n";
        }
        else {
            # On failure guess() returns an error string, not an object.
            print "$file: bad decoder returned by Encode::Guess->guess() ";
            print defined $decoder ? $decoder : "(undefined)";
            print "\n";
        }
    }

On a directory containing a mix of ASCII, Latin-1, utf-8, and UTF-16 files, this prints accurate and confident identification of everything except for the utf-8 files, which get messages like this:

/home/jim/Documents/homepage/gzb/drafts/universal-violations.txt: bad decoder returned by Encode::Guess->guess() iso-8859-1 or utf8

I suppose I could plug code like that into my main script, and assume that any file whose contents get a bad decoder is utf8, but that seems unwise. Or I could take any file that gets a "bad decoder" message and see if it has a pervasive pattern of always having two high-bit characters in a row, but that might be wasteful if there's an existing module that could do that. Does anyone have a better idea about distinguishing utf8 from Latin-1 or other 8-bit encodings?
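
One possible approach to the "two high-bit characters in a row" idea, sketched here under assumptions rather than offered as a definitive answer: let Encode attempt a strict UTF-8 decode and treat failure as "some other 8-bit encoding". The helper name `classify_octets` and its return labels are hypothetical, and the sketch assumes UTF-16 files have already been identified by other means (for example by Encode::Guess spotting a BOM).

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: classify raw bytes as 'ascii', 'utf-8', or
# 'unknown-8bit' (Latin-1, cp437, etc. cannot be told apart here).
# Assumes the caller has already ruled out UTF-16.
sub classify_octets {
    my ($octets) = @_;

    # No high-bit bytes at all: plain ASCII.
    return 'ascii' unless $octets =~ /[\x80-\xFF]/;

    # Strict decode: 'UTF-8' (with the hyphen) rejects malformed
    # sequences, and FB_CROAK makes decode() die on the first one.
    # LEAVE_SRC stops decode() from modifying $octets in place.
    my $ok = eval {
        decode( 'UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC );
        1;
    };
    return $ok ? 'utf-8' : 'unknown-8bit';
}
```

A Latin-1 file can, in principle, happen to be valid UTF-8 (every high-bit byte would have to fall into a well-formed multi-byte sequence), but in ordinary prose that is unlikely, so this check probably works as a tie-breaker when guess() answers "iso-8859-1 or utf8".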

In the test script above, there's a line commented out giving cp437 as one of the defaults to initialize Encode::Guess; if I have that line in, every file with high-bit characters gets a bad decoder. This is not surprising, given the man page's warning that Encode::Guess is bad at distinguishing different 8-bit encodings from each other. I have a lot of cp437 etexts lying around, but I'm pretty sure I can write an ad-hoc routine to distinguish them from the Latin-1 text files -- in theory both code pages use all the characters from 0x80 to 0xFF, but in practice, only accented Latin letters in the 0x80 to 0xA5 range are common in cp437 text files and only characters in the 0xC0 to 0xFF range are common in Latin-1 files.
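
The ad-hoc routine described above might look something like the following minimal sketch. The helper name `guess_8bit` is hypothetical, and the simple majority vote is an assumption of this sketch, not something tested against real etexts. It expects bytes already known to be in some 8-bit encoding (neither ASCII nor UTF-8).

```perl
use strict;
use warnings;

# Hypothetical helper: choose between cp437 and iso-8859-1 by counting
# bytes in the two ranges named in the post. Returns undef (empty list
# in list context) when the counts tie or no such bytes are present.
sub guess_8bit {
    my ($octets) = @_;

    # The "= () =" idiom counts the matches of a /g regex.
    my $cp437_like  = () = $octets =~ /[\x80-\xA5]/g;
    my $latin1_like = () = $octets =~ /[\xC0-\xFF]/g;

    return 'cp437'      if $cp437_like  > $latin1_like;
    return 'iso-8859-1' if $latin1_like > $cp437_like;
    return;    # cannot decide
}
```

The returned name could then be passed to Encode's decode(), for example `decode( guess_8bit($octets), $octets )`, once the undecided case has been handled. One caveat: cp437 box-drawing characters occupy 0xB0 to 0xDF, so a cp437 file with many drawn tables could skew the count toward Latin-1.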


In reply to Problem with Encode::Guess by jimhenry
