Re^2: Problem with Encode::Guess

OK, I've written a wrapper for Encode::Guess to handle the ambiguous encodings and unknown encodings errors, and to distinguish between cp437 and Latin-1. It seems to be pretty accurate, and decoder objects it produces handle those encodings plus utf8 just fine. However, my main script, using the decoder object returned by that function, is failing when it tries to use a UTF-16 decoder to decode a line of UTF-16 text; it dies with an error like this:

UTF-16:Unrecognised BOM 6e at /home/jim/Documents/scripts/textual-slid
+eshow.pl line 549, <FILE> line 10.
[download]

Is it expecting every line of a UTF-16 file to have a byte-order mark, or for me to decode the entire file in one call to decode()? neither of those seems reasonable.

Here is a stripped-down version of my script that reproduces the error without a lot of irrelevant code. The full script this is taken from is at http://jimhenry.conlang.org/scripts/textual-slideshow.zip.

#! /usr/bin/perl -w

# stripped-down version of textual-slideshow.pl
# for reproducing bad byte-order-mark defect

use Term::ReadKey;
use strict;
use warnings;
use Encode;
use Encode::Guess qw( iso-8859-1 );

#use open qw( :std :encoding(UTF-8) );

# proportion between file size and probability that a given
# paragraph will be saved for later display.  the smaller this number,
# the fewer paragraphs from each file are saved.
use constant SAVE_RATE => 2000;

my $debug = 0;
my $max_paras = 100000;
my $err_count = 0;
my $max_errs = 3;
my @paras;
my @indices;
my $n_paras;

srand;

while ( my $filename = shift ) {
    &add_paras($filename);
}

#==================================================


sub add_paras {
    my $legalese = 0;
    my $filename = shift;
    if ( not open FILE, $filename ) {
    print STDERR qq(can't open "$filename" for reading\n);
    if ( ++$err_count >= $max_errs ) {
        die "too many file open errors\n";
    }
    return;
    }

    binmode( FILE, ':raw' );

    my ( $dev, $inode, $mode, $nlink, $uid, $gid, $rdev, 
     $size,
     $atime, $mtime, $ctime, $blocksize, $blocks )
    = stat FILE;
    
    if ( $size == 0 ) {
    return;
    }

    print "getting paragraphs from $filename\n"    if $debug;

    # we can't know how to decode the file until we examine it.
    # slurp the file first:

    my @lines = <FILE>;
    close FILE;
    my $decoder = &figure_out_encoding( \@lines );
    if ( not $decoder ) {
    # figure_out_encoding should already have logged err msg if necess
+ary
    return;
    }

    # we want the number of pararaphs taken from a file to be roughly
    #  the same, whether it's a short one or a long one.
    my $prange = $size / SAVE_RATE;

    my $paras_added = 0;
    my $saved_indices = 0;
    my $in_para = 0;
    my $saving = 0;
    my $current_para = "";

    for ( my $i = 0; $i < scalar @lines; $i++ ) {
    my $line = $decoder->decode( $lines[$i] );
    my $linenum = $i+1;
    $line =~ s/\r//g;        # leave newlines alone but strip CR
    if ( $line !~ m/^\s*$/ ) {    # non-blank
        if ( $line =~ m/(the\s+)*Project\s+Gutenberg.*e-?(text|book)/i
        or $line =~ m/START.*SMALL\s+PRINT/i 
        or $line =~ m/END OF.*PROJECT\s+GUTENBERG E-?(TEXT|BOOK)/
        or $line =~ m/START.*FULL LICENSE/) {
        $legalese = 1;
        print "start skipping legalese on line $linenum \n"   if $debu
+g >= 3;
        next;
        }

        if ( $in_para == 0 ) {
        if ( (not $legalese) and (rand $prange <= 1) ) {
            push @indices, "lines $linenum+ of $filename\n\n";
            $saved_indices++;
            $saving = 1;
        }
        }
        $in_para = 1;

        if ( $line =~  m/\END.*SMALL\s+PRINT/
        or $line =~ m/START OF THE PROJECT\s+GUTENBERG.*E-?(BOOK|TEXT)
+/
        or $line =~ m/END.*FULL LICENSE/) {
        print "stop skipping legalese on line $. \n"   if $debug >= 3;
        $legalese = 0;
        }
    } else {        # blank line
        if ( $saving ) {
        push @paras, $current_para . "\n\n";
        $paras_added++;
        }
        $current_para = "";
        $in_para = 0;
        $saving = 0;
    }

    my $a = 1;
    $a++;

    if ( $saving ) {
        $a--;
        $current_para .= $line;
    }
    }

    if ( $saving ) {
    $current_para .= "\n\n";
    push @paras, $current_para;
    $current_para = "";
    $paras_added++;
    }
}

=pod

Use Encode::Guess plus some heuristics to figure out the
probable encoding of a file, based on an array of lines
from the file (passed by reference because it could be
very big and we don't need to modify it).  Return a decoder 
object.

=cut

sub figure_out_encoding {
    my $arr_ref = shift;
    if ( not ref $arr_ref eq "ARRAY" or scalar @$arr_ref == 0 ) {
    die "arg to  figure_out_encoding() should be ref to non-empty arra
+y";
    }

    # do an Encode::Guess on the join of all the lines
    # then if its results are ambiguous, do further heuristics

    my $content = join '', @$arr_ref;
    my $cp437_letters = 0;
    my $latin1_letters = 0;
    my $decoder = eval { Encode::Guess->guess( $content ); };

    if ( $@ ) {
    my $eval_err = chomp $@;
    print qq($_: Encode::Guess->guess() failed horribly: "$eval_err"\n
+);
    return undef;
    }

    if ( ref $decoder ) {
    print "appears to be " . $decoder->name . "\n"    if $debug;
    if ( $decoder->name eq "iso-8859-1" ) {
        my $lines_to_check = ( scalar @$arr_ref > 1000 ) ? 1000 :  sca
+lar @$arr_ref;
        for ( my $i = 0;  $i < $lines_to_check;  $i++ ) {
        while ( $$arr_ref[$i] =~ m/[\x80-\xA5]/g ) {
            $cp437_letters++;
        }
        while ( $$arr_ref[$i] =~ m/[\xC0-\xFF]/g ) {
            $latin1_letters++;
        }
        }
        if ( $cp437_letters > $latin1_letters ) {
        $decoder = find_encoding( "cp437" );
            if ( $debug ) {
            if ( ref $decoder ) {
                print "more chars in cp437 range ($cp437_letters) than
+ in Latin-1 letters range ($latin1_letters) so prob cp437\n";
            } else {
                print "can't find encoding for 'cp437'\n";
            }
            }


        return ( ref $decoder ? $decoder : undef );
        } else {
        print "more chars in Latin-1 letters range ($latin1_letters)  
+than in cp437 range ($cp437_letters) so prob Latin-1\n"    if $debug;
        return $decoder;
        }
    }
    return $decoder;
    } else {
    print qq(bad decoder returned by Encode::Guess->guess() );
    print ( ( defined $decoder ) ? qq("$decoder") : "(undefined)" );
    print qq(\n);
    
    if ( defined $decoder && $decoder =~ m/utf8/ ) {
        my $lines_to_check = ( scalar @$arr_ref > 1000 ) ? 1000 :  sca
+lar @$arr_ref;
        for ( my $i = 0;  $i < $lines_to_check;  $i++ ) {
        if ( $$arr_ref[$i] =~ m/[\x00-\x7F][\x80-\xFF][\x00-\x7F]/
             or $$arr_ref[$i] =~ m/[\x00-\x7F][\x80-\xFF]$/ 
             or $$arr_ref[$i] =~ m/^[\x80-\xFF][\x00-\x7F]$/ ) 
        {
            # this can't be utf8 if it has a high-bit char by its lone
+some
            # with low-bit chars or begin/end of line on either side
            $decoder = find_encoding( "iso-8859-1" );
            if ( $debug ) {
            if ( ref $decoder ) {
                print "looks like iso-8859-1, using that encoding\n";
            } else {
                print "can't find encoding for 'iso-8859-1'\n";
            }
            }
            return ( ref $decoder ? $decoder : undef );
        }
        
        } # end for each line of array
        $decoder = find_encoding( "utf8" );
        if ( $debug ) {
        if ( ref $decoder ) {
            print "looks like utf8, using that encoding\n";
        } else {
            print "can't find encoding for 'utf8'\n";
        }
        }
        return ( ref $decoder ? $decoder : undef );
        
    } else {
        print "bad decoder and not possible match for utf8\n"    if $d
+ebug;
        return undef;
    }# end if possible utf8
    } # end if/else we have no good decoder ref yet
}
[download]

Comment on Re^2: Problem with Encode::Guess Select or Download Code

Replies are listed 'Best First'.
Re^3: Problem with Encode::Guess by ikegami (Patriarch) on Nov 11, 2010 at 02:54 UTC
Specifying "UTF-16" doesn't indicate the byte order, so a BOM is required. You need to specify "UTF-16le" or "UTF-16be" if there's no BOM to indicate the byte order.	[reply]
Re^4: Problem with Encode::Guess by jimhenry (Acolyte) on Nov 11, 2010 at 09:50 UTC
So apparently the decoder object created by Encode::Guess remembers that it's UTF-16 but forgets immediately whether it was big-endian or little-endian? That seems broken. But I guess I can add more code to handle that. What would you suggest to be the cleanest way? If Encode::Guess->guess returns a UTF-16 decoder object, should I throw it away, look at the first four bytes of the file, and create a new decoder using find_encoding( "UTF16BE" ) or find_encoding( "UTF16LE" )? Or store information about the byte order somewhere else, and use something other than $decoder->decode() to decode strings read from a UTF-16 file?	[reply]
Re^5: Problem with Encode::Guess by ikegami (Patriarch) on Nov 11, 2010 at 18:17 UTC
So apparently the decoder object created by Encode::Guess remembers that it's UTF-16 but forgets immediately whether it was big-endian or little-endian? That seems broken. Looking at the code, it "forgets" if it finds a BOM, it remembers otherwise. That's reasonable since it would cause `$enc->encode` to add the BOM if and only if there was one present in the sample. You should probably incorporate into your code the couple of lines of E::G you actually do use. Then , you can change this behaviour if you don't like it.	[reply] [d/l]
Re^6: Problem with Encode::Guess by jimhenry (Acolyte) on Nov 14, 2010 at 01:18 UTC


Clear questions and runnable code get the best and fastest answer
	PerlMonks