Decode character encodings, warn on user mistake

This program converts a text file from one character encoding to another, but tries to detect whether the user has accidentally specified the wrong input encoding, such as iso-8859-2 instead of utf-8.

This has a real-world motivation. I'm writing a program that works mostly by copying its input to the output while annotating some parts of it. The program must accept input in multiple character encodings (iso-8859-2, utf-8, cp1250), and similarly must be able to emit the output in multiple encodings. The user can choose the input and output encodings with command-line switches. If, however, the user inputs a utf-8 file but specifies 8859-2 as both the input and the output encoding, the program may appear to work and output a utf-8 file, because decoding and re-encoding with the same byte encoding gives back the original bytes. The program won't be able to understand those records that have non-ascii characters, but this might not be obvious from the output. For this reason, I added some code to detect this kind of error. The program here is a standalone version containing only the relevant code.
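To make the failure mode concrete, here is a minimal sketch (not part of the original program) that decodes utf-8 bytes as iso-8859-2 and encodes them back. The sample word and the variable names are only illustrative assumptions.

```perl
use strict; use warnings;
use Encode qw(decode encode encode_utf8);

# "gy\x{151}zelem" (9 characters including the newline) as utf-8 bytes
my $utf8_bytes = encode_utf8("gy\x{151}zelem\n");

# Wrongly treat the bytes as iso-8859-2, then write them out as iso-8859-2
my $chars = decode("iso-8859-2", $utf8_bytes);
my $out   = encode("iso-8859-2", $chars);

# The output looks fine because the bytes survive the round trip,
# but internally the two-byte sequence became two separate characters.
print "output bytes identical to input: ", ($out eq $utf8_bytes ? "yes" : "no"), "\n";
print "characters seen by the program: ", length($chars), "\n";
```

The output file is byte-for-byte the same as the input, so nothing looks wrong from the outside, while any processing of the decoded text sees one character too many.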

I don't try to detect whether the input is in a different byte encoding than the one the user specifies (such as cp1250 instead of 8859-2), because that would be hard to do. I also don't try to detect utf-16 versus other encodings, because in that case the error will be obvious, and because we are unlikely to use those encodings anyway.
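As a rough illustration of why two byte encodings are hard to tell apart, the following sketch compares how each high byte decodes under cp1250 and iso-8859-2. The helper name `byte_differs` is hypothetical and is not used by the script below.

```perl
use strict; use warnings;
use Encode qw(decode);

# Hypothetical helper: true if a single byte decodes to a different
# character under cp1250 than under iso-8859-2.
sub byte_differs {
    my ($byte) = @_;
    my $octet = chr $byte;
    return decode("cp1250", $octet) ne decode("iso-8859-2", $octet) ? 1 : 0;
}

my @differing = grep { byte_differs($_) } 0x80 .. 0xff;
printf "%d of 128 high bytes decode differently\n", scalar @differing;
printf "0xE1 differs: %d, 0xA9 differs: %d\n", byte_differs(0xE1), byte_differs(0xA9);
```

Each byte on its own means something in both encodings, so a check like this cannot say which encoding a file is in. That would take knowledge of which characters are plausible in the text.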

# ciconv.pl - example script on detecting encoding set wrong
#
# This is a simple iconv replacement that tries to detect when the user
# accidentally sets the input encoding wrong, ie. sets a byte encoding
# when the input is actually utf-8 or the other way.  This can't be done
# completely reliably, but this may still be useful to catch simple errors.
# This should work in perl 5.8 or newer.
#
# Usage: perl ciconv.pl -f inputencoding -t outputencoding files
#

use warnings; use strict;
use Encode;
use Getopt::Long;

our $default_input_encoding = "iso-8859-1";
our $default_output_encoding = "iso-8859-1";

our($INFH, $input_encoding, $input_postdecobj, $input_declayer, $input_isutf8, $input_decwarn);

sub set_input_encoding {
    my($enc) = @_;
    my $obj = Encode::find_encoding($input_encoding = $enc) or die qq(error: unknown input encoding: "$enc");
    if (do { my $ts = "Egyenest oda fog folyamodni\n"; $ts eq $obj->decode($ts) }) {
        # most ascii-based encodings, eg. 8859-*, cp-125*, utf-8, etc
        $input_declayer = undef;
        $input_postdecobj = $obj;
        $input_isutf8 = do { my $ts = "Ali gy\x{151}zelem-\x{fc}nnepe van ma!\x{201d}\n"; $ts eq $obj->decode(encode_utf8($ts)) };
    } else {
        # wide character encodings like utf-16, also ebcdic or other encodings not related to ascii at all
        $input_declayer = ":encoding(" . $obj->name . ")";
        $input_postdecobj = undef;
    }
}

sub getinputline {
    if (!$INFH && !@ARGV) {
        $ARGV = "-";
        open $INFH, "<&=", *STDIN or die "error fduping stdin: $!";
        if ($input_declayer) {
            binmode $INFH, $input_declayer or die "error: cannot set input encoding io layer $input_declayer to stdin fdup";
        }
    }
    while (!$INFH || eof($INFH)) {
        if (@ARGV) {
            open $INFH, "<", ($ARGV = shift @ARGV) or die "error opening input file \"$ARGV\": $!";
            if ($input_declayer) {
                binmode $INFH, $input_declayer or die "error: cannot set input encoding io layer $input_declayer";
            }
            $input_decwarn = 0;
        } else {
            return undef;
        }
    }
    my $l = <$INFH>;
    if (!defined($l)) {
        warn "read error reading input file $ARGV: $!";
        return undef;
    }
    if ($input_postdecobj) {
        if ($input_isutf8) {
            $l =~ /(?:[\x00-\x7f]|\A)[\x80-\xff][\x00-\x7f]/ && !$input_decwarn++ and
                warn "warning: input is not utf-8 encoded near $ARGV:$., currently decoding using character encoding $input_encoding, make sure you set the correct input encoding with the -f switch";
        } else {
            $l =~ /[\xc2-\xdf\xe2][\x80-\xbf]/ && !$input_decwarn++ and
                warn "warning: input seems to be utf-8 encoded near $ARGV:$., currently decoding using character encoding $input_encoding, make sure you set the correct input encoding with the -f switch";
        }
        $l = $input_postdecobj->decode($l, Encode::FB_DEFAULT());
    }
    $l;
}

sub cimain {

    my($inputenc, $outputenc);

    Getopt::Long::Configure qw"gnu_getopt prefix_pattern=(--|-)";
    GetOptions(
        "inputencoding|input-encoding|from-code|f=s", \$inputenc,
        "outputencoding|output-encoding|to-code|t=s", \$outputenc,
    );

    $inputenc ||= $default_input_encoding;
    set_input_encoding($inputenc);

    $outputenc ||= $default_output_encoding;
    my $oobj = Encode::find_encoding($outputenc) or 
        die qq(error: unknown output encoding: "$outputenc");
    my $olayer = ":encoding(" . $oobj->name . ")";
    binmode STDOUT, $olayer or 
        die "error: cannot set output encoding layer $olayer: $!";
    
    # test definedness so that a final line of "0" without a newline is not dropped
    while (defined($_ = getinputline())) {
        print $_;
    }
    
}

cimain();

__END__
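
The two checks made in getinputline can be tried in isolation on raw byte strings. The sketch below copies the two regexes from the script. The helper names `looks_not_utf8` and `looks_like_utf8` and the sample strings are hypothetical and exist only for this illustration.

```perl
use strict; use warnings;

# Used when the user claims utf-8: a lone high byte between two ascii
# bytes (or at the start of the line) cannot occur in valid utf-8.
sub looks_not_utf8 {
    my ($bytes) = @_;
    return $bytes =~ /(?:[\x00-\x7f]|\A)[\x80-\xff][\x00-\x7f]/ ? 1 : 0;
}

# Used when the user claims a byte encoding: a utf-8 lead byte
# (two-byte leads, plus 0xe2) followed by a continuation byte is suspicious.
sub looks_like_utf8 {
    my ($bytes) = @_;
    return $bytes =~ /[\xc2-\xdf\xe2][\x80-\xbf]/ ? 1 : 0;
}

my $latin2 = "gy\xf5zelem\n";        # iso-8859-2 bytes
my $utf8   = "gy\xc5\x91zelem\n";    # the same word as utf-8 bytes

printf "latin2: not_utf8=%d like_utf8=%d\n", looks_not_utf8($latin2), looks_like_utf8($latin2);
printf "utf8:   not_utf8=%d like_utf8=%d\n", looks_not_utf8($utf8), looks_like_utf8($utf8);
```

Both checks are heuristics. A byte-encoded text can contain a pair of bytes that happens to look like a utf-8 sequence, which is why the script only warns and keeps decoding with the encoding the user asked for.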

Re: Decode character encodings, warn on user mistake by Tux (Canon) on Dec 22, 2009 at 07:24 UTC
And if you do not know the original encoding at all, but you can tell a few characters for sure, this might help:

    use strict;
    use warnings;
    use Encode "decode";
    binmode STDOUT, ":utf8";
    my @enc = grep { !m/^mime/i } Encode->encodings (":all");
    my $c = pack "H*", shift;
    foreach my $e (@enc) {
        my $x = eval { decode ($e, $c) };
        !defined $x || $x =~ m/^(?:$|\x{fffd})/ and next;
        printf " %-30s %s\n", $e, $x;
        }

If you know that in your text, `\xD7` is `×` and `\xE4` is `ä`, then:

    $ find_enc d7e4 | grep ×ä
     cp1250                         ×ä
     cp1252                         ×ä
     cp1254                         ×ä
     cp1257                         ×ä
     cp1258                         ×ä
     iso-8859-1                     ×ä
     iso-8859-13                    ×ä
     iso-8859-15                    ×ä
     iso-8859-2                     ×ä
     iso-8859-3                     ×ä
     iso-8859-4                     ×ä
     iso-8859-9                     ×ä
     UTF-7                          ×ä
    $

The more you know, the smaller your result set might be. If you also know that `\xF0` is `đ` or `š`, you're down to 4 or even 2:

    $ find_enc d7e4f0 | grep ×äđ
     cp1250                         ×äđ
     cp1258                         ×äđ
     iso-8859-2                     ×äđ
     iso-8859-4                     ×äđ
    $ find_enc d7e4f0 | grep ×äš
     cp1257                         ×äš
     iso-8859-13                    ×äš
    $

Enjoy, Have FUN! H.Merijn
Re: Decode character encodings, warn on user mistake by merlyn (Sage) on Dec 22, 2009 at 16:24 UTC
Did you find Encode::Guess unsuitable? Or is this doing a different task?

-- Randal L. Schwartz, Perl hacker
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
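
For readers who have not used it, a minimal sketch of what calling Encode::Guess looks like, assuming the whole input is already in memory as bytes. The sample string is an assumption made for illustration. `guess_encoding` returns an encoding object on success and a plain error string otherwise.

```perl
use strict; use warnings;
use Encode::Guess;   # exports guess_encoding

# iso-8859-2 bytes that are not valid utf-8
my $latin2 = "gy\xf5zelem \xfcnnepe\n";

# ascii and utf8 are suspects by default; iso-8859-2 is added here
my $guess = guess_encoding($latin2, "iso-8859-2");

if (ref $guess) {
    print "guessed: ", $guess->name, "\n";
} else {
    # a string describing the failure, e.g. when several suspects remain
    print "no unique guess: $guess\n";
}
```

When the data is valid under more than one suspect, the guess can stay ambiguous. That is the situation the original post is concerned with, since utf-8 bytes also decode without error as a single-byte encoding.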
Re^2: Decode character encodings, warn on user mistake by ambrus (Abbot) on Dec 22, 2009 at 17:17 UTC
Yes, using Encode::Guess might have worked too, I suppose. That would require restricting myself to reading the whole input at once, but that wouldn't be too bad in this application anyway. However, reading the whole input at once would have simplified my code a great deal too, and in that case I don't think Encode::Guess would have helped much compared to just testing for utf-8 and ascii input by hand. This code I can at least reuse later if I really need to read character-encoded text data one line at a time in some program.
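
A minimal sketch of what "testing for utf-8 and ascii input by hand" could look like when the whole input is held in memory. The helper name `classify_bytes` and its three return values are assumptions for illustration, not code from the program discussed above.

```perl
use strict; use warnings;
use Encode ();

# Hypothetical helper: classify a complete input given as raw bytes.
sub classify_bytes {
    my ($bytes) = @_;
    # no high bytes at all: plain ascii, any ascii-based encoding will do
    return "ascii" if $bytes !~ /[\x80-\xff]/;
    # strict utf-8 decode; FB_CROAK dies on malformed input,
    # LEAVE_SRC keeps $bytes unmodified
    my $ok = eval {
        Encode::decode("UTF-8", $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? "utf-8" : "other";
}

print classify_bytes("Egyenest oda fog folyamodni\n"), "\n";
print classify_bytes("gy\xc5\x91zelem\n"), "\n";
print classify_bytes("gy\xf5zelem\n"), "\n";
```

The result would then be compared with the encoding the user asked for, and a warning printed on a mismatch, much as the line-by-line version does.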