This program converts a text file from one character encoding to another, and tries to detect when the user accidentally specifies the wrong input encoding, such as iso-8859-2 instead of utf-8.

This has a real-world motivation. I'm writing a program that works by mostly copying its input to the output but annotates some parts of it. The program must accept input in multiple character encodings (iso-8859-2, utf-8, cp1250), and similarly must be able to emit the output in multiple encodings. The user can choose the input and output encodings with command-line switches. If, however, the user feeds it a utf-8 file but specifies 8859-2 as both the input and output encoding, the program may appear to work and output a utf-8 file, because every byte is decoded and then re-encoded to the same byte. The program won't be able to understand those records that have non-ascii characters, but this might not be obvious from the output. For this reason, I added some code to detect this kind of error. The program here is a standalone version containing only the relevant code.

I don't try to detect if the input is of a different byte encoding than what the user specifies (such as cp1250 instead of 8859-2) because that'd be hard to do. I also don't try to detect utf-16 versus other encodings, because in that case the error will be obvious, and because we won't likely use those encodings anyway.
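
To illustrate why two byte encodings are hard to tell apart, here is a small sketch using only the core Encode module. The sample bytes are my own choice for illustration: the Hungarian letters decode identically in iso-8859-2 and cp1250, and even a byte that differs between them is still valid in both, so no decoding error ever shows up.

```perl
use strict; use warnings;
use Encode qw(decode);

# Raw bytes as they would appear in a file: 0xF5 is o with double acute
# (U+0151) in both iso-8859-2 and cp1250.
my $shared = "gy\xf5zelem";
my $as_latin2 = decode("iso-8859-2", $shared);
my $as_cp1250 = decode("cp1250", $shared);
print "shared bytes decode identically\n" if $as_latin2 eq $as_cp1250;

# Byte 0xB9 is valid in both encodings but means different letters:
# U+0161 (s with caron) in iso-8859-2, U+0105 (a with ogonek) in cp1250.
my $ambiguous = "\xb9";
printf "iso-8859-2: U+%04X, cp1250: U+%04X\n",
    ord(decode("iso-8859-2", $ambiguous)),
    ord(decode("cp1250", $ambiguous));
```

Neither decode call fails, so a wrong choice between these two can only be noticed by looking at the resulting text, not from the byte structure.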

# ciconv.pl - example script on detecting encoding set wrong
#
# This is a simple iconv replacement that tries to detect when the user
# accidentally sets the input encoding wrong, ie. sets a byte encoding
# when the input is actually utf-8 or the other way.  This can't be done
# completely reliably, but this may still be useful to catch simple errors.
# This should work in perl 5.8 or newer.
#
# Usage: perl ciconv.pl -f inputencoding -t outputencoding files
#

use warnings; use strict;
use Encode;
use Getopt::Long;

our $default_input_encoding = "iso-8859-1";
our $default_output_encoding = "iso-8859-1";

our($INFH, $input_encoding, $input_postdecobj, $input_declayer, $input_isutf8, $input_decwarn);

sub set_input_encoding {
    my($enc) = @_;
    my $obj = Encode::find_encoding($input_encoding = $enc) or die qq(error: unknown input encoding: "$enc");
    if (do { my $ts = "Egyenest oda fog folyamodni\n"; $ts eq $obj->decode($ts) }) {
        # most ascii-based encodings, eg. 8859-*, cp-125*, utf-8, etc
        $input_declayer = undef;
        $input_postdecobj = $obj;
        $input_isutf8 = do { my $ts = "Ali gy\x{151}zelem-\x{fc}nnepe van ma!\x{201d}\n"; $ts eq $obj->decode(encode_utf8($ts)) };
    } else {
        # wide character encodings like utf-16, also ebcdic or other encodings not related to ascii at all
        $input_declayer = ":encoding(" . $obj->name . ")";
        $input_postdecobj = undef;
    }
}

sub getinputline {
    if (!$INFH && !@ARGV) {
        $ARGV = "-";
        open $INFH, "<&=", *STDIN or die "error fduping stdin: $!";
        if ($input_declayer) {
            binmode $INFH, $input_declayer or die "error: cannot set input encoding io layer $input_declayer to stdin fdup";
        }
    }
    while (!$INFH || eof($INFH)) {
        if (@ARGV) {
            open $INFH, "<", ($ARGV = shift @ARGV) or die "error opening input file \"$ARGV\": $!";
            if ($input_declayer) {
                binmode $INFH, $input_declayer or die "error: cannot set input encoding io layer $input_declayer";
            }
            $input_decwarn = 0;
        } else {
            return undef;
        }
    }
    my $l = <$INFH>;
    if (!defined($l)) {
        warn "read error reading input file $ARGV: $!";
        return undef;
    }
    if ($input_postdecobj) {
        if ($input_isutf8) {
            # a lone high byte between ascii bytes cannot occur in valid utf-8
            $l =~ /(?:[\x00-\x7f]|\A)[\x80-\xff][\x00-\x7f]/ && !$input_decwarn++ and
                warn "warning: input is not utf-8 encoded near $ARGV:$., currently decoding using character encoding $input_encoding, make sure you set the correct input encoding with the -f switch";
        } else {
            # a utf-8 lead byte followed by a continuation byte is unlikely in a byte encoding
            $l =~ /[\xc2-\xdf\xe2][\x80-\xbf]/ && !$input_decwarn++ and
                warn "warning: input seems to be utf-8 encoded near $ARGV:$., currently decoding using character encoding $input_encoding, make sure you set the correct input encoding with the -f switch";
        }
        $l = $input_postdecobj->decode($l, Encode::FB_DEFAULT());
    }
    $l;
}

sub cimain {

    my($inputenc, $outputenc);

    Getopt::Long::Configure qw"gnu_getopt prefix_pattern=(--|-)";
    GetOptions(
        "inputencoding|input-encoding|from-code|f=s", \$inputenc,
        "outputencoding|output-encoding|to-code|t=s", \$outputenc,
    );

    $inputenc ||= $default_input_encoding;
    set_input_encoding($inputenc);

    $outputenc ||= $default_output_encoding;
    my $oobj = Encode::find_encoding($outputenc) or 
        die qq(error: unknown output encoding: "$outputenc");
    my $olayer = ":encoding(" . $oobj->name . ")";
    binmode STDOUT, $olayer or 
        die "error: cannot set output encoding layer $olayer: $!";
    
    while (defined($_ = getinputline())) {
        print $_;
    }
    
}

cimain();

__END__
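
The two checks in getinputline() can be tried in isolation. This is only a sketch: the helper names looks_not_utf8 and looks_like_utf8 are hypothetical wrappers I made up around the same two regexes, and the sample text is an arbitrary Hungarian phrase.

```perl
use strict; use warnings;
use Encode qw(encode);

# Both helpers (hypothetical names) take one line of raw, undecoded bytes.

# A lone high byte between ascii bytes cannot be part of valid utf-8.
sub looks_not_utf8 {
    my ($bytes) = @_;
    return $bytes =~ /(?:[\x00-\x7f]|\A)[\x80-\xff][\x00-\x7f]/ ? 1 : 0;
}

# A two-byte lead byte (or 0xE2) followed by a continuation byte
# suggests utf-8 rather than a single-byte encoding.
sub looks_like_utf8 {
    my ($bytes) = @_;
    return $bytes =~ /[\xc2-\xdf\xe2][\x80-\xbf]/ ? 1 : 0;
}

my $text   = "gy\x{151}zelem-\x{fc}nnepe\n";
my $latin2 = encode("iso-8859-2", $text);
my $utf8   = encode("UTF-8", $text);

printf "iso-8859-2 bytes: not_utf8=%d like_utf8=%d\n",
    looks_not_utf8($latin2), looks_like_utf8($latin2);
printf "utf-8 bytes:      not_utf8=%d like_utf8=%d\n",
    looks_not_utf8($utf8), looks_like_utf8($utf8);
```

For this sample the iso-8859-2 bytes trip only the first check and the utf-8 bytes trip only the second. Other inputs can fool either check, which is why the script only warns instead of dying.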
In reply to Decode character encodings, warn on user mistake by ambrus
