This program converts a text file from one character encoding to another, and tries to detect when the user accidentally specifies the wrong input encoding, such as iso-8859-2 instead of utf-8.
This has a real-world motivation. I'm writing a program that works mostly by copying its input to the output, but annotates some parts of it. The program must accept input in multiple character encodings (iso-8859-2, utf-8, cp1250), and similarly must be able to emit the output in multiple encodings. The user can choose the input and output encodings with command-line switches. If, however, the user feeds in a utf-8 file but specifies 8859-2 as both the input and the output encoding, the program may appear to work and output a utf-8 file. The program won't be able to understand those records that have non-ascii characters, but this might not be obvious from the output. For this reason, I added some code to detect this kind of error. The program here is a standalone script containing only the relevant code.
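To illustrate why this mistake goes unnoticed, here is a minimal sketch (not part of the script itself) that decodes utf-8 octets with the wrong encoding and then encodes them again with the same wrong encoding. The sample word and the variable names are only chosen for illustration.

```perl
use strict; use warnings;
use Encode qw(decode encode encode_utf8);

# "ünnepe" as utf-8 octets: the u-umlaut becomes the two bytes C3 BC
my $utf8_octets = encode_utf8("\x{fc}nnepe");

# decode with the wrong encoding, as if the user passed -f iso-8859-2
my $wrong = decode("iso-8859-2", $utf8_octets);

# encode again with the same wrong encoding, as if the user passed -t iso-8859-2
my $out = encode("iso-8859-2", $wrong);

printf "characters seen by the program: %d (the word has 6 letters)\n", length $wrong;
printf "output octets identical to input: %s\n", $out eq $utf8_octets ? "yes" : "no";
```

The output octets come out unchanged, so the file still looks like good utf-8, while inside the program the word had seven characters instead of six because the u-umlaut was seen as two separate letters.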
I don't try to detect whether the input is in a different byte encoding than the one the user specifies (such as cp1250 instead of 8859-2), because that'd be hard to do. I also don't try to tell utf-16 apart from other encodings, because in that case the error will be obvious, and because we're unlikely to use those encodings anyway.
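A small sketch of the difference between the two cases follows. It assumes a Hungarian sample line; the helper names `lone_high_byte` and `utf8_pair` are made up for this illustration, and they wrap the two byte patterns that `getinputline` in the script uses.

```perl
use strict; use warnings;
use Encode qw(decode);

# Hungarian sample as octets: o-double-acute is F5 and u-umlaut is FC
# in both iso-8859-2 and cp1250, so the octets alone don't tell which it is.
my $bytes = "gy\xf5zelem-\xfcnnepe\n";
my $same_text = decode("iso-8859-2", $bytes) eq decode("cp1250", $bytes) ? 1 : 0;

# The same line encoded as utf-8 octets.
my $utf8 = "gy\xc5\x91zelem-\xc3\xbcnnepe\n";

# A single high byte between ascii bytes cannot be valid utf-8.
sub lone_high_byte {
    my($l) = @_;
    $l =~ /(?:[\x00-\x7f]|\A)[\x80-\xff][\x00-\x7f]/ ? 1 : 0;
}

# A utf-8 lead byte followed by a continuation byte is rare in byte encoded text.
sub utf8_pair {
    my($l) = @_;
    $l =~ /[\xc2-\xdf\xe2][\x80-\xbf]/ ? 1 : 0;
}

printf "8859-2 and cp1250 decode to the same text: %d\n", $same_text;
printf "byte encoded line: lone_high_byte=%d utf8_pair=%d\n",
    lone_high_byte($bytes), utf8_pair($bytes);
printf "utf-8 encoded line: lone_high_byte=%d utf8_pair=%d\n",
    lone_high_byte($utf8), utf8_pair($utf8);
```

The two byte encodings agree on this line, so nothing in the octets gives the mistake away, whereas the utf-8 and byte encoded versions trip opposite patterns. The patterns are only heuristics: byte encoded text that happens to contain a pair like C3 BC would be flagged as utf-8 by mistake.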
# ciconv.pl - example script on detecting encoding set wrong
#
# This is a simple iconv replacement that tries to detect when the user
# accidentally sets the input encoding wrong, ie. sets a byte encoding
# when the input is actually utf-8 or the other way.  This can't be done
# completely reliably, but this may still be useful to catch simple errors.
# This should work in perl 5.8 or newer.
#
# Usage: perl ciconv.pl -f inputencoding -t outputencoding files
#
use warnings; use strict;
use Encode;
use Getopt::Long;

our $default_input_encoding = "iso-8859-1";
our $default_output_encoding = "iso-8859-1";

our($INFH, $input_encoding, $input_postdecobj, $input_declayer, $input_isutf8, $input_decwarn);

sub set_input_encoding {
    my($enc) = @_;
    my $obj = Encode::find_encoding($input_encoding = $enc) or
        die qq(error: unknown input encoding: "$enc");
    if (do { my $ts = "Egyenest oda fog folyamodni\n"; $ts eq $obj->decode($ts) }) {
        # most ascii-based encodings, eg. 8859-*, cp-125*, utf-8, etc:
        # read raw octets, check them, then decode each line ourselves
        $input_declayer = undef;
        $input_postdecobj = $obj;
        $input_isutf8 = do { my $ts = "Ali gy\x{151}zelem-\x{fc}nnepe van ma!\x{201d}\n"; $ts eq $obj->decode(encode_utf8($ts)) };
    } else {
        # wide character encodings like utf-16, also ebcdic or other
        # encodings not related to ascii at all: let an io layer decode
        $input_declayer = ":encoding(" . $obj->name . ")";
        $input_postdecobj = undef;
    }
}

sub getinputline {
    if (!$INFH && !@ARGV) {
        $ARGV = "-";
        open $INFH, "<&=", *STDIN or die "error fduping stdin: $!";
        if ($input_declayer) {
            binmode $INFH, $input_declayer or
                die "error: cannot set input encoding io layer $input_declayer to stdin fdup";
        }
    }
    while (!$INFH || eof($INFH)) {
        if (@ARGV) {
            open $INFH, "<", ($ARGV = shift @ARGV) or
                die "error opening input file \"$ARGV\": $!";
            if ($input_declayer) {
                binmode $INFH, $input_declayer or
                    die "error: cannot set input encoding io layer $input_declayer";
            }
            $input_decwarn = 0;
        } else {
            return undef;
        }
    }
    my $l = <$INFH>;
    if (!defined($l)) {
        warn "read error reading input file $ARGV: $!";
        return undef;
    }
    if ($input_postdecobj) {
        # warn at most once per input file
        if ($input_isutf8) {
            # a lone high byte between ascii bytes cannot be valid utf-8
            $l =~ /(?:[\x00-\x7f]|\A)[\x80-\xff][\x00-\x7f]/ && !$input_decwarn++ and
                warn "warning: input is not utf-8 encoded near $ARGV:$., currently decoding using character encoding $input_encoding, make sure you set the correct input encoding with the -f switch";
        } else {
            # a utf-8 lead byte followed by a continuation byte
            $l =~ /[\xc2-\xdf\xe2][\x80-\xbf]/ && !$input_decwarn++ and
                warn "warning: input seems to be utf-8 encoded near $ARGV:$., currently decoding using character encoding $input_encoding, make sure you set the correct input encoding with the -f switch";
        }
        $l = $input_postdecobj->decode($l, Encode::FB_DEFAULT());
    }
    $l;
}

sub cimain {
    my($inputenc, $outputenc);
    Getopt::Long::Configure qw"gnu_getopt prefix_pattern=(--|-)";
    GetOptions(
        "inputencoding|input-encoding|from-code|f=s", \$inputenc,
        "outputencoding|output-encoding|to-code|t=s", \$outputenc,
    );
    $inputenc ||= $default_input_encoding;
    set_input_encoding($inputenc);
    $outputenc ||= $default_output_encoding;
    my $oobj = Encode::find_encoding($outputenc) or
        die qq(error: unknown output encoding: "$outputenc");
    my $olayer = ":encoding(" . $oobj->name . ")";
    binmode STDOUT, $olayer or
        die "error: cannot set output encoding layer $olayer: $!";
    # test with defined so that a last line of "0" doesn't end the loop early
    while (defined($_ = getinputline())) {
        print $_;
    }
}

cimain();

__END__
In reply to Decode character encodings, warn on user mistake by ambrus