Re: Encode: unable to change encoding of strings

I've decided to use Encode to translate data to UTF-8 as soon as I download it. However, Perl is laughing at my attempts (the string is "Ámbito" in ISO-8859-1 and UTF-8 encodings)

Single-byte values in the range \x80-\xFF have a somewhat ambiguous, magical status in perl 5.8; they may be either single-byte values or "wide" utf8 characters, depending on how they are used. Consider:

perl -e 'print "\xc1\n"' | xxd -g1
0000000: c1 0a                                          ..

perl -CO -e 'print "\xc1\n"' | xxd -g1
0000000: c3 81 0a                                       ...

In the second case, the -CO option on the command line tells perl to apply binmode ":utf8" to STDOUT. Perl 5.8's default behavior for byte values in the range 80-FF is to upgrade these automatically to two-byte utf8 characters when they are written to output through a utf8 PerlIO layer, or when the scalar containing them is explicitly flagged as a utf8 string. Otherwise, they remain single-byte values.

While playing with examples, I also came across the following, which might be instructive (if not too confusing):

$ perl -MEncode -e '$x="\xc1";
$y = decode("iso-8859-1",$x);   # $y has utf8 flag set
$c = ( $x eq $y ) ? "eq":"ne"; 
print "$x $c $y\n";' | od -ctxC

0000000  301       e   q     301  \n
           c1  20  65  71  20  c1  0a
0000007

$ perl -MEncode -e '$x="\xc1";
$y = encode("utf8",$x);         # utf8 flag is not set
$c = ( $x eq $y ) ? "eq":"ne"; 
print "$x $c $y\n";' | od -ctxC
0000000  301       n   e     303 201  \n
           c1  20  6e  65  20  c3  81  0a
0000010

The first case indicates why characters in the range 80-FF have special status in perl 5.8 (and why it's easy to get confused): they come out as single bytes, even when the scalar containing them is explicitly flagged as a utf8 string. (Internally the flagged scalar holds the two-byte utf8 form, but perl converts it back to a single byte when printing to a handle that has no utf8 layer.) Whether they are single-byte or "wide" on output depends on whether you've done "binmode ':utf8'" on the given file handle. I gather this is a kind of "interim solution" intended to make a larger class of common situations "easier" to deal with (even though this default behavior is logically inconsistent with the Unicode Standard).

The second case shows how to assign the actual two-byte utf8 sequence for Á to a scalar, but this makes it "alien" to the perl-5.8 way of doing things. (Adding "-CO" to both cases yields predictable results.)

Anyway, if your problem is displaying PerlMonks pages or other 8859-1 text as utf8 data (which means converting from single-byte-per-char to variable-width-char), the following will suffice:

perl -pe 'BEGIN{binmode STDOUT,":utf8"}' < file.iso > file.utf8

# or, using the more cryptic "-C" option:

perl -CO -pe '' < file.iso > file.utf8
