Titprem has asked for the wisdom of the Perl Monks concerning the following question:

Hi! I have a text file (iso.txt), encoded in ISO-8859-15, containing the following word and characters:
Crèvecœur Œ Æ æ
I would like to convert it to UTF-8 with the Perl script below (perl v5.10.1):
    use Encode;
    open(my $iso,'<:encoding(iso-8859-15)','iso.txt');
    open(my $utf,'>:utf8','utf.txt');
    while(<$iso>){
        print $utf $_;
    }
    close($utf);
    close($iso);
But in the output file (utf.txt), the œ ligature is encoded as "0xC2 0x9C" (and thus is not printed) instead of "0xC5 0x93"! The same goes for Œ. The other ligatures (Æ and æ), however, are encoded properly. Where is the bug? Thanks!

Re: Problem with œ / Œ ligature encoding
by choroba (Cardinal) on Jun 15, 2015 at 17:20 UTC
    Can you show a hexdump of your input file? I tried to create it by reversing the process (i.e. I generated it from a UTF-8 input), and the output of your script was exactly the same as the original UTF-8 input.
    $ xxd utf.in
    00000000: 4372 c3a8 7665 63c5 9375 720a c592 0ac3  Cr..vec..ur.....
    00000010: 860a c3a6 0a                             .....
    $ perl -we 'binmode *STDOUT, "encoding(iso-8859-15)";
                open my $IN, "<:encoding(utf-8)", "utf.in" or die $!;
                print while <$IN>;
               ' > iso.txt
    $ xxd iso.txt
    00000000: 4372 e876 6563 bd75 720a bc0a c60a e60a  Cr.vec.ur.......
    $ ./1130501.pl
    $ diff utf.out utf.in
    # No output
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: Problem with œ / Œ ligature encoding
by tonto (Friar) on Jun 15, 2015 at 17:43 UTC

    I ran your script with your input file, and the output file is exactly like the input, only in UTF-8. Are you sure your iso.txt is in ISO-8859-15?

      My bad, you're right! The input file was encoded in Windows-1252... Sorry. :(
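
The symptom in the question fits this diagnosis: in Windows-1252 the œ ligature is the single byte 0x9C, while in ISO-8859-15 that byte is a C1 control character and œ sits at 0xBD. The following is a minimal sketch of that difference, not part of the original thread; the helper name utf8_hex is hypothetical, and 'cp1252' is the name Encode uses for Windows-1252.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode a raw byte string with the given charset,
# re-encode the result as UTF-8, and return the UTF-8 bytes as hex.
sub utf8_hex {
    my ($charset, $bytes) = @_;
    my $chars = decode($charset, $bytes);
    return join ' ', map { sprintf '0x%02X', ord $_ } split //, encode('UTF-8', $chars);
}

# The same input byte, interpreted under the two candidate charsets.
for my $charset ('iso-8859-15', 'cp1252') {
    printf "%-12s %s\n", $charset, utf8_hex($charset, "\x9C");
}
```

Decoded as ISO-8859-15, the byte becomes U+009C and is written out as 0xC2 0x9C, which is what the question reports; decoded as cp1252 it becomes œ (U+0153) and is written as 0xC5 0x93. In the original script, changing the input layer to '<:encoding(cp1252)' would presumably be the corresponding fix.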
Re: Problem with œ / Œ ligature encoding
by Anonymous Monk on Jun 15, 2015 at 17:21 UTC
    Try adding use Unicode::CharName 'uname'; to your script and warn uname ord $1 while /(.)/g; to your while loop. That way you can see exactly how Perl decodes your text, and whether the error happens in the decoding or the encoding phase.
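
A rough sketch of the same diagnostic, for readers without Unicode::CharName installed: it swaps in the core charnames module (charnames::viacode) instead of uname, so it is a variation on the suggestion above rather than the code the reply proposes. The helper describe_chars and the sample byte string are assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ();    # core module; provides charnames::viacode

# Hypothetical helper: return one "U+XXXX NAME" string per character
# of an already-decoded string.
sub describe_chars {
    my ($text) = @_;
    return map {
        my $code = ord $_;
        sprintf 'U+%04X %s', $code, charnames::viacode($code) // '(no name)';
    } split //, $text;
}

# Assumed sample: "Crèvecœur" as Windows-1252 bytes, decoded (wrongly)
# as ISO-8859-15, which mirrors the situation in the question.
my $decoded = decode('iso-8859-15', "Cr\xE8vec\x9Cur");
print "$_\n" for describe_chars($decoded);
```

If the seventh character is reported as U+009C rather than U+0153, the text was already wrong after decoding, which points at the input encoding and not at the UTF-8 output layer. Inside the original while loop, the equivalent call would be describe_chars($_).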