dwarf has asked for the wisdom of the Perl Monks concerning the following question:

All right, here is the scoop: I have an Excel file which has some Central European characters in it. I want to read those characters and replace them with something else (extended ASCII characters). I first used the ordinary s/// operator:
while (defined($line = <>)) { chomp $line; $line =~ s/char/replacement/g; }
But this didn't work (of course). Then I added the utf8 pragma:
use utf8; while (defined($line = <>)) { chomp $line; $line =~ s/char/replacement/g; }
But this only gave me a warning that I have "illegal Unicode characters" or something like that... So my question is simple: has anyone played with Unicode support in Perl, and does anyone have an idea how I can do this? Oh, and yes, I'm writing this on Win2K Server, Excel is 2000 and Perl is 5.6.0 ActiveState (yes, a Windows project).

Re: Unicode in Perl
by mirod (Canon) on Sep 15, 2001 at 12:42 UTC

    You have to trust Perl here: if you get an "illegal Unicode characters" warning, your data probably isn't Unicode. My guess would be that it's just ISO 8859-2 (see The ISO 8859 Alphabet Soup for more details).

    There are probably modules to help you convert this, although I don't know them, but a naive way would be to use tr/// to do the substitution.

    Something like this would work:

    perl -e' $s="à tarif très réduit"; $s=~ tr/\xE0\xE8\xE9/aee/; print $s, "\n";'

    Note that I used latin-1 (ISO-8859-1) characters because that's what my editor likes, but you can use the ISO-8859-2 table to define your own substitution.
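    For the ISO-8859-2 case, the same tr/// idea might look like the sketch below. It assumes the file really is ISO-8859-2 and only covers a handful of letters; the helper name strip_latin2_accents and the choice of ASCII replacements are made up for illustration, so extend the table to whatever your data needs. It avoids any conversion modules, so it should not depend on anything beyond core Perl.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): replace a few ISO-8859-2
# bytes with unaccented ASCII letters.
# Byte values assumed from the ISO-8859-2 table:
#   A9=S-caron AE=Z-caron B9=s-caron BE=z-caron C6=C-acute
#   C8=C-caron D0=D-stroke E6=c-acute E8=c-caron F0=d-stroke
sub strip_latin2_accents {
    my ($text) = @_;
    $text =~ tr/\xA9\xAE\xB9\xBE\xC6\xC8\xD0\xE6\xE8\xF0/SZszCCDccd/;
    return $text;
}

# "casa" with c-caron and s-caron, written as raw ISO-8859-2 bytes
my $sample = "\xE8a\xB9a";
print strip_latin2_accents($sample), "\n";   # prints "casa"
```

    Writing the bytes as \x escapes keeps the script itself plain ASCII, so it does not matter which encoding your editor saves it in.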

Re: Unicode in Perl
by John M. Dlugosz (Monsignor) on Sep 15, 2001 at 06:01 UTC
    I've used Unicode in Perl extensively, and logged many of the bugs noted in ActiveState's database. Bleeding edge kind of guy, I guess.

    Re "something like that": do you mean illegal UTF-8 codes? If you are getting the data via OLE from Excel, look for a switch that enables proper Unicode returns. I ran into that a short time ago with Word, and there is a node where someone answered me. You set UTF8 as the Code Page in the OLE module, but I forget the exact call. I still had issues, so find my previous note...

    —John

      Well, I am getting the data from a text file. I never used OLE before from Perl, to tell you the truth. I appreciate the info though, will try to find something about that. Thanks for the help...
        Well, then look at the data from the text file in hex to determine what encoding system it's using. Presumably it's NOT UTF-8, from the warning you get. So it's probably using some Windows Code Page.

        —John
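        A minimal sketch of the "look at it in hex" step, assuming you have already read a line of the file into a string (open the file with binmode on Windows so the bytes arrive untouched). The helper names hex_bytes and high_bytes are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: show every byte of a string as two-digit hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# Hypothetical helper: list the distinct bytes above 0x7F, sorted.
# These are the ones that tell the encodings apart.
sub high_bytes {
    my ($bytes) = @_;
    my %seen;
    $seen{ sprintf '%02X', ord($_) } = 1
        for grep { ord($_) > 0x7F } split //, $bytes;
    return sort keys %seen;
}

my $line = "\xE8a\xB9a";                # stand-in for a line read from the file
print hex_bytes($line), "\n";           # prints "E8 61 B9 61"
print join(' ', high_bytes($line)), "\n";   # prints "B9 E8"
```

        As a rough guide: an s-with-caron stored as B9 suggests ISO-8859-2, as 9A suggests Windows code page 1250, and as the two-byte sequence C5 A1 suggests UTF-8. Compare a few letters you know are in the file against the code tables before deciding.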