Unicode in Perl

dwarf has asked for the wisdom of the Perl Monks concerning the following question:

All right, here is the scoop: I have an Excel file which has some Central European characters in it. I want to read those characters and replace them with something else (extended ASCII characters). I first used the ordinary s/// operator:

while (chomp($line = <>)){
$line =~ s/char/replacement/g;
}

But this didn't work (of course). Then I used the utf8 module:

use utf8;
while (chomp($line = <>)){
$line =~ s/char/replacement/g;
}

But this only gave me a warning that I have "illegal Unicode characters" or something like that... So my question is simple: has anyone played with Unicode support in Perl, and does anyone have an idea how I can do this? Oh, and yes, I'm writing this on Win2K Server, Excel is 2000 and Perl is 5.6.0 ActiveState (yes, a Windows project).


Replies are listed 'Best First'.
Re: Unicode in Perl by mirod (Canon) on Sep 15, 2001 at 12:42 UTC
You have to trust Perl here: if you get an "illegal Unicode characters" warning, your data probably isn't Unicode. My guess would be that it's just ISO 8859-2 (see The ISO 8859 Alphabet Soup for more details). There are probably modules to help you convert this, although I don't know them, but a naive way would be to use `tr///` to do the substitution. Something like this would work: `perl -e' $s="à tarif très réduit"; $s=~ tr/\xE0\xE8\xE9/aee/; print $s, "\n";'` Note that I used Latin-1 (ISO-8859-1) characters because that's what my editor likes, but you can use the ISO-8859-2 table to define your own substitution.
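For illustration, the same `tr///` idea can be sketched for ISO 8859-2 input. This is only a minimal sketch under assumptions: the sub name `latin2_to_ascii` is made up for this example, the data is assumed to really be ISO 8859-2 bytes, and the table covers just a handful of letters that you would extend from the ISO 8859-2 chart.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): replace a few ISO 8859-2
# bytes with plain ASCII look-alikes. Assumes the input is a byte string
# that really is ISO 8859-2.
sub latin2_to_ascii {
    my ($text) = @_;
    # \xB1 a-ogonek, \xB3 l-stroke, \xB6 s-acute, \xB9 s-caron,
    # \xBE z-caron,  \xE6 c-acute,  \xE8 c-caron, \xEA e-ogonek
    $text =~ tr/\xB1\xB3\xB6\xB9\xBE\xE6\xE8\xEA/alsszcce/;
    return $text;
}

# Sample bytes for the Polish word written with l-stroke and a-ogonek.
print latin2_to_ascii("\xB3\xB1ka"), "\n";   # prints "laka"
```

To use it as a filter, you could call the sub on each line read with `while (defined(my $line = <>))` and print the result; unlike `while (chomp(...))`, that loop does not stop early on a last line that lacks a trailing newline.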
Re: Unicode in Perl by John M. Dlugosz (Monsignor) on Sep 15, 2001 at 06:01 UTC
I've used Unicode in Perl extensively, and logged many of the bugs noted in ActiveState's database. Bleeding edge kind of guy, I guess. Re "something like that": do you mean illegal UTF-8 codes? If you are getting the data via OLE from Excel, look for a switch to enable proper Unicode returns. I ran into that a short time ago with Word, and there is a node where someone answered me. Set UTF8 as the Code Page in the OLE module, but I forget the exact call. I still had issues, so find my previous note... --John
Re: Re: Unicode in Perl by dwarf (Initiate) on Sep 15, 2001 at 10:36 UTC
Well, I am getting the data from a text file. I have never used OLE from Perl before, to tell you the truth. I appreciate the info though, and will try to find something about that. Thanks for the help...
Re: Re: Re: Unicode in Perl by John M. Dlugosz (Monsignor) on Sep 16, 2001 at 04:37 UTC
Well, then look at the data from the text file in hex to determine what encoding it's using. Presumably it's NOT UTF-8, judging from the warning you get. So it's probably using some Windows Code Page. --John
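One way to act on the "look at it in hex" suggestion, as a rough sketch only: the sub names `hex_bytes` and `looks_like_utf8` are invented for this example, and the UTF-8 check is a loose structural test (it does not reject every invalid sequence), not a validator.

```perl
use strict;
use warnings;

# Hypothetical helper: show every byte of a string as two hex digits.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

# Hypothetical helper: loose check that the bytes are laid out like UTF-8
# (each lead byte followed by the right number of continuation bytes).
sub looks_like_utf8 {
    my ($bytes) = @_;
    return $bytes =~ /\A(?:[\x00-\x7F]
                         |[\xC2-\xDF][\x80-\xBF]
                         |[\xE0-\xEF][\x80-\xBF]{2}
                         |[\xF0-\xF4][\x80-\xBF]{3})*\z/x ? 1 : 0;
}

my $single_byte = "\xE8aj";      # c-caron as one byte (ISO 8859-2 / CP1250)
my $utf8        = "\xC4\x8Daj";  # c-caron encoded as UTF-8

print hex_bytes($single_byte), "\n";        # prints "E8 61 6A"
print looks_like_utf8($single_byte), "\n";  # prints "0"
print looks_like_utf8($utf8), "\n";         # prints "1"
```

When reading a real file on Windows, open it and call `binmode` on the handle first so the bytes arrive unmodified. A lone byte above 0x7F surrounded by ASCII, as in the first sample, points to a single-byte code page rather than UTF-8.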