in reply to Re: UTF-8 to ISO-8859-1
in thread UTF-8 to ISO-8859-1

What's the solution in 5.005_03?

------
We are the carpenters and bricklayers of the Information Age.

Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

Replies are listed 'Best First'.
Re: Re2: UTF-8 to ISO-8859-1
by bart (Canon) on Mar 05, 2003 at 01:20 UTC
    Well, several solutions have been pointed to. Here's one I have used myself.
    my(%encoding,%decoding); sub UTF8::chr ($) { my $ord = shift; if($ord && $ord < 0x80) { return chr $ord; # OR: pack 'C', $ord; } elsif ($ord < 0x800) { return pack 'C2', 0xC0 | ($ord>>6), 0x80 | ($ord & 0x3F); } else { return pack 'C3', 0xE0 | ($ord>>12), 0x80 | (($ord>>6) & 0x3F) +, 0x80 | ($ord & 0x3F); } } #initialize for my $ord (0, 128 .. 256) { $encoding{chr $ord} = UTF8::chr($ord); } %decoding = reverse %encoding; sub UTF8_to_L1 { foreach (@_ = @_) { s/(\000|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xFF][\x80-\xBF][\x80-\xB +F])/$decoding{$1} || "(#$1#)"/ge; } return wantarray ? @_ : pop; } sub L1_to_UTF8 { foreach (@_ = @_) { s/([\000\x80-\xFF])/$encoding{$1}/g; } return wantarray?@arg:$arg[-1]; }
    In order to make it work for 5.6 too, you need to "disarm" the UTF-8 strings in the UTF8_to_L1 sub, for example using pack('C0a*', $string)

    For completeness sake, here's a sub to turn UTF-8 strings into a ordinal:

    sub UTF8::ord ($) { my $chr = shift; unless ($chr =~ /^([\300-\377][\200-\277]+)/) { return ord $chr; # 1 byte } my @ord = unpack 'C*', $1; if($ord[0] & 0x20) { # 0xE0 .. 0xFF return ($ord[0] & 0x1F)<<12 | ($ord[1] & 0x3F)<<6 | $ord[2] & +0x3F; } else { return ($ord[0] & 0x1F)<<6 | $ord[1] & 0x3F; } }