Re2: UTF-8 to ISO-8859-1

What's the solution in 5.005_03?

------
We are the carpenters and bricklayers of the Information Age.

Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

Comment on Re2: UTF-8 to ISO-8859-1

Replies are listed 'Best First'.
Re: Re2: UTF-8 to ISO-8859-1 by bart (Canon) on Mar 05, 2003 at 01:20 UTC
Well, several solutions have been pointed to. Here's one I have used myself. my(%encoding,%decoding); sub UTF8::chr ($) { my $ord = shift; if($ord && $ord < 0x80) { return chr $ord; # OR: pack 'C', $ord; } elsif ($ord < 0x800) { return pack 'C2', 0xC0 \| ($ord>>6), 0x80 \| ($ord & 0x3F); } else { return pack 'C3', 0xE0 \| ($ord>>12), 0x80 \| (($ord>>6) & 0x3F) +, 0x80 \| ($ord & 0x3F); } } #initialize for my $ord (0, 128 .. 256) { $encoding{chr $ord} = UTF8::chr($ord); } %decoding = reverse %encoding; sub UTF8_to_L1 { foreach (@_ = @_) { s/(\000\|[\xC0-\xDF][\x80-\xBF]\|[\xE0-\xFF][\x80-\xBF][\x80-\xB +F])/$decoding{$1} \|\| "(#$1#)"/ge; } return wantarray ? @_ : pop; } sub L1_to_UTF8 { foreach (@_ = @_) { s/([\000\x80-\xFF])/$encoding{$1}/g; } return wantarray?@arg:$arg[-1]; } [download] In order to make it work for 5.6 too, you need to "disarm" the UTF-8 strings in the UTF8_to_L1 sub, for example using `pack('C0a', $string)` For completeness sake, here's a sub to turn UTF-8 strings into a ordinal: `sub UTF8::ord ($) { my $chr = shift; unless ($chr =~ /^([\300-\377][\200-\277]+)/) { return ord $chr; # 1 byte } my @ord = unpack 'C', $1; if($ord[0] & 0x20) { # 0xE0 .. 0xFF return ($ord[0] & 0x1F)<<12 \| ($ord[1] & 0x3F)<<6 \| $ord[2] & +0x3F; } else { return ($ord[0] & 0x1F)<<6 \| $ord[1] & 0x3F; } }` [download]	[reply] [d/l] [select]

Replies are listed 'Best First'.

Re: Re2: UTF-8 to ISO-8859-1
by bart (Canon) on Mar 05, 2003 at 01:20 UTC

my(%encoding,%decoding);

sub UTF8::chr ($) {
    my $ord = shift;
    if($ord && $ord < 0x80) {
        return chr $ord;   # OR:  pack 'C', $ord;
    } elsif ($ord < 0x800) {
        return pack 'C2', 0xC0 | ($ord>>6), 0x80 | ($ord & 0x3F);
    } else {
        return pack 'C3', 0xE0 | ($ord>>12), 0x80 | (($ord>>6) & 0x3F)
+, 0x80 | ($ord & 0x3F);
    }
}

#initialize
for my $ord (0, 128 .. 256) {
    $encoding{chr $ord} = UTF8::chr($ord);
}
%decoding = reverse %encoding;

sub UTF8_to_L1 {
    foreach (@_ = @_) {
        s/(\000|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xFF][\x80-\xBF][\x80-\xB
+F])/$decoding{$1} || "(#$1#)"/ge;
    }
    return wantarray ? @_ : pop;
}

sub L1_to_UTF8 {
    foreach (@_ = @_) {
        s/([\000\x80-\xFF])/$encoding{$1}/g;
    }
    return wantarray?@arg:$arg[-1];
}
[download]

pack('C0a*', $string)

For completeness sake, here's a sub to turn UTF-8 strings into a ordinal:

sub UTF8::ord ($) {
    my $chr = shift;
    unless ($chr =~ /^([\300-\377][\200-\277]+)/) {
        return ord $chr; # 1 byte
    }
    my @ord = unpack 'C*', $1;
    if($ord[0] & 0x20) {  # 0xE0 .. 0xFF
        return ($ord[0] & 0x1F)<<12 | ($ord[1] & 0x3F)<<6 | $ord[2] & 
+0x3F;
    } else {
        return ($ord[0] & 0x1F)<<6 | $ord[1] & 0x3F;
    }
}
[download]

[reply]
[d/l]
[select]