Re: UTF-8 to ISO-8859-1

On 5.6.x, it can be as simple as

$latin1 = pack 'C*', unpack 'U*', $utf8;

On 5.8.0 (and later), use the Encode module.
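
A minimal sketch of what that could look like on 5.8 or later, assuming the input arrives as raw UTF-8 bytes. The variable names and the sample string are only illustrative; `decode` and `encode` are the Encode functions for converting to and from Perl's character strings.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the UTF-8 encoded bytes of "élève".
my $utf8_bytes = "\xC3\xA9l\xC3\xA8ve";

# Step 1: interpret the bytes as UTF-8, giving a string of characters.
my $chars = decode('UTF-8', $utf8_bytes);

# Step 2: serialize those characters as ISO-8859-1 bytes.
my $latin1 = encode('ISO-8859-1', $chars);

# Show the resulting bytes in hex.
print join(' ', map { sprintf '%02X', ord } split //, $latin1), "\n";
# prints: E9 6C E8 76 65
```

With the default settings, a character that has no Latin1 equivalent is replaced by a substitution character rather than raising an error, so check the Encode docs if you need stricter handling.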

Some of the solutions proposed here only work well on pre-5.6 systems, because from 5.6.0 on, perl has built-in magic that automatically converts Latin1 to UTF-8 without you asking for it. Like this (on 5.6.1):

$latin1 = 'élève';
$utf8 = chr(8801);
print join ' ', $latin1, $utf8;

Result: Ã©lÃ¨ve â‰¡
As you can see, the Latin1 is converted into UTF-8. This renders a lot of the code that used to work on 5.005 and earlier useless: you can't turn UTF-8 into Latin1 that way, as perl will undo your replacements.

The mechanism behind all this is that each string has a flag attached to it, much like the taint flag, indicating whether the string is in UTF-8 or in bytes. When you join byte strings with UTF-8 strings, perl converts the byte strings to UTF-8, and the resulting string is marked as UTF-8 as well. Personally I really, really hate this behaviour.
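
To make the flag visible, here is a small sketch for 5.8 or later, where `utf8::is_utf8` and `utf8::encode` are available (they are not an option on 5.6.1, so treat this as an illustration of the same mechanism rather than a 5.6 recipe). The variable names are only illustrative.

```perl
use strict;
use warnings;

my $latin1 = "\xE9l\xE8ve";   # "élève" as five Latin1 bytes
my $wide   = chr(8801);       # U+2261, does not fit in a single byte

# Before joining, the byte string does not carry the UTF-8 flag.
my $flag_before = utf8::is_utf8($latin1) ? 1 : 0;

# Joining a byte string with a flagged string upgrades the result.
my $joined     = join ' ', $latin1, $wide;
my $flag_after = utf8::is_utf8($joined) ? 1 : 0;

# Look at the UTF-8 bytes of the joined string: each Latin1
# character above 0x7F now takes two bytes.
my $internal = $joined;
utf8::encode($internal);

print "flag before: $flag_before, flag after: $flag_after\n";
print "characters: ", length($joined), ", UTF-8 bytes: ", length($internal), "\n";
```

The seven characters end up as eleven bytes, which is exactly the "Ã©lÃ¨ve â‰¡" you see when those bytes are displayed as Latin1.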

There are ways around it: in 5.6, using pack, you can turn a string into bytes or into UTF-8 without the bytes themselves being touched, effectively only setting or clearing the UTF-8 flag on the resulting string.

$bytes = pack 'C0a*', $utf8;
$utf8 = pack 'U0a*', $bytes;

See the docs on pack for 5.6.1. Search for "C0".

5.8 has less hackish methods built in. See utf8 and Encode.
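
As a rough sketch of the utf8 route on 5.8 or later: `utf8::decode` reinterprets a byte string as UTF-8 in place, and `utf8::downgrade` then converts the result back to one byte per character, which for characters up to 0xFF is Latin1. The sample string and variable name are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical input: the UTF-8 encoded bytes of "élève".
my $string = "\xC3\xA9l\xC3\xA8ve";

# Turn the UTF-8 byte sequence into characters, in place.
# Returns false if the bytes are not valid UTF-8.
utf8::decode($string) or die "input is not valid UTF-8";

# Convert to the one-byte-per-character representation.
# With a true second argument it returns false instead of dying
# when the string contains characters above 0xFF.
utf8::downgrade($string, 1)
    or die "input contains characters outside Latin1";

print length($string), " bytes, flag is ",
    (utf8::is_utf8($string) ? "on" : "off"), "\n";
```

Unlike Encode, this does no substitution: anything that does not fit in Latin1 simply makes the downgrade fail, which may or may not be what you want.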


Re2: UTF-8 to ISO-8859-1 by dragonchild (Archbishop) on Mar 04, 2003 at 16:25 UTC
What's the solution in 5.005_03?

------
We are the carpenters and bricklayers of the Information Age.
Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.
Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.
Re: Re2: UTF-8 to ISO-8859-1 by bart (Canon) on Mar 05, 2003 at 01:20 UTC
Well, several solutions have been pointed to. Here's one I have used myself.

my (%encoding, %decoding);

# Encode one code point (up to 0xFFFF) as a string of UTF-8 bytes.
# Code point 0 falls through to the two-byte form (0xC0 0x80).
sub UTF8::chr ($) {
    my $ord = shift;
    if ($ord && $ord < 0x80) {
        return chr $ord;    # OR: pack 'C', $ord;
    } elsif ($ord < 0x800) {
        return pack 'C2', 0xC0 | ($ord >> 6), 0x80 | ($ord & 0x3F);
    } else {
        return pack 'C3', 0xE0 | ($ord >> 12), 0x80 | (($ord >> 6) & 0x3F),
            0x80 | ($ord & 0x3F);
    }
}

# initialize: map every non-ASCII Latin1 character (and NUL) to its UTF-8 form
for my $ord (0, 128 .. 255) {
    $encoding{chr $ord} = UTF8::chr($ord);
}
%decoding = reverse %encoding;

sub UTF8_to_L1 {
    foreach (@_ = @_) {    # work on copies, not on the caller's variables
        # sequences with no Latin1 equivalent are marked as (#...#)
        s/(\000|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xFF][\x80-\xBF][\x80-\xBF])/$decoding{$1} || "(#$1#)"/ge;
    }
    return wantarray ? @_ : pop;
}

sub L1_to_UTF8 {
    foreach (@_ = @_) {
        s/([\000\x80-\xFF])/$encoding{$1}/g;
    }
    return wantarray ? @_ : pop;
}

In order to make it work for 5.6 too, you need to "disarm" the UTF-8 strings in the UTF8_to_L1 sub, for example using `pack('C0a*', $string)`.

For completeness' sake, here's a sub to turn a UTF-8 character into its ordinal:

sub UTF8::ord ($) {
    my $chr = shift;
    unless ($chr =~ /^([\300-\377][\200-\277]+)/) {
        return ord $chr;    # 1 byte
    }
    my @ord = unpack 'C*', $1;    # all bytes of the sequence
    if ($ord[0] & 0x20) {    # 0xE0 .. 0xFF
        return ($ord[0] & 0x1F) << 12 | ($ord[1] & 0x3F) << 6 | $ord[2] & 0x3F;
    } else {
        return ($ord[0] & 0x1F) << 6 | $ord[1] & 0x3F;
    }
}