Re: Printing undecoded utf8 -- safe?

It looks to me like you're using Perl 5.8.x, since Perl 5.6.x behaves differently from what you describe in this regard. My answer assumes the same context.

Yes, the output will be garbled: perl thinks the contents of the string are ISO-Latin-1, so they will be "helpfully" converted to UTF-8 in the process.
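
To make the garbling concrete, here is a minimal sketch. It assumes the conversion happens the way a UTF-8 output layer would do it, which I imitate with Encode::encode so the bytes can be inspected; the sample string is my own, not from your data.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Raw UTF-8 octets for e-acute (U+00E9), as read without any decoding.
# Perl sees two separate Latin-1 characters here, not one.
my $raw_utf8 = "\xC3\xA9";

# Imitate what a UTF-8 output layer does: every "character" is encoded again.
my $printed = encode('UTF-8', $raw_utf8);

printf "input:  %vX\n", $raw_utf8;   # C3.A9
printf "output: %vX\n", $printed;    # C3.83.C2.A9
```

Two octets go in and four come out, which is the classic double-encoded mess.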

You could just set the UTF-8 flag on the string and leave the bytes as they are. One way is to use the function _utf8_on() in Encode (it's not exactly private, but you're advised to use it very sparingly). Another way is to use pack like this:

$perl_utf8 = pack 'U0a*', $raw_utf8;

I'd recommend checking afterwards that the string is in a "consistent state", with utf8::valid(), for example.
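
A rough sketch of the flag-then-check approach, using Encode::_utf8_on() rather than pack. The sample octets and variable names other than $raw_utf8 and $perl_utf8 are my own.

```perl
use strict;
use warnings;
use Encode ();

# Raw UTF-8 octets for the euro sign (U+20AC); Perl sees three bytes.
my $raw_utf8 = "\xE2\x82\xAC";

# Work on a copy: _utf8_on() changes only the flag, never the bytes.
my $perl_utf8 = $raw_utf8;
Encode::_utf8_on($perl_utf8);

# The flag is set blindly, so verify the result before using it.
utf8::valid($perl_utf8)
    or die "octets are not well-formed UTF-8\n";

printf "%d octets became %d character(s), U+%04X\n",
    length($raw_utf8), length($perl_utf8), ord($perl_utf8);

# A truncated sequence gets the flag just the same,
# which is why the check matters.
my $truncated = "\xE2\x82";
Encode::_utf8_on($truncated);
my $truncated_ok = utf8::valid($truncated) ? 1 : 0;
```

The point of the second half is that _utf8_on() never complains; only the explicit check tells you the string is broken.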

p.s. I just came across this function in the docs for utf8:

utf8::decode($string)
Attempts to convert in-place the octet sequence in UTF-X to the corresponding character sequence. The UTF-8 flag is turned on only if the source string contains multiple-byte UTF-X characters. If $string is invalid as UTF-X, returns false; otherwise returns true.

I haven't tried it, but it sounds like something you could use.
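
Going by that description, usage would look something like the sketch below. The sample strings are made up for illustration.

```perl
use strict;
use warnings;

# "cafe" with an e-acute, as 5 raw UTF-8 octets.
my $string = "caf\xC3\xA9";
my $before = length $string;

# Converts in place; the return value only reports success or failure.
my $ok = utf8::decode($string);
printf "ok=%d, length %d -> %d\n", $ok ? 1 : 0, $before, length $string;

# The same word in Latin-1 is not valid UTF-8, so decoding is refused.
my $latin1 = "caf\xE9";
my $bad_ok = utf8::decode($latin1);
print "Latin-1 input rejected\n" unless $bad_ok;
```

Since the function works in place, check the return value instead of assigning it back to the string.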


Replies are listed 'Best First'.
Re^2: Printing undecoded utf8 -- safe?
by ryantate (Friar) on Mar 06, 2006 at 17:53 UTC

Thanks muchly. You are correct that I am on 5.8.x (x==4).

utf8::decode I had not considered -- I thought maybe utf8::upgrade, but now it looks like that is only for actual Latin-1 strings.

What I think I'll end up doing is use HTML::Encoding to properly sniff out the encoding of various docs I pull off the Web from LWP, since I shouldn't be making assumptions about their encoding anyway. (In this particular case I have one doc I know is UTF-8, but it is entirely possible I'll come across other encodings down the line.) Then use Encode::decode to decode each doc (to a Perl utf8 string, if I understand correctly) based on whatever encoding I get from HTML::Encoding.

Tough going, this utf8 business.
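
For the decoding step, a minimal sketch might look like this. The helper decode_doc and its $enc_name argument are hypothetical names; the encoding name is assumed to come from the sniffing step, which is not shown here because I don't want to guess at the HTML::Encoding interface.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: turn raw octets into a Perl character string,
# given an encoding name obtained elsewhere (e.g. from sniffing the doc).
# Returns undef for an unknown encoding name or malformed input.
sub decode_doc {
    my ($octets, $enc_name) = @_;

    my $enc = find_encoding($enc_name)
        or return;

    # FB_CROAK makes malformed input die instead of being silently
    # replaced; the eval turns that into an undef return value.
    my $text = eval { $enc->decode($octets, Encode::FB_CROAK) };
    return $text;
}

my $from_utf8   = decode_doc("caf\xC3\xA9", 'UTF-8');
my $from_latin1 = decode_doc("caf\xE9",     'iso-8859-1');

print "same text either way\n" if $from_utf8 eq $from_latin1;
```

Either way the result is a Perl character string, so the rest of the program no longer has to care which encoding a document arrived in.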
