Printing undecoded utf8 -- safe?

ryantate has asked for the wisdom of the Perl Monks concerning the following question:

I have a string that I know is utf8 but perl's utf8 flag is not set on the string and I have not run decode('UTF-8' => $string).

If I print this string to a filehandle with :utf8 layer set (via binmode or at open), will the utf8 be garbled somehow, since the string is not marked/known utf8? Or will everything be OK, since the utf8 string is in byte mode anyway?

I have read perlunicode, perluniintro, utf8, encoding, Encode and other docs and this is one thing I cannot figure out.

This is important because LWP does not seem to set the utf8 flag for UTF-8 encoded HTML strings, even when UTF-8 is sent in the appropriate HTTP header. And I am doing some scraping on these pages and printing some of the text back out on a filehandle where I set the layer to ':utf8' and I'm hoping this will Just Work.

If not, I need to add some code that applies HTML::Encoding to the HTTP response, and even to the HTML itself, to sniff out the encoding and, if one is found, call Encode::decode() on the content.

An example of LWP not setting the utf8 flag on UTF-8 encoded HTML:

use strict;
use warnings;

use LWP::Simple;
use Encode;

#I have confirmed the correct HTTP header sent here
my $html = get('http://-redacted-/file.utf8.html');

print "utf8: " . utf8::is_utf8($html) . "\n";
#Output is utf8: 

my $html2 = decode('UTF-8' => $html);

print "utf8: " . utf8::is_utf8($html2) . "\n";
#Output is utf8: 1

Re: Printing undecoded utf8 -- safe? by bart (Canon) on Mar 06, 2006 at 08:35 UTC
It looks to me like you're using perl 5.8.x, as Perl 5.6.x behaves differently from your description in this regard. My answer will be in the same context.

Yes, the output will be garbled: perl thinks the contents of the string are ISO-Latin-1, and they will be "helpfully" converted to UTF-8 in the process, so every byte above 0x7F gets encoded a second time.

You could just set the UTF-8 flag on the string, and leave the bytes as they are. One way is to use the private function `_utf8_on()` in Encode (well, it's not exactly private, but you're advised to use it very sparingly). Another way is to use pack this way: `$perl_utf8 = pack 'U0a*', $raw_utf8;`

I'd recommend checking that the UTF-8 is in a "consistent state" afterwards, with `utf8::valid()`, for example.

P.S. I just came across this function in the docs for utf8:

utf8::decode($string)
Attempts to convert in-place the octet sequence in UTF-X to the corresponding character sequence. The UTF-8 flag is turned on only if the source string contains multiple-byte UTF-X characters. If $string is invalid as UTF-X, returns false; otherwise returns true.

I haven't tried it, but it sounds like something you could use.
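To make the double encoding concrete, here is a minimal sketch that writes the same text through a :utf8 layer twice, once undecoded and once decoded. It assumes a perl with in-memory filehandle support (5.8 or later); `octets_written` is a hypothetical helper invented for this illustration, not part of any module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "é" (U+00E9) as raw UTF-8 octets; the UTF-8 flag is off, as with LWP output.
my $raw = "\xC3\xA9";

# Hypothetical helper: print $string through a :utf8 layer into an
# in-memory filehandle and return the octets that were actually written.
sub octets_written {
    my ($string) = @_;
    my $buffer = '';
    open my $fh, '>:utf8', \$buffer
        or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh or die "cannot close in-memory handle: $!";
    return $buffer;
}

# Render a byte string as space-separated hex for inspection.
sub hex_dump {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $octets;
}

# Undecoded: each byte is treated as a Latin-1 character and re-encoded.
my $garbled = octets_written($raw);

# Decoded first: the single character U+00E9 is encoded exactly once.
my $correct = octets_written(decode('UTF-8', $raw));

# utf8::decode does the same job in place and returns false on invalid input.
my $in_place = $raw;
utf8::decode($in_place) or die "not valid UTF-8";
my $also_correct = octets_written($in_place);

print "undecoded: ", hex_dump($garbled), "\n";   # C3 83 C2 A9
print "decoded:   ", hex_dump($correct), "\n";   # C3 A9
```

The two bytes of the undecoded string come out as four, which is the garbling described above; decoding first (with either function) lets the layer produce the original two bytes.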
Re^2: Printing undecoded utf8 -- safe? by ryantate (Friar) on Mar 06, 2006 at 17:53 UTC
Thanks muchly. You are correct that I am on 5.8.x (x==4). utf8::decode I had not considered -- I thought maybe utf8::upgrade, but now it looks like that is only for actual Latin-1 strings.

What I think I'll end up doing is using HTML::Encoding to properly sniff out the encoding of the various docs I pull off the Web with LWP, since I shouldn't be making assumptions about their encoding anyway. (In this particular case I have one doc I know is UTF-8, but it is entirely possible I'll come across other encodings down the line.) Then I'll use Encode::decode to decode each doc (to a Perl utf8 string, if I understand correctly) based on whatever encoding I get from HTML::Encoding.

Tough going, this utf8 business.
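The decoding half of that plan might look roughly like the sketch below. It uses only Encode; the sniffing step is left out, so `$charset` simply stands in for whatever name HTML::Encoding (or any other detection) reports. `decode_document` is a hypothetical helper name, and the sample strings are made up for illustration.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: turn raw octets into a Perl character string,
# given a charset name obtained elsewhere (e.g. from encoding detection).
sub decode_document {
    my ($octets, $charset) = @_;

    # find_encoding returns an encoding object, or undef for unknown names.
    my $encoding = Encode::find_encoding($charset)
        or die "unknown encoding '$charset'";

    # FB_CROAK dies on malformed input; LEAVE_SRC keeps $octets untouched.
    return $encoding->decode($octets,
        Encode::FB_CROAK() | Encode::LEAVE_SRC());
}

# The same word arriving in two different encodings.
my $from_utf8   = decode_document("caf\xC3\xA9", 'UTF-8');
my $from_latin1 = decode_document("caf\xE9",     'ISO-8859-1');

# Once decoded, both are the same 4-character string and can be
# printed safely through a :utf8 layer.
binmode STDOUT, ':utf8';
print "same text\n" if $from_utf8 eq $from_latin1;
print "length: ", length($from_utf8), "\n";
```

Dying on an unknown charset or malformed input is one possible design choice; a scraper that must keep going might prefer to fall back to a default encoding instead.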