ryantate has asked for the wisdom of the Perl Monks concerning the following question:
If I print an undecoded UTF-8 string (one that does not have the utf8 flag set) to a filehandle with the :utf8 layer set (via binmode or at open), will the UTF-8 be garbled somehow, since the string is not marked/known as utf8? Or will everything be OK, since the UTF-8 string is in byte mode anyway?
I have read perlunicode, perluniintro, utf8, encoding, Encode and other docs and this is one thing I cannot figure out.
This is important because LWP does not seem to set the utf8 flag for UTF-8 encoded HTML strings, even when UTF-8 is sent in the appropriate HTTP header. And I am doing some scraping on these pages and printing some of the text back out on a filehandle where I set the layer to ':utf8' and I'm hoping this will Just Work.
If not, I need to add some code that applies HTML::Encoding to the HTTP response, and even to the HTML itself, to sniff out the encoding and, if one is found, calls Encode::decode() on the content.
An example of LWP not setting the utf8 flag on UTF-8 encoded HTML:
    use strict;
    use warnings;
    use LWP::Simple;
    use Encode;

    # I have confirmed the correct HTTP header is sent here
    my $html = get('http://-redacted-/file.utf8.html');
    print "utf8: " . utf8::is_utf8($html) . "\n";
    # Output is utf8:

    my $html2 = decode('UTF-8' => $html);
    print "utf8: " . utf8::is_utf8($html2) . "\n";
    # Output is utf8: 1
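For anyone who wants to test the question without a network fetch, here is a minimal sketch using only core modules. It assumes a hard-coded byte string as a stand-in for what LWP returned, and the helper `written_bytes` is a hypothetical name invented for this illustration. It prints the same text through a :utf8 layer twice, once undecoded and once decoded, into an in-memory buffer so the resulting bytes can be compared.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: stand-in for the undecoded content from LWP. These are the
# UTF-8 bytes for "cafe" with an e-acute (U+00E9), with no utf8 flag set.
my $bytes = "caf\xC3\xA9";

# Hypothetical helper: print a string through a :utf8 layer into an
# in-memory buffer and return the raw bytes that were written.
sub written_bytes {
    my ($string) = @_;
    my $buffer = '';
    open my $fh, '>:utf8', \$buffer
        or die "Cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh or die "Cannot close in-memory handle: $!";
    return $buffer;
}

# Render a byte string as space-separated hex for easy comparison.
sub hex_dump {
    my ($string) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $string;
}

my $garbled = written_bytes($bytes);                   # undecoded input
my $clean   = written_bytes(decode('UTF-8', $bytes));  # decoded input

print "undecoded: ", hex_dump($garbled), "\n";  # 63 61 66 C3 83 C2 A9
print "decoded:   ", hex_dump($clean),   "\n";  # 63 61 66 C3 A9
```

In the undecoded case, each byte above 0x7F is treated as a Latin-1 character and encoded again by the layer, so the e-acute comes out as four bytes instead of two. Decoding first gives the layer real characters to encode, and the output matches the original UTF-8.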
Re: Printing undecoded utf8 -- safe?
by bart (Canon) on Mar 06, 2006 at 08:35 UTC
by ryantate (Friar) on Mar 06, 2006 at 17:53 UTC