':encoding(UTF-8)' corrupts strings from XML::LibXML which doesn't return unicode strings ?

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

XML::LibXML::Parser/XML::LibXML::Node

Is there an option I can flip or am I stuck hand-decoding this?

use XML::LibXML;
binmode STDOUT, ':encoding(UTF-8)';
my $str = XML::LibXML->new(
    qw/ recover 2 /
)->load_html(
    location => q{http://msdn.microsoft.com/en-us/library/aa664812(v=vs.71).aspx},
)->find(
    q{/html/body/div/div[2]/div[2]/div[3]/div[3]/dl[15]/dd[29] }
)->get_node(0)->textContent;
print $str;

When I run the above, the nbsp seems to be double-encoded (it shows up as a-circumflex).

It seems as if libxml is returning UTF-8 bytes, but the string isn't marked as UTF-8?

Adding use Encode; print decode('UTF-8', $str ); seems to resolve this, but I thought libxml could handle this without my help.
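The symptom can be reproduced without touching the network. What follows is only a minimal sketch using the core Encode module. It assumes the string coming back from textContent holds the two characters 0xC2 and 0xA0 (the UTF-8 bytes of a no-break space) rather than the single character U+00A0. The helper hex_bytes and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Illustrative helper: show a string as space-separated hex octets.
sub hex_bytes { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

# Assumption: the undecoded text, i.e. the UTF-8 bytes of U+00A0.
my $bytes = "\xC2\xA0";

# An ':encoding(UTF-8)' layer encodes each of the two characters
# separately, so two bytes become four (a-circumflex, then nbsp).
my $double = encode('UTF-8', $bytes);
print 'without decode: ', hex_bytes($double), "\n";

# Decoding first turns the two bytes into one character, which the
# output layer then encodes back to the original two bytes.
my $chars  = decode('UTF-8', $bytes);
my $single = encode('UTF-8', $chars);
print 'with decode:    ', hex_bytes($single), "\n";
```

The first line of output has four octets and the second has two, which matches the a-circumflex showing up only when the decode step is skipped.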

What am I missing?


Re: ':encoding(UTF-8)' corrupts strings from XML::LibXML which doesn't return unicode strings ? by tobyink (Canon) on Feb 28, 2013 at 23:24 UTC
It may be because the Microsoft website isn't indicating the document's UTF-8-ness in the HTTP headers. If you do the HTTP fetch outside XML::LibXML (using LWP::Simple), all is OK...

use LWP::Simple 'get';
use XML::LibXML;
binmode STDOUT, ':encoding(UTF-8)';
my $str = XML::LibXML->new(
    qw/ recover 2 /
)->load_html(
    string => get q{http://msdn.microsoft.com/en-us/library/aa664812(v=vs.71).aspx},
)->find(
    q{/html/body/div/div[2]/div[2]/div[3]/div[3]/dl[15]/dd[29] }
)->get_node(0)->textContent;
print $str;
Re^2: ':encoding(UTF-8)' corrupts strings from XML::LibXML which doesn't return unicode strings ? by Anonymous Monk on Feb 28, 2013 at 23:39 UTC
"It may be because the Microsoft website isn't indicating the document's UTF-8-ness in the HTTP headers"

Hmm, I got fooled by Firefox; it said UTF-8 :)

Adding `encoding => 'UTF-8',` to load_html also works.

On a related note, the encoding option doesn't work with parse_html_file/new, but load_html with location will gladly accept filenames/file paths.
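For anyone landing here later, this is roughly what the encoding option looks like in use. It is only a sketch: the HTML string is a made-up stand-in for the MSDN page (UTF-8 bytes with no charset declared anywhere), and it uses string instead of location so it runs offline.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical stand-in document: "a", the UTF-8 bytes of U+00A0, "b".
my $html = "<html><body><p>a\xC2\xA0b</p></body></html>";

my $doc = XML::LibXML->new( recover => 2 )->load_html(
    string   => $html,
    encoding => 'UTF-8',    # tell the parser how to read the bytes
);

# With the encoding stated, the text comes back as Perl characters,
# so printing through an ':encoding(UTF-8)' layer needs no decode().
my $text = $doc->findvalue('/html/body/p');

binmode STDOUT, ':encoding(UTF-8)';
print length($text), " characters\n";
```

The same encoding => 'UTF-8' pair should slot into the original load_html( location => ... ) call unchanged.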