Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

XML::LibXML::Parser/XML::LibXML::Node

Is there an option I can flip or am I stuck hand-decoding this?

use XML::LibXML;
binmode STDOUT, ':encoding(UTF-8)';
my $str = XML::LibXML->new( qw/ recover 2 / )->load_html(
    location => q{http://msdn.microsoft.com/en-us/library/aa664812(v=vs.71).aspx},
)->find( q{/html/body/div/div[2]/div[2]/div[3]/div[3]/dl[15]/dd[29]} )->get_node(0)->textContent;
print $str;

When I run the code above, the nbsp comes out double-encoded (it shows up as an a-circumflex, Â).

It looks as if libxml is returning UTF-8 bytes, but the string isn't marked as UTF-8?

Adding use Encode; print decode('UTF-8', $str); seems to resolve this, but I thought libxml could handle this without my help.

What am I missing?
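
The symptom described above can be reproduced without XML::LibXML or a network fetch. The following is only a minimal sketch, assuming the core Encode module; the hex_bytes helper and the variable names are hypothetical, and the explicit encode() call stands in for what the ':encoding(UTF-8)' layer does on print.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes { join ' ', map { sprintf '%02X', ord($_) } split //, $_[0] }

# U+00A0 (no-break space) as raw UTF-8 bytes, i.e. an undecoded string.
my $bytes = "\xC2\xA0";

# Encoding the undecoded bytes again treats each byte as its own character:
# this is the double encoding, and 0xC3 0x82 is what displays as "Â".
my $double = encode('UTF-8', $bytes);

# Decoding first yields a single character, so encoding on output is correct.
my $fixed = encode('UTF-8', decode('UTF-8', $bytes));

print 'bytes:  ', hex_bytes($bytes),  "\n";   # C2 A0
print 'double: ', hex_bytes($double), "\n";   # C3 82 C2 A0
print 'fixed:  ', hex_bytes($fixed),  "\n";   # C2 A0
```

If the string coming back from the parser behaves like $bytes here, that would explain why decode('UTF-8', $str) makes the output look right.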


Re: ':encoding(UTF-8)' corrupts strings from XML::LibXML which doesn't return unicode strings ?
by tobyink (Canon) on Feb 28, 2013 at 23:24 UTC

    It may be because the Microsoft website isn't indicating the document's UTF-8-ness in the HTTP headers. If you do the HTTP fetch outside XML::LibXML (using LWP::Simple), all is OK...

    use LWP::Simple 'get';
    use XML::LibXML;
    binmode STDOUT, ':encoding(UTF-8)';
    my $str = XML::LibXML->new( qw/ recover 2 / )->load_html(
        string => get q{http://msdn.microsoft.com/en-us/library/aa664812(v=vs.71).aspx},
    )->find( q{/html/body/div/div[2]/div[2]/div[3]/div[3]/dl[15]/dd[29]} )->get_node(0)->textContent;
    print $str;

      It may be because the Microsoft website isn't indicating the document's UTF-8-ness in the HTTP headers

      Hmm, I got fooled by Firefox, it said UTF-8 :)

      Adding encoding => 'UTF-8', to load_html also works

      On a related note, the encoding option doesn't work with parse_html_file/new, but load_html's location option will gladly accept filenames/filepaths
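
The encoding option mentioned above can be tried offline. This is only a sketch under stated assumptions: the inline $html string is a hypothetical stand-in for the fetched page (UTF-8 bytes with no charset declaration), and the //p XPath is made up for the example rather than taken from the MSDN document.

```perl
use strict;
use warnings;
use XML::LibXML;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical stand-in for the page: raw UTF-8 bytes, no charset declared.
# "\xC2\xA0" is the UTF-8 encoding of U+00A0 (no-break space).
my $html = "<html><body><p>a\xC2\xA0b</p></body></html>";

# Tell the HTML parser explicitly how the input bytes are encoded.
my $doc = XML::LibXML->new( recover => 2 )->load_html(
    string   => $html,
    encoding => 'UTF-8',
);

# The text should come back as a character string: 'a', U+00A0, 'b'.
my $text = $doc->findvalue('//p');
printf "length=%d middle=U+%04X\n", length($text), ord( substr $text, 1, 1 );
print $text, "\n";
```

With the encoding given up front, no separate decode() call should be needed before printing through the ':encoding(UTF-8)' layer.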