in reply to ':encoding(UTF-8)' corrupts strings from XML::LibXML which doesn't return unicode strings ?

It may be because the Microsoft website isn't indicating the document's UTF-8-ness in the HTTP headers. If you do the HTTP fetch outside XML::LibXML (using LWP::Simple), all is OK...

use LWP::Simple 'get';
use XML::LibXML;

binmode STDOUT, ':encoding(UTF-8)';

# Fetch the page with LWP::Simple, then hand the resulting string to the parser
my $str = XML::LibXML->new( qw/ recover 2 / )->load_html(
    string => get q{http://msdn.microsoft.com/en-us/library/aa664812(v=vs.71).aspx},
)->find(
    q{/html/body/div/div[2]/div[2]/div[3]/div[3]/dl[15]/dd[29]}
)->get_node(0)->textContent;

print $str;

Re^2: ':encoding(UTF-8)' corrupts strings from XML::LibXML which doesn't return unicode strings ?
by Anonymous Monk on Feb 28, 2013 at 23:39 UTC

    It may be because the Microsoft website isn't indicating the document's UTF-8-ness in the HTTP headers

    Hmm, I got fooled by Firefox, which reported the page as UTF-8 :)

    Adding encoding => 'UTF-8' to the load_html call also works.
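
A minimal sketch of that variant, assuming the input is a string of raw UTF-8 bytes with no charset declaration. The sample HTML and the variable names are made up for illustration and are not taken from the MSDN page:

```perl
use strict;
use warnings;
use XML::LibXML;

binmode STDOUT, ':encoding(UTF-8)';

# Illustrative input: UTF-8 encoded bytes ("café") with no charset declaration
my $html = "<html><body><p>caf\xC3\xA9</p></body></html>";

# Tell the parser explicitly how the bytes are encoded
my $doc = XML::LibXML->load_html(
    string   => $html,
    encoding => 'UTF-8',
    recover  => 2,
);

# The text comes back as a Perl character string, so the output layer can encode it
my $text = $doc->findvalue('/html/body/p');
print "$text\n";
```

With the encoding stated up front, the parser does not have to guess, and the ':encoding(UTF-8)' layer on STDOUT encodes the characters once on the way out.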

    On a related note, the encoding option doesn't work with parse_html_file or new, but load_html with the location option will gladly accept filenames and file paths.
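
A hedged sketch of the location form, assuming a local file containing UTF-8 bytes. The temporary file and its contents are hypothetical and exist only to keep the example self-contained:

```perl
use strict;
use warnings;
use File::Temp 'tempfile';
use XML::LibXML;

binmode STDOUT, ':encoding(UTF-8)';

# Write some UTF-8 encoded bytes ("naïve") to a throwaway file
my ($fh, $path) = tempfile( SUFFIX => '.html', UNLINK => 1 );
binmode $fh;
print {$fh} "<html><body><p>na\xC3\xAFve</p></body></html>";
close $fh or die "close failed: $!";

# location takes a file path here; encoding is passed alongside it
my $doc = XML::LibXML->load_html(
    location => $path,
    encoding => 'UTF-8',
    recover  => 2,
);

my $file_text = $doc->findvalue('/html/body/p');
print "$file_text\n";
```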