Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
Is there an option I can flip or am I stuck hand-decoding this?
use XML::LibXML;
binmode STDOUT, ':encoding(UTF-8)';
my $str = XML::LibXML->new( qw/ recover 2 / )
    ->load_html(
        location => q{http://msdn.microsoft.com/en-us/library/aa664812(v=vs.71).aspx},
    )
    ->find( q{/html/body/div/div[2]/div[2]/div[3]/div[3]/dl[15]/dd[29] } )
    ->get_node(0)
    ->textContent;
print $str;
Running the above, the nbsp seems to be double-encoded (it shows up as an a-circumflex).
It seems as if libxml is returning UTF-8 bytes, but the string isn't marked as UTF-8?
Adding use Encode; print decode('UTF-8', $str ); seems to resolve this, but I thought libxml could handle this without my help.
What am I missing?
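For reference, the symptom can be reproduced without XML::LibXML at all. The following is only a minimal sketch of double-encoding in general, not a claim about what libxml does internally; the helper hex_bytes and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: show each character of a string as a hex byte value.
sub hex_bytes {
    my ($s) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $s;
}

# U+00A0 (no-break space) as a Perl character string.
my $chars = "\x{A0}";

# The same character as undecoded UTF-8 bytes: 0xC2 0xA0.
my $bytes = encode('UTF-8', $chars);

# Encoding those bytes again is what an :encoding(UTF-8) output layer does
# when handed undecoded bytes: each byte is treated as a character.
my $double = encode('UTF-8', $bytes);

# Decoding once, as in the workaround, yields the single character again.
my $fixed = decode('UTF-8', $bytes);

print 'bytes:   ', hex_bytes($bytes),  "\n";   # C2 A0
print 'double:  ', hex_bytes($double), "\n";   # C3 82 C2 A0
printf "decoded: U+%04X\n", ord($fixed);       # U+00A0
```

The leading C3 82 in the double-encoded output is U+00C2, the a-circumflex mentioned above.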
Re: ':encoding(UTF-8)' corrupts strings from XML::LibXML which doesn't return unicode strings ?
by tobyink (Canon) on Feb 28, 2013 at 23:24 UTC
by Anonymous Monk on Feb 28, 2013 at 23:39 UTC