And here is an encoding-related bug in HTML::HTML5::ToText or HTML::HTML5::Parser:
#!/usr/bin/perl --
use Test::More tests => 1;
use HTML::HTML5::Parser;
use HTML::HTML5::ToText;

my $input = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\r\n<head>\r\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=ISO-8859-1\" />\r\n<title> \x93literal smart quotes\x94 </title>\r\n<body><p> num-ent apostrophe ’ </p>\n<p> num-ent double-dash – </p></body>\r\n</html>";

my $dom = HTML::HTML5::Parser->load_html( string => \$input );
my $str = HTML::HTML5::ToText->with_traits(qw/TextFormatting ShowLinks ShowImages/)->process($dom);

#~ use Data::Dump qw/ pp /; warn pp($str);
#~ "\x{201C}literal smart quotes\x{201D}\n\nnum-ent apostrophe \xE2\x80\x99\n\nnum-ent double-dash \xE2\x80\x93\n";

my $expected = "\x{201C}literal smart quotes\x{201D}\n\nnum-ent apostrophe \N{U+2019}\n\nnum-ent double-dash \N{U+2013}\n";
is $str, $expected;
\xE2\x80\x93 is the UTF-8 encoding of \N{U+2013}, but it's a byte string that gets appended to a Perl character (Unicode) string, and so the result is corrupted.
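To illustrate the mechanism in isolation, here is a minimal sketch using only the core Encode module. It does not show where inside HTML::HTML5::Parser or HTML::HTML5::ToText the mix-up happens (that is not established here); the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Character string: a single code point, U+2013 (en dash).
my $chars = "\x{2013}";

# Byte string: the three UTF-8 octets that encode U+2013.
my $bytes = "\xE2\x80\x93";

# Concatenation treats each octet as its own (Latin-1) character,
# so the result holds 4 characters instead of 2 en dashes.
my $mixed = $chars . $bytes;

# Decoding the octets first gives a proper character string.
my $fixed = $chars . decode('UTF-8', $bytes);

printf "mixed: %d chars, fixed: %d chars\n", length($mixed), length($fixed);
# prints "mixed: 4 chars, fixed: 2 chars"
```

If the octets are decoded before they are joined to the character string, the comparison against \N{U+2013} in the test above would be expected to succeed.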
Re^2: HTML::HTML5::Parser entities/utf-bytes/unicode isssue
by Anonymous Monk on Apr 17, 2013 at 16:54 UTC