in reply to Can't locate object method "#COMMENT" via package "MooseX::Traits::__ANON__::SERIAL::1" at HTML/HTML5/ToText.pm line 129, <STDIN> line 3016.

And here is an encoding-related bug in HTML::HTML5::ToText or HTML::HTML5::Parser:

#!/usr/bin/perl --
use Test::More tests => 1;
use HTML::HTML5::Parser;
use HTML::HTML5::ToText;

my $input = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\r\n<head>\r\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=ISO-8859-1\" />\r\n<title> \x93literal smart quotes\x94 </title>\r\n<body><p> num-ent apostrophe &#8217; </p>\n<p> num-ent double-dash &#8211; </p></body>\r\n</html>";

my $dom = HTML::HTML5::Parser->load_html( string => \$input );
my $str = HTML::HTML5::ToText->with_traits(qw/TextFormatting ShowLinks ShowImages/)->process($dom);

#~ use Data::Dump qw/ pp /; warn pp($str);
#~ "\x{201C}literal smart quotes\x{201D}\n\nnum-ent apostrophe \xE2\x80\x99\n\nnum-ent double-dash \xE2\x80\x93\n";

my $expected = "\x{201C}literal smart quotes\x{201D}\n\nnum-ent apostrophe \N{U+2019}\n\nnum-ent double-dash \N{U+2013}\n";
is $str, $expected;

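\xE2\x80\x93 is the UTF-8 encoding of \N{U+2013}, but it's a byte string that gets appended to a Perl character (utf8) string, so it ends up corrupted: the three bytes become three separate characters instead of one en dash.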
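
A minimal sketch of that kind of corruption, independent of either module. The variable names are made up for illustration, and it only assumes that undecoded UTF-8 bytes get concatenated onto a character string somewhere along the way; it does not show where in the parser that happens.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Character string: holds code points above 0xFF, so Perl treats it as characters.
my $chars = "\x{201C}quoted\x{201D} ";

# Byte string: the UTF-8 encoding of U+2013 (EN DASH), not yet decoded.
my $bytes = "\xE2\x80\x93";

# Concatenation upgrades each byte to its own character
# (U+00E2, U+0080, U+0093) rather than producing a single U+2013.
my $mixed = $chars . $bytes;

# Decoding the bytes first yields the intended single character.
my $fixed = $chars . decode('UTF-8', $bytes);

printf "mixed: %d chars, fixed: %d chars\n", length($mixed), length($fixed);
# prints "mixed: 12 chars, fixed: 10 chars"
```

The two extra characters in $mixed are the same symptom as the \xE2\x80\x99 and \xE2\x80\x93 sequences in the dumped output above.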


Re^2: HTML::HTML5::Parser entities/utf-bytes/unicode issue
by Anonymous Monk on Apr 17, 2013 at 16:54 UTC

    use Data::Dump qw/ dd pp /; die pp $dom->textContent;

    shows it's an HTML::HTML5::Parser issue