in reply to HTML::LinkExtractor interprets HTML entities according to the Latin1 encoding not the actual encoding

Since that module uses HTML::TokeParser to parse the HTML, you should check the HTML::TokeParser documentation, which says:

Note that the parsing result will likely not be valid if raw undecoded UTF-8 is used as a source. When parsing UTF-8 encoded files turn on UTF-8 decoding:

    open(my $fh, "<:utf8", "index.html") || die "Can't open 'index.html': $!";
    my $p = HTML::TokeParser->new( $fh );
    # ...

If a $filename is passed to the constructor the file will be opened in raw mode and the parsing result will only be valid if its content is Latin-1 or pure ASCII.

If parsing from an UTF-8 encoded string buffer decode it first:

    utf8::decode($document);
    my $p = HTML::TokeParser->new( \$document );
    # ...
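To make the "decode it first" advice concrete, here is a minimal sketch that pulls links out of a UTF-8 buffer. The sample markup and the @links structure are made up for illustration, and it uses Encode's decode() rather than utf8::decode(); for well-formed UTF-8 input either should work.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::TokeParser;

# Hypothetical sample: raw UTF-8 bytes, as you would get from a file
# or socket read without an encoding layer ("\xC3\xA9" is e-acute).
my $bytes = qq{<a href="/caf\xC3\xA9?x=1&amp;y=2">caf\xC3\xA9 &eacute;</a>};

# Turn the bytes into Perl characters *before* the parser sees them,
# so literal characters and decoded entities end up in the same form.
my $document = decode('UTF-8', $bytes);

my $p = HTML::TokeParser->new(\$document)
    or die "Can't create parser";

my @links;
while (my $tag = $p->get_tag('a')) {
    my $href = $tag->[1]{href};              # attribute hash is element 1
    next unless defined $href;
    my $text = $p->get_trimmed_text('/a');   # text up to the closing </a>
    push @links, { href => $href, text => $text };
}

# Encode again on the way out so the terminal gets valid UTF-8.
binmode STDOUT, ':encoding(UTF-8)';
print "$_->{href}\t$_->{text}\n" for @links;
```

If you skip the decode() step, the two bytes of each literal e-acute are treated as two Latin-1 characters while &eacute; still decodes to a single character, which is the mismatch described in the original question.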

