in reply to HTML::LinkExtractor interprets HTML entities according to the Latin1 encoding not the actual encoding
Note that the parsing result will likely not be valid if raw undecoded UTF-8 is used as a source. When parsing UTF-8 encoded files turn on UTF-8 decoding:
    open(my $fh, "<:utf8", "index.html") || die "Can't open 'index.html': $!";
    my $p = HTML::TokeParser->new( $fh );
    # ...

If a $filename is passed to the constructor the file will be opened in raw mode and the parsing result will only be valid if its content is Latin-1 or pure ASCII.
If parsing from a UTF-8 encoded string buffer, decode it first:
    utf8::decode($document);
    my $p = HTML::TokeParser->new( \$document );
    # ...
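To see what the decoding step changes before the buffer reaches the parser, here is a minimal sketch using only core Perl. The sample string is a made-up stand-in for a real document, and the variable names other than $document are my own. It only shows that utf8::decode turns UTF-8 octets into characters in place; it does not run the parser itself.

```perl
use strict;
use warnings;

# Hypothetical sample: the UTF-8 octets for "café" (é is 0xC3 0xA9).
my $document = "caf\xC3\xA9";

# Before decoding, length counts octets.
my $bytes_before = length $document;

# utf8::decode converts the octets to characters in place and
# returns false if the input is not well-formed UTF-8.
utf8::decode($document) or die "input is not valid UTF-8";

# After decoding, length counts characters.
my $chars_after = length $document;

printf "%d octets before, %d characters after\n", $bytes_before, $chars_after;
```

Without the decode step, the two octets of the é would be handed to the parser as two separate Latin-1 characters, which is the kind of mismatch the quoted documentation warns about.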