Re: HTML::LinkExtractor interprets HTML entities according to the Latin1 encoding not the actual encoding

Since that module uses HTML::TokeParser to parse the HTML, you should check the HTML::TokeParser documentation, which says:

Note that the parsing result will likely not be valid if raw undecoded UTF-8 is used as a source. When parsing UTF-8 encoded files turn on UTF-8 decoding:

open(my $fh, "<:utf8", "index.html") || die "Can't open 'index.html': $!";
my $p = HTML::TokeParser->new( $fh );
# ...

If a $filename is passed to the constructor the file will be opened in raw mode and the parsing result will only be valid if its content is Latin-1 or pure ASCII.

If parsing from an UTF-8 encoded string buffer decode it first:

utf8::decode($document);
my $p = HTML::TokeParser->new( \$document );
# ...
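To make the string-buffer case concrete, here is a minimal sketch that decodes a UTF-8 byte string before handing it to HTML::TokeParser directly. It uses Encode::decode as an alternative to utf8::decode, and the sample markup and the @links structure are made up for illustration; they are not HTML::LinkExtractor's own output format.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::TokeParser;

# Hypothetical input: raw UTF-8 bytes, as read from a file in raw mode
# or from a socket. It mixes a literal e-acute with &eacute; entities.
my $bytes = qq{<a href="/caf\xC3\xA9">Caf\xC3\xA9 &eacute;t&eacute;</a>};

# Decode the bytes into Perl characters *before* parsing, so that the
# literal characters and the decoded entities end up in the same form.
my $document = decode('UTF-8', $bytes);

my $p = HTML::TokeParser->new(\$document)
    or die "Can't create parser";

# Collect href and link text for every <a> tag.
my @links;
while (my $tag = $p->get_tag('a')) {
    my $href = $tag->[1]{href};    # $tag->[1] is the attribute hash
    next unless defined $href;
    my $text = $p->get_trimmed_text('/a');
    push @links, { href => $href, text => $text };
}

# Encode on the way out; the strings in @links are character strings.
binmode(STDOUT, ':encoding(UTF-8)');
print "$_->{href}\t$_->{text}\n" for @links;
```

Whether the same fix can be applied through HTML::LinkExtractor depends on how you feed it the document, so check how your call ends up constructing the underlying parser.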

MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
** The third rule of perl club is a statement of fact: pod is sexy.
