HTML::LinkExtractor interprets HTML entities according to the Latin1 encoding not the actual encoding

zby has asked for the wisdom of the Perl Monks concerning the following question:

In the code below I added a lot of diagnostics to show the problem, but the basic situation is as follows. I have a UTF-8 file containing some HTML entities. The HTML::LinkExtractor parser decodes the HTML entities as Latin-1 characters, so the output is a mix of UTF-8 and Latin-1 and I have no idea what to do with it.

use HTML::LinkExtractor;
use Encode qw/_utf8_on is_utf8/;
use open OUT => ':utf8';

my $utf8 = do { local $/; <DATA> };
my $LX = HTML::LinkExtractor->new(undef, undef, 1);
$LX->parse(\$utf8);
for my $l (@{$LX->links}){
    if($l->{tag} eq 'a'){
        my $character = substr($l->{_TEXT}, 0, 1);
        print "Character: $character\n";
        print "Code: ", ord($character) , "\n";
        print "utf8 flag: " , is_utf8($character), "\n";
        _utf8_on($character);
        print "Character: $character\n";
        print "Code: ", ord($character) , "\n";
        print "utf8 flag: " , is_utf8($character), "\n";
    }
}

__DATA__

<html>
<HEAD>
<META http-equiv=Content-Type content="text/html; charset=UTF-8">
</HEAD>
<a href="http://www.pl/">&oacute;</a>
</html>

__OUTPUT__
Character: ó
Code: 243
utf8 flag:
Wide character in print at a.pl line 15, <DATA> line 1.
Character: ó
Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xf3) in ord at a.pl line 16, <DATA> line 1.
Code: 0
utf8 flag: 1



Re: HTML::LinkExtractor interprets HTML entities according to the Latin1 encoding not the actual encoding
by PodMaster (Abbot) on Jun 24, 2005 at 16:10 UTC

See the HTML::TokeParser documentation, which says:

Note that the parsing result will likely not be valid if raw undecoded UTF-8 is used as a source. When parsing UTF-8 encoded files turn on UTF-8 decoding:

open(my $fh, "<:utf8", "index.html") || die "Can't open 'index.html': $!";
my $p = HTML::TokeParser->new( $fh );
# ...

If a $filename is passed to the constructor the file will be opened in raw mode and the parsing result will only be valid if its content is Latin-1 or pure ASCII.

If parsing from a UTF-8 encoded string buffer, decode it first:

utf8::decode($document);
my $p = HTML::TokeParser->new( \$document );
# ...
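To tie the advice above back to the original question, here is a minimal sketch of the "decode first" approach. It assumes HTML::TokeParser (from the HTML::Parser distribution) is installed. The sample markup is an illustrative variation on the question's data: a literal UTF-8 encoded character has been added next to the entity, and the variable names are made up for this example. It has not been checked against HTML::LinkExtractor itself.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::TokeParser;

# Raw UTF-8 bytes, as they would arrive from a file read in raw mode.
# "\xC3\xB3" is the UTF-8 encoding of U+00F3 (o with acute accent),
# placed next to the entity that names the same character.
my $bytes = qq{<a href="http://www.pl/">&oacute;\xC3\xB3</a>};

# Turn the bytes into Perl characters before the parser sees them.
my $document = decode('UTF-8', $bytes);

my $p = HTML::TokeParser->new(\$document)
    or die "Cannot create parser";

# Collect the text between <a> and </a>; entities are decoded here.
my $link_text = '';
if (my $tag = $p->get_tag('a')) {
    $link_text = $p->get_trimmed_text('/a');
}

# Report each character of the link text as a code point.
printf "U+%04X\n", ord($_) for split //, $link_text;
```

With the input decoded up front, both the entity and the literal character should come out as the same code point, so this should print U+00F3 twice rather than a mix of Latin-1 and UTF-8 bytes. When printing the text itself rather than code points, an output layer such as the question's `use open OUT => ':utf8'` is still needed.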

