zby has asked for the wisdom of the Perl Monks concerning the following question:

In the code below I have added a lot of diagnostics to show the problem, but the basic situation is as follows. I have a UTF-8 file that contains some HTML entities. The HTML::LinkExtractor parser decodes the entities as Latin-1 characters, so the output is a mix of UTF-8 and Latin-1, and I have no idea what to do about it.
use HTML::LinkExtractor;
use Encode qw/_utf8_on is_utf8/;
use open OUT => ':utf8';

my $utf8 = do { local $/; <DATA> };
my $LX = HTML::LinkExtractor->new(undef, undef, 1);
$LX->parse(\$utf8);
for my $l (@{$LX->links}){
    if($l->{tag} eq 'a'){
        my $character = substr($l->{_TEXT}, 0, 1);
        print "Character: $character\n";
        print "Code: ", ord($character) , "\n";
        print "utf8 flag: " , is_utf8($character), "\n";
        _utf8_on($character);
        print "Character: $character\n";
        print "Code: ", ord($character) , "\n";
        print "utf8 flag: " , is_utf8($character), "\n";
    }
}
__DATA__
<html>
<HEAD>
<META http-equiv=Content-Type content="text/html; charset=UTF-8">
</HEAD>
<a href="http://www.pl/">&oacute;</a>
</html>
__OUTPUT__
Character: ó
Code: 243
utf8 flag:
Wide character in print at a.pl line 15, <DATA> line 1.
Character: ó
Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xf3) in ord at a.pl line 16, <DATA> line 1.
Code: 0
utf8 flag: 1

Replies are listed 'Best First'.
Re: HTML::LinkExtractor interprets HTML entities according to the Latin1 encoding not the actual encoding
by PodMaster (Abbot) on Jun 24, 2005 at 16:10 UTC
    Since that module uses HTML::TokeParser for parsing the HTML, you should check the HTML::TokeParser documentation, which says:

    Note that the parsing result will likely not be valid if raw undecoded UTF-8 is used as a source. When parsing UTF-8 encoded files turn on UTF-8 decoding:

    open(my $fh, "<:utf8", "index.html") || die "Can't open 'index.html': $!";
    my $p = HTML::TokeParser->new( $fh );
    # ...

    If a $filename is passed to the constructor the file will be opened in raw mode and the parsing result will only be valid if its content is Latin-1 or pure ASCII.

    If parsing from an UTF-8 encoded string buffer decode it first:

    utf8::decode($document);
    my $p = HTML::TokeParser->new( \$document );
    # ...
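
To see why decoding first removes the mix, here is a minimal sketch that uses only core Perl and does not call HTML::LinkExtractor itself. It assumes the parser expands `&oacute;` to the single character `chr(0xF3)`, which matches the "Code: 243" line in the question's output; the variable names are made up for illustration.

```perl
use strict;
use warnings;

# The two-byte UTF-8 encoding of "ó" (U+00F3), as read raw from a UTF-8 file.
my $raw = "\xC3\xB3";

# Assumption: the single character a parser yields when it expands &oacute;.
my $from_entity = chr(0xF3);

# Undecoded, Perl sees the literal as two separate byte-characters.
my $raw_length = length($raw);

# Decode in place; utf8::decode returns false if the bytes are not valid UTF-8.
my $decoded = $raw;
utf8::decode($decoded) or die "input is not valid UTF-8";

# After decoding, the literal and the entity are the same character.
my $same = $decoded eq $from_entity ? 1 : 0;

printf "raw length: %d, decoded length: %d, code point: %d, matches entity: %d\n",
    $raw_length, length($decoded), ord($decoded), $same;
```

In the script from the question, the equivalent step would presumably be a `utf8::decode($utf8);` call right before `$LX->parse(\$utf8);`, with the `_utf8_on` call dropped. `_utf8_on` only flips the internal flag on a string whose single byte 0xF3 is not valid UTF-8 by itself, which is what the "Malformed UTF-8 character" warning is complaining about.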
