zby has asked for the wisdom of the Perl Monks concerning the following question:
use HTML::LinkExtractor;
use Encode qw/_utf8_on is_utf8/;
use open OUT => ':utf8';

my $utf8 = do { local $/; <DATA> };
my $LX = HTML::LinkExtractor->new(undef, undef, 1);
$LX->parse(\$utf8);
for my $l (@{$LX->links}){
    if($l->{tag} eq 'a'){
        my $character = substr($l->{_TEXT}, 0, 1);
        print "Character: $character\n";
        print "Code: ", ord($character) , "\n";
        print "utf8 flag: " , is_utf8($character), "\n";
        _utf8_on($character);
        print "Character: $character\n";
        print "Code: ", ord($character) , "\n";
        print "utf8 flag: " , is_utf8($character), "\n";
    }
}
__DATA__
<html>
<HEAD>
<META http-equiv=Content-Type content="text/html; charset=UTF-8">
</HEAD>
<a href="http://www.pl/">ó</a>
</html>
__OUTPUT__
Character: ó
Code: 243
utf8 flag:
Wide character in print at a.pl line 15, <DATA> line 1.
Character: ó
Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xf3) in ord at a.pl line 16, <DATA> line 1.
Code: 0
utf8 flag: 1
Re: HTML::LinkExtractor interprets HTML entities according to the Latin1 encoding not the actual encoding
by PodMaster (Abbot) on Jun 24, 2005 at 16:10 UTC