tridral has asked for the wisdom of the Perl Monks concerning the following question:
I'm using HTML::TokeParser to read meta tags and titles. Where HTML entities have been used, they mostly come through unscathed as the characters they represent (eacute and egrave, for example). However, rsquo and lsquo get mangled: rsquo comes out as the three characters a-circumflex, trademark, euro symbol. Has anyone else seen this kind of behaviour, and is there any way of solving it?
Code snippets:
    # Get a token
    $tok_inf = $tok_par->get_token;
    # Get the type, e.g. 'S' for Start
    $tok_typ = shift @{$tok_inf};
    # If it's a start tag, get more information, e.g. name is 'title'
    ($tag_nam, $tag_att, $tag_seq, $tag_raw) = @{ $tok_inf };
    # Get the title
    $title = $tok_par->get_text() || "<NO TITLE FOUND>";
And it's the value in $title that's odd if rsquo and lsquo are used in the page title.
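To make this easier to reproduce, here is a minimal self-contained sketch of the same steps. The sample HTML string and the loop around get_token are assumptions added for illustration; the real script reads full pages. It prints the code point of each character in $title rather than the characters themselves, so the result does not depend on the terminal or output encoding.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample page (an assumption, not the real input).
my $html = '<html><head><title>Caf&eacute; &lsquo;quoted&rsquo;</title></head></html>';

# Passing a scalar reference makes the parser read from the string.
my $tok_par = HTML::TokeParser->new(\$html) or die "Cannot create parser";

my $title = "<NO TITLE FOUND>";
while (my $tok_inf = $tok_par->get_token) {
    # Get the type, e.g. 'S' for Start
    my $tok_typ = shift @{$tok_inf};
    next unless $tok_typ eq 'S';

    # For a start tag the remaining fields are name, attributes, etc.
    my ($tag_nam, $tag_att, $tag_seq, $tag_raw) = @{$tok_inf};
    if ($tag_nam eq 'title') {
        # get_text returns the text up to the next tag, entities decoded
        $title = $tok_par->get_text() || "<NO TITLE FOUND>";
        last;
    }
}

# Print one code point per character, in ASCII only.
print join(' ', map { sprintf 'U+%04X', ord } split //, $title), "\n";
```

If rsquo and lsquo show up here as single code points rather than as three separate characters, the decoding step is probably fine and the mangling may be happening later, when $title is printed or stored.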
Many thanks for any help that you can offer.