While parsing titles out of RSS feeds by hand with Perl, we have seen a wide variety of encoded and unencoded content in the wild. We use several modules to capture, manipulate and display proper HTML from these sources. Before working on the text inside title tags, we have found it necessary to decode it at least three times with HTML::Entities::decode before calling Encode::decode_utf8. At that point we can strip things we don't want, such as HTML and CDATA tags. Finally, HTML::Entities::encode is applied before the text is displayed as HTML.
We use modules like XML::RSS for other projects, but have to parse manually in this case. The question here is about the triple decoding step. From our perspective this is not an actual problem, because it seems to work :-) However, before we run into quadruple-encoded documents, it might be wise to find a reliable way to test for the presence of entities in a string before decoding it, so we can recurse until none remain.
#!/usr/bin/perl -w
use strict;
use Encode;
use HTML::Entities;
use HTTP::Request;
use LWP::UserAgent;
my $url = shift || 'http://perlmonks.com/headlines.rdf';
my $ua = LWP::UserAgent->new();
my $req = HTTP::Request->new(GET => $url);
my $res = $ua->request($req);
die $res->status_line unless $res->is_success;
$res = $res->content;
while ($res =~ s,<title[^>]*>\s*(.*?)\s*</title>,,si) {
    my $title = $1 || '';
    next unless $title;
    $title = HTML::Entities::decode($title);
    $title = HTML::Entities::decode($title);
    $title = HTML::Entities::decode($title);
    $title = Encode::decode_utf8($title);
    $title =~ s/strip_stuff_like_html_and_cdata_tags//g;
    print HTML::Entities::encode($title), "\n";
}
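One possible way to avoid hard-coding three passes is to skip testing for entities with a regex and simply decode until the string stops changing. The sketch below assumes that approach; `decode_entities_fully` and its `$max_passes` cap are hypothetical names introduced here for illustration, not part of the script above. It relies on `HTML::Entities::decode_entities` (the documented name for `decode`) returning a decoded copy when it is not called in void context.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities ();

# Hypothetical helper (not from the original script): decode entities
# repeatedly until a pass leaves the string unchanged, or until
# $max_passes passes have been made.
sub decode_entities_fully {
    my ($text, $max_passes) = @_;
    $max_passes = 10 unless defined $max_passes;    # arbitrary safety cap

    for my $pass (1 .. $max_passes) {
        # Called in scalar context, so $text itself is not modified.
        my $decoded = HTML::Entities::decode_entities($text);
        return $text if $decoded eq $text;          # nothing left to decode
        $text = $decoded;
    }
    return $text;    # cap reached; the string may still contain entities
}

# A triple-encoded "<b>" needs three passes; the fourth confirms it is stable.
print decode_entities_fully('&amp;amp;lt;b&amp;amp;gt;'), "\n";    # prints <b>
```

One caveat with this design: a title that is legitimately about entities (for example, one that should display the literal text `&amp;`) would be decoded one level too far, since nothing in the string says how many times it was encoded. The cap only guards against pathological input; it does not solve that ambiguity.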