Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
We use modules like XML::RSS for other projects, but in this case we have to parse the feed by hand. The question concerns the triple entity-decoding step in the code below. From our perspective this is not an actual problem, because it seems to work :-) However, before we run into quadruple-encoded documents, it would be wise to find a reliable way to test for the presence of entities in a string before decoding it, so that we can recurse.
    #!/usr/bin/perl -w
    use strict;
    use Encode;
    use HTML::Entities;
    use HTTP::Request;
    use LWP::UserAgent;

    my $url = shift || 'http://perlmonks.com/headlines.rdf';

    # Fetch the feed and keep only the response body.
    my $ua  = LWP::UserAgent->new();
    my $req = HTTP::Request->new(GET => $url);
    my $res = $ua->request($req);
    die $res->status_line unless $res->is_success;
    $res = $res->content;

    # Pull out each <title> element in turn, removing it from the buffer.
    while ($res =~ s,<title[^>]*>\s*(.*?)\s*</title>,,si) {
        my $title = $1 || '';
        next unless $title;

        # Three entity-decoding passes, then interpret the result as UTF-8.
        $title = HTML::Entities::decode($title);
        $title = HTML::Entities::decode($title);
        $title = HTML::Entities::decode($title);
        $title = Encode::decode_utf8($title);

        # Placeholder pattern standing in for the real clean-up step.
        $title =~ s/strip_stuff_like_html_and_cdata_tags//g;

        # Re-encode once for output.
        print HTML::Entities::encode($title), "\n";
    }
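One way the fixed three passes might be made data-driven, offered only as a sketch: rather than testing for entities with a regex (which could loop forever on an unrecognized name such as `&foo;`, since HTML::Entities leaves those alone), decode repeatedly and stop as soon as a pass changes nothing. The helper name `decode_fully` and the cap of 10 passes are assumptions made for illustration, not part of HTML::Entities.

```perl
use strict;
use warnings;
use HTML::Entities ();

# Hypothetical helper: run entity decoding until the string stops changing.
# $max_passes is an arbitrary safety cap (assumed default: 10).
sub decode_fully {
    my ($text, $max_passes) = @_;
    $max_passes = 10 unless defined $max_passes;

    for (1 .. $max_passes) {
        # In scalar context decode() returns a decoded copy
        # and leaves $text untouched.
        my $decoded = HTML::Entities::decode($text);
        last if $decoded eq $text;    # nothing left to decode
        $text = $decoded;
    }
    return $text;
}

# A triple-encoded "<b>" comes back as the bare tag.
print decode_fully('&amp;amp;lt;b&amp;amp;gt;'), "\n";    # prints <b>
```

The caveat is that this cannot tell accidental multiple encoding from intentional content: a title that is really meant to display the literal text `&lt;` would be decoded one level too far. Whether that matters depends on the feeds being read.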