Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

A wide variety of encoded and unencoded content has been observed in the wild while parsing titles out of RSS feeds by hand with Perl. A number of modules are used to capture, manipulate, and display proper HTML from these sources. Before working on the text found inside title tags, it has proven necessary to decode that text at least three times with HTML::Entities::decode before performing Encode::decode_utf8. At that point we can strip things we don't want, such as HTML and CDATA tags. Finally, HTML::Entities::encode is used before display as HTML.

We use modules like XML::RSS for other projects, but have to do it by hand in this case. The question here concerns the triple-decoding process. From our perspective this is not an actual problem, because it seems to work :-) However, before we find quadruple encoded documents it might be wise to find a reliable way to test for the presence of entities in a string before decoding it, so we can recurse.

#!/usr/bin/perl -w
use strict;
use Encode;
use HTML::Entities;
use HTTP::Request;
use LWP::UserAgent;

my $url = shift || 'http://perlmonks.com/headlines.rdf';
my $ua  = LWP::UserAgent->new();
my $req = HTTP::Request->new( GET => $url );
my $res = $ua->request( $req );
die $res->status_line unless $res->is_success;
$res = $res->content;

while ( $res =~ s,<title[^>]*>\s*(.*?)\s*</title>,,si ) {
    my $title = $1 || '';
    next unless $title;
    $title = HTML::Entities::decode( $title );
    $title = HTML::Entities::decode( $title );
    $title = HTML::Entities::decode( $title );
    $title = Encode::decode_utf8( $title );
    $title =~ s/strip_stuff_like_html_and_cdata_tags//g;
    print HTML::Entities::encode( $title ), "\n";
}
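As for testing whether a string still contains entities before decoding again, here is a heuristic sketch (the `looks_encoded` helper and its regex are illustrative, not part of the original code). Note that it cannot tell a literal "AT&amp;T" apart from an over-encoded "&", which is the crux of the problem:

```perl
use strict;
use warnings;
use HTML::Entities;

# Heuristic: does the string contain anything shaped like an entity?
# This is a guess, not a guarantee -- a literal "&amp;" in the text
# looks exactly like an encoded "&".
sub looks_encoded {
    my ($s) = @_;
    return $s =~ /&(?:#\d+|#x[0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);/;
}

my $title  = '&amp;amp;lt;b&amp;amp;gt;';   # triple-encoded "<b>"
my $rounds = 0;
while ( looks_encoded($title) ) {
    $title = HTML::Entities::decode($title);
    $rounds++;
}
print "$title after $rounds rounds\n";
```

This replaces the fixed triple decode with a loop, so it also copes with quadruple-encoded input, at the cost of sometimes over-decoding legitimate text.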

Replies are listed 'Best First'.
Re: HTML from single, double and triple encoded entities in RSS documents
by Aristotle (Chancellor) on Jan 07, 2006 at 19:15 UTC

    before we find quadruple encoded documents it might be wise to find a reliable way to test for the presence of entities in a string before decoding it, so we can recurse.

    There is no reliable way.

    A wide variety of encoded and unencoded content has been observed in the wild while parsing titles out of RSS feeds by hand with Perl.

    Yes. And it’s impossible to handle all feeds correctly.

    Give up.

    Generally, RSS titles should not contain markup. So per spec, they should be unescaped only once (which, if you were doing the right thing and using an XML parser, instead of groping around with a regex, would already have happened by the time you get the data). However, practically everyone double-encodes their titles, which allows carrying markup through them. Triple-encoded titles would be a bug; though I would not be surprised if that were slightly common (enough so that one would need to worry about it, that is).
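    To illustrate the double-encoding convention described above (the wire string here is invented, and HTML::Entities::decode stands in for the XML parser's unescaping, since &amp;, &lt; and &gt; are also XML's predefined entities):

```perl
use strict;
use warnings;
use HTML::Entities;

# What a double-encoding producer puts on the wire inside <title>:
my $wire = '&amp;lt;b&amp;gt;bold&amp;lt;/b&amp;gt;';

# The XML parser performs the one unescape the spec mandates,
# which leaves escaped HTML markup behind:
my $after_xml = HTML::Entities::decode($wire);    # '&lt;b&gt;bold&lt;/b&gt;'

# The consumer must *guess* that a second decode is intended:
my $html = HTML::Entities::decode($after_xml);    # '<b>bold</b>'
print "$html\n";
```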

    This and more are reasons why Atom (RFC 4287) was conceived: to provide a well-specified content model so that it’s always clear whether the producer or consumer of content is at fault when the data is misencoded.
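    For comparison, a minimal sketch of how Atom's text constructs make the intent explicit via the `type` attribute (the titles below are invented, modelled on the examples in RFC 4287):

```xml
<!-- type states exactly how the XML-unescaped content is to be read -->
<title type="text">AT&amp;T rocks</title>
<!-- unescapes to the plain text: AT&T rocks -->

<title type="html">AT&amp;amp;T &amp;lt;b&amp;gt;rocks</title>
<!-- unescapes to escaped HTML: AT&amp;T &lt;b&gt;rocks -->
```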

    RSS does not afford such clarity. You simply don’t know what the data means. It’s mindboggling, I know, but true. Quoth <cite>Phil Ringnalda</cite>:

    I can’t believe how many times I have to relearn this fact. It must be a survival instinct, that makes me keep forgetting about this huge impossible to shift elephant in the middle of the room.

    If you need to use the character “<” in a feed title, which I only sort-of do in my weblog, but which another rather large project I’m peripherally involved with absolutely does, you have three choices: produce valid RSS, which will fail with the classic “silent data loss” in virtually every reader currently available; knowingly produce invalid RSS, because it will work perfectly in virtually every reader and will not fail silently in the remaining ones; or, the only happy choice, use Atom instead, since this problem is actually one of the primary reasons it started.


    Of course, none of this helps you if you need to write software to consume RSS… but much as I wish I could say something which would, you’re simply out of luck.

    Welcome to the world of RSS.

    Makeshifts last the longest.

        Agreed that it’s bad; I’ve only recently linked that article myself. But there’s nothing left to do about its unfortunate adoption in RSS, so the question is: faced with the reality of escaped markup, how do you parse it?

        Of course that would be easy to answer, if only there were a way to really know what is actual escaped markup and what is text.


Re: HTML from single, double and triple encoded entities in RSS documents
by BrowserUk (Patriarch) on Jan 07, 2006 at 16:21 UTC

    Why not just iterate until the length doesn't change anymore?

    my( $l1, $l2 ) = length( $text );
    $l1 = $l2
        while ( $l2 = length( $text = HTML::Entities::decode( $text ) ) ) < $l1;

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Because then you will turn AT&amp;T into AT&T, which is invalid. And while you might not care because tagsoup rendering will still produce something readable, you’ll probably care that &lt;grin> will turn into <grin>, causing the browser to silently ignore it as an unknown tag.

      What you are proposing is intentional silent data loss.
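      To see the loss concretely, here is a minimal sketch of decoding to a fixed point applied to a title that legitimately carries one level of encoding (the title string is invented):

```perl
use strict;
use warnings;
use HTML::Entities;

my $title = 'AT&amp;T says &lt;grin>';   # one legitimate level of encoding

# Decode until nothing changes:
my $prev;
do {
    $prev  = $title;
    $title = HTML::Entities::decode($title);
} until $title eq $prev;

# The intended "&" and "<" survive only as bare, markup-significant
# characters -- the encoding that protected them is gone.
print "$title\n";    # AT&T says <grin>
```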


        Is that how you read the OP's intent? I thought about it, but if the requirement is to retain the final level of entities, then his hardcoded 3 decodes will go belly up whenever he processes anything that has been encoded fewer than 4 times.

        Even so, the logic of testing for a change in length works. You just have to retain two levels of 'undo' at each iteration. If the data being processed isn't too many megabytes at a time, then something as simple as this will work regardless of how many times the content has been entity-encoded:

        #! perl -slw
        use strict;
        use HTML::Entities;

        my $data = '<p><b><i>AT&amp;T &lt;grin></i></b></p>';
        $data = HTML::Entities::encode( $data ) for 1 .. rand( 10 );

        my @saved = $data;
        my $l1 = length $data;
        {
            my $l2 = length( $data = HTML::Entities::decode( $data ) );
            if( $l2 < $l1 ) {
                push @saved, $data;
                $l1 = $l2;
                redo;
            }
        }
        $data = $saved[-2];
        print $data;

        __END__
        P:\test>junk2
        <p><b><i>AT&amp;T &lt;grin></i></b></p>

        P:\test>junk2
        <p><b><i>AT&amp;T &lt;grin></i></b></p>

        P:\test>junk2
        <p><b><i>AT&amp;T &lt;grin></i></b></p>

        I still think that the logic shown in the OP's code, $title =~ s/strip_stuff_like_html_and_cdata_tags//g;, plus his description

        Before working on the text we find inside title tags

        suggests that he is interested in manipulating the content, not the markup.

        And if this is ever destined to be redisplayed in a browser (of which I see no mention?), it will probably be in a completely different context from the one it was fetched from.

        Which suggests to me that it would be better to extract the text content, remove all entities to allow for DB storage, pattern matching, etc., and, if it is ever going to be redisplayed in a browser, re-encode the content before combining it with the new markup.
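        A minimal sketch of that round trip (the stored string is hypothetical):

```perl
use strict;
use warnings;
use HTML::Entities;

# Fully decoded text, as it might be stored or pattern-matched:
my $stored = 'AT&T <grin>';

# Re-encode only at display time, before mixing with new markup:
my $display = HTML::Entities::encode($stored);
print "$display\n";    # AT&amp;T &lt;grin&gt;
```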

        But you could be right.

