Re: HTML from single, double and triple encoded entities in RSS documents

before we find quadruple encoded documents it might be wise to find a reliable way to test for the presence of entities in a string before decoding it, so we can recurse.

There is no reliable way.
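To make that concrete, here is a minimal sketch, in Python rather than Perl purely for brevity. The helper `decode_until_stable` and both sample titles are made up for illustration; nothing here comes from a real feed. It shows the obvious "keep decoding while entities remain" loop doing the right thing for one title and the wrong thing for another, with no way to tell the two cases apart from the string alone.

```python
import html

def decode_until_stable(text, max_passes=5):
    """Hypothetical helper: unescape repeatedly until the string stops changing.

    Returns the final string and the number of passes that changed it.
    """
    passes = 0
    while passes < max_passes:
        decoded = html.unescape(text)
        if decoded == text:
            break
        text = decoded
        passes += 1
    return text, passes

# Double-encoded title that carries markup: two passes recover "<b>".
markup_title = "&amp;lt;b&amp;gt;News&amp;lt;/b&amp;gt;"

# Title that is *about* an entity: the author wants readers to see the
# literal text "&lt;", so exactly one pass is correct here.
literal_title = "How to write &amp;lt; in HTML"

print(decode_until_stable(markup_title))
print(html.unescape(literal_title))        # what the author meant
print(decode_until_stable(literal_title))  # what the loop produces
```

Both inputs still "contain entities" after the first pass, so a presence test cannot distinguish a title that needs another round from one that is already finished.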

A wide variety of encoded and unencoded content has been observed in the wild while parsing titles out of RSS feeds by hand with Perl.

Yes. And it’s impossible to handle all feeds correctly.

Give up.

Generally, RSS titles should not contain markup. So per spec, they should be unescaped only once (which, if you were doing the right thing and using an XML parser, instead of groping around with a regex, would already have happened by the time you get the data). However, practically everyone double-encodes their titles, which allows carrying markup through them. Triple-encoded titles would be a bug; though I would not be surprised if that were slightly common (enough so that one would need to worry about it, that is).
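As a rough illustration of the "already happened" point, the sketch below (Python, using the standard library's `xml.etree.ElementTree`; the feed snippet is invented) parses an item whose title is double-encoded. The parser removes exactly one level of escaping on its own, so whatever remains is the producer's second layer, and decoding it again is a guess about intent rather than something the format tells you to do.

```python
import html
import xml.etree.ElementTree as ET

# Invented feed fragment with a double-encoded title.
feed = """<rss version="2.0"><channel>
<title>Example feed</title>
<item><title>&amp;lt;em&amp;gt;Breaking&amp;lt;/em&amp;gt; news</title></item>
</channel></rss>"""

root = ET.fromstring(feed)

# The XML parser has already unescaped once by the time we see the text.
title = root.findtext("channel/item/title")
print(title)

# A second decode is the consumer's own decision; it yields markup here,
# but would corrupt a title that was meant to show "&lt;em&gt;" literally.
print(html.unescape(title))
```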

This and more are reasons why Atom (RFC 4287) was conceived: to provide a well-specified content model so that it’s always clear whether the producer or consumer of content is at fault when the data is misencoded.

RSS does not afford such clarity. You simply don’t know what the data means. It’s mindboggling, I know, but true. Quoth Phil Ringnalda:

I can’t believe how many times I have to relearn this fact. It must be a survival instinct, that makes me keep forgetting about this huge impossible to shift elephant in the middle of the room.

If you need to use the character “<” in a feed title, which I only sort-of do in my weblog, but which another rather large project I’m peripherally involved with absolutely does, you have three choices: produce valid RSS which will fail with the classic “silent data loss” in virtually every reader currently available, knowingly produce invalid RSS because it will work perfectly in virtually every reader, and will not fail silently in the remaining ones, or, the only happy choice, use Atom instead since this problem is actually one of the primary reasons it started.

Of course, none of this helps you if you need to write software to consume RSS… but much as I wish I could say something which would, you’re simply out of luck.

Welcome to the world of RSS.

Makeshifts last the longest.

Re^2: HTML from single, double and triple encoded entities in RSS documents by jhourcle (Prior) on Jan 08, 2006 at 12:52 UTC
See also Norman Walsh's Escaped Markup Considered Harmful, and his followup.
Re^3: HTML from single, double and triple encoded entities in RSS documents by Aristotle (Chancellor) on Jan 08, 2006 at 17:05 UTC
Agreed that it’s bad; I’ve only recently linked that article myself. But there’s nothing left to do about its unfortunate adoption in RSS, so the question is: faced with the reality of escaped markup, how do you parse it? Of course that would be easy to answer, if only there were a way to really know what is actual escaped markup and what is text.
Re^4: HTML from single, double and triple encoded entities in RSS documents by jhourcle (Prior) on Jan 08, 2006 at 22:08 UTC
My personal opinion is that you decode either zero times or once. It doesn't help for RSS, but the times that I've written my own schemas, I've used two separate types -- the normal 'string', which I didn't decode at all, and a type 'embedded_xml', which was decoded once, and only once. In dealing with CGI programming for the last 10 years or so, I've lost count of how many of the early cross-site scripting flaws came from people using multiple-pass URI encoding or multiple-pass HTML encoding. (Or both. Technically, though, a URI that is URI-encoded and then HTML-encoded once is legal; I use that for mailto links all the time.)
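A minimal sketch of that two-type idea, in Python. The class names `PlainString` and `EmbeddedXml` are hypothetical stand-ins for the 'string' and 'embedded_xml' types described above, and `html.unescape` is only an approximation of "decode once", since it also accepts HTML named entities that plain XML does not define.

```python
import html
from dataclasses import dataclass

@dataclass(frozen=True)
class PlainString:
    """Stand-in for the 'string' type: never decoded."""
    raw: str

    def value(self) -> str:
        return self.raw

@dataclass(frozen=True)
class EmbeddedXml:
    """Stand-in for the 'embedded_xml' type: decoded once, and only once."""
    raw: str

    def value(self) -> str:
        return html.unescape(self.raw)

label = PlainString("a &amp;lt; b")
body = EmbeddedXml("&lt;p&gt;a &amp;lt; b&lt;/p&gt;")

print(label.value())  # returned untouched
print(body.value())   # one level removed; the inner "&lt;" survives as text
```

Because the number of decoding passes is fixed by the declared type rather than inferred from the content, there is nothing to guess at and no loop to exploit.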
Re^5: HTML from single, double and triple encoded entities in RSS documents by Aristotle (Chancellor) on Jan 09, 2006 at 02:44 UTC