in reply to Re: HTML from single, double and triple encoded entities in RSS documents
in thread HTML from single, double and triple encoded entities in RSS documents

See also Norman Walsh's Escaped Markup Considered Harmful, and his follow-up.

Re^3: HTML from single, double and triple encoded entities in RSS documents
by Aristotle (Chancellor) on Jan 08, 2006 at 17:05 UTC

    Agreed that it’s bad; I’ve only recently linked that article myself. But there’s nothing left to do about its unfortunate adoption in RSS, so the question is: faced with the reality of escaped markup, how do you parse it?

    Of course that would be easy to answer, if only there were a way to really know what is actual escaped markup and what is text.

    Makeshifts last the longest.

      My personal opinion is that you decode either zero times or once. It doesn't help for RSS, but when I've written my own schemas, I've used two separate types -- the normal 'string', which I didn't decode at all, and a type 'embedded_xml', which was decoded once, and only once.
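
      A minimal sketch of that two-type idea in Python. The function name read_field and the way the type is passed in are made up for illustration; only the type names 'string' and 'embedded_xml' come from the description above. It assumes the value arrives as the text your code is handed, and it uses html.unescape from the standard library for the single decoding pass.

```python
from html import unescape

def read_field(raw: str, kind: str) -> str:
    """Decode a field value zero times or once, depending on its declared type.

    'read_field' and its signature are hypothetical, not from any real schema.
    """
    if kind == "string":
        # Plain text: never decoded, returned exactly as received.
        return raw
    if kind == "embedded_xml":
        # Escaped markup: decoded exactly once, never in a loop.
        return unescape(raw)
    raise ValueError(f"unknown field type: {kind!r}")

escaped = "&lt;em&gt;hi&lt;/em&gt;"
as_text = read_field(escaped, "string")          # stays escaped
as_markup = read_field(escaped, "embedded_xml")  # "<em>hi</em>"
```

      The point of refusing to loop is that a value which still looks escaped after one pass is left that way, rather than being peeled until markup appears.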

      In doing CGI programming for the last 10 years or so, I've lost count of how many of the early cross-site scripting flaws came from people using multiple-pass URI encoding or multiple-pass HTML encoding. (Or both ... but technically, a URI that has been URI-encoded once and then HTML-encoded once is legal ... I use it for mailto links all the time.)
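
      For the mailto case, a rough sketch of one URI-encoding pass followed by one HTML-encoding pass, in Python. The helper name mailto_href and the example address are invented for illustration, and the sketch assumes the address itself needs no escaping.

```python
from html import escape
from urllib.parse import quote

def mailto_href(address: str, subject: str, body: str) -> str:
    """Build an href value: URI-encode the parts once, then HTML-encode once.

    'mailto_href' is a hypothetical helper; 'address' is assumed to be safe as is.
    """
    # Pass 1: percent-encode each query value so '&', '<' and spaces are safe in a URI.
    uri = f"mailto:{address}?subject={quote(subject)}&body={quote(body)}"
    # Pass 2: HTML-encode the whole URI so it can sit inside an attribute value.
    return escape(uri, quote=True)

href = mailto_href("me@example.org", "Q&A", "1 < 2")
# href == "mailto:me@example.org?subject=Q%26A&amp;body=1%20%3C%202"
```

      Each layer is applied exactly once and for a different container: the percent-encoding belongs to the URI, the &amp; belongs to the HTML attribute.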

        The common terminology takes the point of view of someone looking at the XML in their editor, to whom the cases you describe look encoded either once or twice, rather than zero times or once. In the case of RSS, titles should be assumed to be double-encoded (i.e. you get a once-decoded string from the XML parser, then decode it once more yourself to reveal the markup as such). (And because RSS is too loosely specified, this will yield incorrigibly wrong results in a sizable number of cases.)
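
        A small Python sketch of that rule, using only the standard library. The feed is made up, and decode_title is a hypothetical name. The first title was double-encoded by its producer; the second producer meant the text "&amp;" literally and encoded it only once, yet both look the same in the source.

```python
from html import unescape
import xml.etree.ElementTree as ET

# Invented sample feed: both titles contain "&amp;amp;" in the XML source.
FEED = """<rss version="2.0"><channel>
  <item><title>Fish &amp;amp; chips</title></item>
  <item><title>Type &amp;amp; to get an ampersand</title></item>
</channel></rss>"""

def decode_title(item) -> str:
    """Assume the title is double-encoded: the parser decodes once, we decode once more."""
    once_decoded = item.findtext("title", default="")  # e.g. "Fish &amp; chips"
    return unescape(once_decoded)                      # e.g. "Fish & chips"

titles = [decode_title(item) for item in ET.fromstring(FEED).iter("item")]
# titles[0] == "Fish & chips"                 (what the producer meant)
# titles[1] == "Type & to get an ampersand"   (wrong: the producer meant "&amp;")
```

        Nothing in the feed tells the two cases apart, which is why the second result cannot be corrected after the fact.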

        As for cross-site scripting attacks, that’s even further off the original topic than the question of whether to use embedded markup, and anyone interested in such matters should see HOWTO Avoid Being Called a Bozo When Producing XML for a comprehensive treatise of dos and don’ts.

        Makeshifts last the longest.