Before we find quadruple-encoded documents, it might be wise to find a reliable way to test for the presence of entities in a string before decoding it, so we can recurse.

There is no reliable way.
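To make the ambiguity concrete, here is a minimal sketch in Python (chosen only for illustration; the same applies to Perl with an entity-decoding module). The helper `looks_encoded` and both sample titles are hypothetical. They stand for text as it comes out of an XML parser, after the one unescape the spec calls for.

```python
import html

def looks_encoded(text: str) -> bool:
    """Naive test: True if decoding entities would change the text."""
    return html.unescape(text) != text

# Hypothetical titles, already unescaped once by the XML parser.
from_double_encoder = "Q&amp;A session"      # producer meant: Q&A session
from_plain_text_feed = "Escape & as &amp;"   # producer meant exactly this text

for title in (from_double_encoder, from_plain_text_feed):
    # The test fires for both, but decoding is only right for the first.
    print(looks_encoded(title), repr(html.unescape(title)))
```

Both titles pass the test, yet decoding the second one destroys what its author wrote. Nothing in the string itself tells the two cases apart.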

A wide variety of encoded and unencoded content has been observed in the wild while parsing titles out of RSS feeds by hand with Perl.

Yes. And it’s impossible to handle all feeds correctly.

Give up.

Generally, RSS titles should not contain markup. So per spec, they should be unescaped only once (which, if you were doing the right thing and using an XML parser, instead of groping around with a regex, would already have happened by the time you get the data). However, practically everyone double-encodes their titles, which allows carrying markup through them. Triple-encoded titles would be a bug; though I would not be surprised if that were slightly common (enough so that one would need to worry about it, that is).
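As a rough illustration of that point, the sketch below (Python, standard library only; the feed snippet is made up) shows that an XML parser has already performed the first unescape by the time you see the title. A second decoding pass is then a guess about the producer's intent, not something the spec sanctions.

```python
import html
import xml.etree.ElementTree as ET

# Made-up feed: the first title is single-encoded, the second double-encoded.
feed = """<rss version="2.0"><channel>
<item><title>Cats &amp; dogs</title></item>
<item><title>Why &amp;lt;em&amp;gt;this&amp;lt;/em&amp;gt; matters</title></item>
</channel></rss>"""

# The parser unescapes once while reading the document.
titles = [item.findtext("title") for item in ET.fromstring(feed).iter("item")]

# Heuristic second pass: assumes the producer double-encoded markup.
second_pass = [html.unescape(t) for t in titles]

for parsed, decoded in zip(titles, second_pass):
    print(repr(parsed), "->", repr(decoded))
```

The second pass happens to leave the first title alone and recovers the markup in the second, but it would just as readily mangle a single-encoded title that talks about entities.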

This and more are reasons why Atom (RFC 4287) was conceived: to provide a well-specified content model so that it’s always clear whether the producer or consumer of content is at fault when the data is misencoded.

RSS does not afford such clarity. You simply don’t know what the data means. It’s mindboggling, I know, but true. Quoth Phil Ringnalda:

I can’t believe how many times I have to relearn this fact. It must be a survival instinct, that makes me keep forgetting about this huge impossible to shift elephant in the middle of the room.

If you need to use the character “<” in a feed title, which I only sort-of do in my weblog, but which another rather large project I’m peripherally involved with absolutely does, you have three choices: produce valid RSS which will fail with the classic “silent data loss” in virtually every reader currently available, knowingly produce invalid RSS because it will work perfectly in virtually every reader, and will not fail silently in the remaining ones, or, the only happy choice, use Atom instead since this problem is actually one of the primary reasons it started.

Of course, none of this helps you if you need to write software to consume RSS… but much as I wish I could say something which would, you’re simply out of luck.

Welcome to the world of RSS.

Makeshifts last the longest.


In reply to Re: HTML from single, double and triple encoded entities in RSS documents by Aristotle
in thread HTML from single, double and triple encoded entities in RSS documents by Anonymous Monk