Unfortunately, that does not work either. Try this as sample input data:

my $data = 'But what about the &amp; entity? &lt;sigh>';

That needs to stay as it is, but you will find that it gets over-decoded into But what about the & entity? <sigh>.
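To make the failure concrete, here is a minimal sketch. It assumes the approach under discussion boils down to "keep decoding until nothing changes"; that is my guess at the strategy, not the actual code from the thread. decode_once and decode_until_stable are hypothetical helpers, and the tiny entity table is a stand-in for a real decoder such as HTML::Entities. The sample string contains the literal entities &amp;amp; and &amp;lt;.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a real entity decoder: it knows only four
# named entities and removes exactly one level of encoding per call.
my %entity = ( amp => '&', lt => '<', gt => '>', quot => '"' );

sub decode_once {
    my ($text) = @_;
    # A single substitution pass, so '&amp;lt;' becomes '&lt;', not '<'.
    $text =~ s/&(amp|lt|gt|quot);/$entity{$1}/g;
    return $text;
}

# Assumed heuristic: decode again and again until the text stops changing.
sub decode_until_stable {
    my ($text) = @_;
    while (1) {
        my $next = decode_once($text);
        return $text if $next eq $text;
        $text = $next;
    }
}

# This string is already correct HTML and should be left alone.
my $data = 'But what about the &amp; entity? &lt;sigh>';

print decode_until_stable($data), "\n";
# prints: But what about the & entity? <sigh>
```

The loop cannot tell "HTML that mentions an entity" apart from "text that was encoded one time too many", because both look exactly the same as strings.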

It’s impossible to reliably infer what the data means from looking at the data itself.

Really.

Sorry. :-(

I still think that the logic shown in the OP's code […] plus his description […] suggests that he is interested in manipulating the content, not the markup.

Sure, but he must first reliably identify which parts are markup and which are not, so that he can strip markup without stripping content. Only after stripping the markup can he decode once more to resolve entities to characters. But if he over-decodes &lt;sigh> to <sigh> in the first step, he’ll end up stripping it even though it was content.
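The ordering problem can be sketched like this. The helpers are hypothetical: decode_once handles only four named entities, and strip_tags is a deliberately naive regex that stands in for a real HTML parser. The sample is assumed to be HTML at the correct encoding level, which is exactly the thing you cannot know for real feed data.

```perl
use strict;
use warnings;

# Hypothetical one-level entity decoder (four named entities only).
my %entity = ( amp => '&', lt => '<', gt => '>', quot => '"' );

sub decode_once {
    my ($text) = @_;
    $text =~ s/&(amp|lt|gt|quot);/$entity{$1}/g;
    return $text;
}

# Naive tag stripper for illustration only; real HTML needs a parser.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<[^>]*>//g;
    return $html;
}

# Assumed to be HTML at the right encoding level: <p> is markup,
# while &lt;sigh> is content that should display as "<sigh>".
my $html = '<p>But what about the &amp; entity? &lt;sigh></p>';

my $right = decode_once( strip_tags($html) );   # strip markup, then decode
my $wrong = strip_tags( decode_once($html) );   # over-decode, then strip

print "strip, then decode: $right\n";
# prints: strip, then decode: But what about the & entity? <sigh>
print "decode, then strip: $wrong\n";
# prints: decode, then strip: But what about the & entity? 
```

In the second ordering the decoded <sigh> is indistinguishable from a tag, so it is stripped along with the real markup.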

There is just no way around it: you do not and cannot know what the data means. It may seem mind-boggling that a technology with such wide adoption has such a fundamental and unresolvable flaw, but it’s true.

(So anyone reading this who is planning to deploy syndication feeds: in the name of the sanity of feed reader developers, I implore you, please use the Atom format to publish, not RSS. You’ll do everyone a favour – including yourself and your readers.)

Makeshifts last the longest.


In reply to Re^4: HTML from single, double and triple encoded entities in RSS documents by Aristotle
in thread HTML from single, double and triple encoded entities in RSS documents by Anonymous Monk
