in reply to Re^3: HTML from single, double and triple encoded entities in RSS documents
in thread HTML from single, double and triple encoded entities in RSS documents
Unfortunately, that does not work either. Try this as sample input data:
my $data = 'But what about the &amp; entity? &lt;sigh&gt;';
That needs to stay as it is, but you will find that it gets over-decoded into But what about the & entity? <sigh>.
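To make the over-decoding concrete, here is a minimal sketch. It assumes the sample string carries exactly one level of entity encoding, and it uses a hypothetical four-entity decoder (`decode_once`) plus a hypothetical "decode until nothing changes" helper (`decode_until_stable`) rather than any code from this thread; a real program would more likely reach for HTML::Entities.

```perl
use strict;
use warnings;

# Hypothetical minimal decoder: handles only the four entities needed here.
my %ENTITY = (amp => '&', lt => '<', gt => '>', quot => '"');

sub decode_once {
    my ($text) = @_;
    $text =~ s/&(amp|lt|gt|quot);/$ENTITY{$1}/g;
    return $text;
}

# Naive strategy: keep decoding until the string stops changing.
sub decode_until_stable {
    my ($text) = @_;
    while (1) {
        my $next = decode_once($text);
        return $text if $next eq $text;
        $text = $next;
    }
}

my $data = 'But what about the &amp; entity? &lt;sigh&gt;';
print decode_until_stable($data), "\n";
# prints: But what about the & entity? <sigh>
```

Nothing in the string says that the single level of encoding was intentional, so a loop like this has no signal telling it when to stop.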
It’s impossible to reliably infer what the data means from looking at the data itself.
Really.
Sorry. :-(
I still think that the logic shown in the OPs code […] plus his description […] suggests that he is interested in manipulating the content, not the markup.
Sure, but he must first reliably identify which parts are markup and which are not, so that he can strip markup without stripping content. After stripping markup, he can then decode once more to resolve entities to characters. But if he over-decodes &lt;sigh&gt; to <sigh> in the first step, he’ll end up stripping it even though it was content.
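A rough sketch of that decode, strip, decode pipeline shows how one extra decode loses content. Everything here is illustrative: `decode_once`, `strip_tags` and `to_text` are hypothetical helpers, the tag stripper is a crude regex rather than a real HTML parser, and `$levels` (the number of extra encoding layers to peel off first) is precisely the value a feed does not tell you.

```perl
use strict;
use warnings;

my %ENTITY = (amp => '&', lt => '<', gt => '>', quot => '"');

# Hypothetical minimal decoder: one pass over four entities.
sub decode_once {
    my ($text) = @_;
    $text =~ s/&(amp|lt|gt|quot);/$ENTITY{$1}/g;
    return $text;
}

# Crude tag stripper, for illustration only.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<[^>]*>//g;
    return $html;
}

# Peel off $levels layers of encoding, strip markup, then decode once
# more to turn the remaining entities into characters.
sub to_text {
    my ($data, $levels) = @_;
    $data = decode_once($data) for 1 .. $levels;
    return decode_once(strip_tags($data));
}

my $data = '<p>But what about the &amp; entity? &lt;sigh&gt;</p>';

print '[', to_text($data, 0), "]\n";  # [But what about the & entity? <sigh>]
print '[', to_text($data, 1), "]\n";  # [But what about the & entity? ]
```

With the right guess the literal <sigh> survives as text; with one decode too many it is treated as a tag and silently stripped, and the input alone gives no way to tell which guess was right.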
There is just no way around it: you do not and cannot know what the data means. It may seem mind-boggling that a technology with such wide adoption has such a fundamental and unresolvable flaw, but it’s true.
(So anyone reading this who is planning to deploy syndication feeds: in the name of the sanity of feed reader developers, I implore you, please use the Atom format to publish, not RSS. You’ll do everyone a favour – including yourself and your readers.)
Makeshifts last the longest.