in reply to HTML from single, double and triple encoded entities in RSS documents

Why not just iterate until the length doesn't change anymore?

my( $l1, $l2 ) = length( $text );
$l1 = $l2
    while ( $l2 = length( $text = HTML::Entities::decode( $text ) ) ) < $l1;

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^2: HTML from single, double and triple encoded entities in RSS documents
by Aristotle (Chancellor) on Jan 07, 2006 at 18:54 UTC

    Because then you will turn AT&amp;T into AT&T, which is invalid. And while you might not care because tagsoup rendering will still produce something readable, you’ll probably care that &lt;grin> will turn into <grin>, causing the browser to silently ignore it as an unknown tag.
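
    To see the problem concretely, here is a minimal sketch (assuming only decode_entities from HTML::Entities, and input that is already correctly encoded) of what the length-iteration loop does:

    use strict;
    use warnings;
    use HTML::Entities qw( decode_entities );

    my $text = 'AT&amp;T &lt;grin>';    # correctly encoded, exactly one level

    # "iterate until the length doesn't change anymore"
    my $old = length $text;
    my $new;
    $old = $new
        while ( $new = length( $text = decode_entities( $text ) ) ) < $old;

    print $text, "\n";    # AT&T <grin>  -- once-valid markup, now broken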

    What you are proposing is intentional silent data loss.

    Makeshifts last the longest.

      Is that how you read the OP's intent? I thought about it, but if the requirement is to retain the final level of entities, then his hardcoded 3 decodes will go belly up whenever he processes anything that has been encoded fewer than 4 times.

      Even so, the logic of testing for a change in length works; you just have to retain two levels of 'undo' at each iteration. If the data being processed isn't too many megabytes each time, then something as simple as this will work regardless of how many times the content has been entity-encoded:

      #! perl -slw
      use strict;
      use HTML::Entities;

      my $data = '<p><b><i>AT&amp;T &lt;grin></i></b></p>';
      $data = HTML::Entities::encode( $data ) for 1 .. rand( 10 );

      my @saved = $data;
      my $l1 = length $data;
      {
          my $l2 = length( $data = HTML::Entities::decode( $data ) );
          if( $l2 < $l1 ) {
              push @saved, $data;
              $l1 = $l2;
              redo;
          }
      }
      $data = $saved[-2];
      print $data;

      __END__
      P:\test>junk2
      <p><b><i>AT&amp;T &lt;grin></i></b></p>

      P:\test>junk2
      <p><b><i>AT&amp;T &lt;grin></i></b></p>

      P:\test>junk2
      <p><b><i>AT&amp;T &lt;grin></i></b></p>

      I still think that the logic shown in the OP's code $title =~ s/strip_stuff_like_html_and_cdata_tags//g;, plus his description

      Before working on the text we find inside title tags

      suggests that he is interested in manipulating the content, not the markup.

      And if this is ever destined to be redisplayed in a browser (of which I see no mention), it will probably be in a completely different context from the one it was fetched from.

      Which suggests to me that it would be better to extract the text content and remove all entities, to allow for DB storage, pattern matching, etc., and, if it is ever going to be redisplayed in a browser, to re-encode the content before combining it with the new markup.
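
      Schematically, that workflow might look something like this minimal sketch (the double-encoded sample title is an assumption for illustration; decode_entities and encode_entities come from HTML::Entities):

      use strict;
      use warnings;
      use HTML::Entities qw( decode_entities encode_entities );

      # Assumed sample: a feed title that arrived double-encoded.
      my $raw = 'AT&amp;amp;T &amp;lt;grin&amp;gt;';

      # Fully decode for DB storage, pattern matching, etc.
      my $plain = $raw;
      my $len   = length $plain;
      $len = length $plain
          while length( $plain = decode_entities( $plain ) ) < $len;
      # $plain is now: AT&T <grin>

      # Re-encode only at the point of redisplay, in the new page's context.
      print '<h1>', encode_entities( $plain ), "</h1>\n";
      # <h1>AT&amp;T &lt;grin&gt;</h1>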

      But you could be right.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Unfortunately, that does not work either. Try this as sample input data:

        my $data = 'But what about the &amp;amp; entity? &lt;sigh>';

        That needs to stay as it is, but you will find that it gets over-decoded into 'But what about the &amp; entity? <sigh>'.
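
        Running your 'decode until stable, then back up one level' scheme over that input shows it (only decode_entities from HTML::Entities is assumed):

        use strict;
        use warnings;
        use HTML::Entities qw( decode_entities );

        my $data = 'But what about the &amp;amp; entity? &lt;sigh>';

        # decode until the length stops shrinking, saving each level
        my @saved = $data;
        my $l1 = length $data;
        {
            my $l2 = length( $data = decode_entities( $data ) );
            if( $l2 < $l1 ) {
                push @saved, $data;
                $l1 = $l2;
                redo;
            }
        }

        print $saved[-2], "\n";
        # But what about the &amp; entity? <sigh>   -- one level too far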

        It’s impossible to reliably infer what the data means from looking at the data itself.

        Really.

        Sorry. :-(

        I still think that the logic shown in the OP's code […] plus his description […] suggests that he is interested in manipulating the content, not the markup.

        Sure, but he must first reliably identify which parts are markup and which are not, so that he can strip markup without stripping content. Only after stripping the markup can he decode once more to resolve entities to characters. But if he over-decodes &lt;sigh> to <sigh> in the first step, he’ll end up stripping it even though it was content.
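
        A minimal sketch of why the order matters (the naive tag-stripping regex here merely stands in for whatever stripping he actually does):

        use strict;
        use warnings;
        use HTML::Entities qw( decode_entities );

        my $title = '<b>an encoded &lt;sigh></b>';

        # Wrong order: decode first, then strip -- the &lt;sigh> content is lost.
        my $bad = decode_entities( $title );
        $bad =~ s/<[^>]*>//g;           # naive tag strip, for illustration only
        print "$bad\n";                 # an encoded

        # Better order: strip the markup first, then decode exactly once.
        ( my $good = $title ) =~ s/<[^>]*>//g;
        $good = decode_entities( $good );
        print "$good\n";                # an encoded <sigh>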

        There is just no way around it: you do not and cannot know what the data means. It may seem mind-boggling that a technology with such wide adoption has such a fundamental and unresolvable flaw, but it’s true.

        (So anyone reading this who is planning to deploy syndication feeds: in the name of the sanity of feed reader developers, I implore you, please use the Atom format to publish, not RSS. You’ll do everyone a favour – including yourself and your readers.)

        Makeshifts last the longest.