in reply to Re^3: Stripping HTML tags
in thread Stripping HTML tags

  1. I didn't provide the solution, I commented on it
  2. I don't have low standards for code - I suggested a different solution space (HTML::Strip or HTML::Parser) if this wasn't a one-off. Not all one-offs need to be robust.
  3. I gave an example of three places where the solution was "broken" - script and style tags, tags nested in comments (illegal but common) and doctype declarations.
  4. I made suggestions for fixing those three specific issues.

Your point is taken, though - I didn't say why I still don't think the amended solution is robust. Parsing HTML with a series of regexes is slow and difficult. For example, script tags don't necessarily have end tags: they could simply link to a .js file. Then, much later in the HTML document, if there was a closing script tag for another block, the regex would swallow and delete the valid content in between.
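
The failure mode described above can be reproduced with a short sketch. The two substitutions below are stand-ins for a typical multi-pass approach, not the exact code from earlier in the thread, and the sample markup, the menu.js file name and the strip_naive helper are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical input: an external script with no closing tag,
# followed later by an ordinary inline script block.
my $html = join '',
    '<script src="menu.js">',
    '<p>Valid content that should survive.</p>',
    '<script>var x = 1;</script>',
    '<p>Trailing text.</p>';

# Naive two-pass stripper (an assumed stand-in, not the thread's code).
sub strip_naive {
    my ($text) = @_;

    # Pass 1: drop script blocks (non-greedy, matching across newlines).
    $text =~ s{<script\b[^>]*>.*?</script\s*>}{}gis;

    # Pass 2: drop whatever tags remain.
    $text =~ s{<[^>]+>}{}g;

    return $text;
}

print strip_naive($html), "\n";
```

Because the first script tag is never closed, the non-greedy match runs on to the closing tag of the second block, so the paragraph between them is deleted along with the scripts and only the trailing text is printed.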

For performance, HTML::Strip is an XS module, so it should be much, much faster than the multi-pass regex approach.

Re^5: Stripping HTML tags
by tilly (Archbishop) on May 25, 2005 at 01:01 UTC
    Allow me to clarify what I disagree with.

    Many people just take solutions from here, test them minimally, and run with them. Therefore if you provide bad solutions and do not say they are bad, you're encouraging bad habits. I strongly dislike this.

    If someone provides a bad solution and says that it is bad and why, that's OK. If someone provides a bad solution and doesn't give any caveats, that's very much not OK in my book, and I don't want that person to get the message in any shape or form that what they are doing is even remotely fine to do.