in reply to Re: Cleanning HTML - New/better module for that - test please! ;-P

Suppose you want to write HTML::Parser in pure Perl. (Or is it already?) What would you use for the job? You guessed it: pattern matching. The opposite of parsing HTML is treating it as an unstructured stream of characters; whether you use pattern matching is orthogonal to the approach taken.

Makeshifts last the longest.

Re: Re^2: Cleanning HTML - New/better module for that - test please! ;-P
by thpfft (Chaplain) on Apr 27, 2003 at 18:59 UTC

    It is true, of course, that it would be very difficult to recreate HTML::Parser in pure Perl without using any regexes, though it does not follow that recreating HTML::Parser in pure Perl is a good idea.

    It is also true that the factors you describe are orthogonal, but only if you restrict the phrase 'use pattern matching' to its most drily correct application. In more informal usage it is common to talk of 'using regexes' as one way of parsing HTML and 'using the parser' as another, better way. I speak from chastening experience here.

    So, to clarify, you are advising the OP to write his own parser in Perl using plenty of regexes, and to restrict himself to only the most exact usage of words and operators? Which doesn't seem very perly, but I'm only a lowly bishop and easily muddled :)

      Whatever your rank is or mine doesn't have anything to do with it.

      I'm not saying anything about any of the OP's points either - yes, he would probably be better off using HTML::Parser. (There are reasons against this too, sometimes. It depends on too many factors to discuss here; I'll just assume you know what I mean.)

      What I was pointing out is that you saw pattern matching and assumed he was 'using regexes' as in common parlance. But pattern matching can (and pretty much has to) be used for a proper parser too, so before you throw out blanket statements like "don't use regexes for parsing HTML" please have a look at what he's actually doing.

      (His parser is defective - there are really three modes in *ML: text, tags, and attribute values. You have to parse the value assigned to an attribute separately from the tag and attribute names, mainly because a right angle bracket appearing inside a quoted attribute value doesn't terminate the tag. gmpassos' code doesn't take this into account.)
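
      To make the attribute-value point concrete, here is a minimal sketch. It is in Python rather than Perl, and it is neither gmpassos' code nor HTML::Parser's algorithm; the names NAIVE_TAG, QUOTE_AWARE_TAG and tokenize are made up for illustration. It only contrasts a tag pattern that stops at the first ">" with one that consumes quoted attribute values as a unit, and it ignores comments, CDATA, script bodies and malformed quoting.

```python
import re

# Naive pattern: "<", anything that is not ">", then ">".
# A ">" inside a quoted attribute value ends the match too early.
NAIVE_TAG = re.compile(r"<[^>]*>")

# Quote-aware pattern (illustrative assumption, not a full HTML grammar):
# inside a tag, a quoted attribute value is consumed as one unit,
# so a ">" between quotes does not end the tag.
QUOTE_AWARE_TAG = re.compile(
    r"""<                 # tag open
        (?:
            "[^"]*"       # double-quoted attribute value
          | '[^']*'       # single-quoted attribute value
          | [^'">]        # any other character except quotes and tag close
        )*
        >                 # tag close
    """,
    re.VERBOSE,
)


def tokenize(html, tag_pattern):
    """Split html into ("tag", s) and ("text", s) tokens using tag_pattern."""
    tokens = []
    pos = 0
    for match in tag_pattern.finditer(html):
        if match.start() > pos:
            # Everything between two tags is plain text.
            tokens.append(("text", html[pos:match.start()]))
        tokens.append(("tag", match.group(0)))
        pos = match.end()
    if pos < len(html):
        tokens.append(("text", html[pos:]))
    return tokens


sample = '<p>See <img src="a.png" alt=">"> here</p>'
naive = tokenize(sample, NAIVE_TAG)
aware = tokenize(sample, QUOTE_AWARE_TAG)

for kind, value in aware:
    print(kind, repr(value))
```

      With the naive pattern the img tag is cut off at the ">" inside alt, and the leftover quote and bracket leak into the text. The quote-aware pattern keeps the whole tag together. Both use pattern matching; only the second treats attribute values as their own mode.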

      Makeshifts last the longest.

        I know. Just teasing. But I did look at the original code quite closely, before I decided it was of the kind that merited what you describe as a blanket response. I came to that conclusion partly because the classic failure to deal with <img ...alt=">"> - that you mention - reveals a lack of acquaintance with the debate. For the record, I think gmpassos' code is rather good, as that sort of solution goes; certainly better than anything I managed before it was forcefully put to me that better mechanisms existed already. I should probably have said that, but I didn't really feel entitled to pass judgement on the quality, just the approach.

        btw, I don't know if you've read the discussion I linked to in the first post. It will put this one in useful context.