in reply to Re: Re^2: Cleanning HTML - New/better module for that - test please! ;-P
in thread Cleanning HTML - New/better module for that - test please! ;-P

Whatever your rank is or mine doesn't have anything to do with it.

I'm not saying anything about any of the OP's points either - yes, he would probably be better off using HTML::Parser. (There are reasons against this too, sometimes. That depends on too many factors to discuss here; I'll just assume you know what I mean.)

What I was pointing out is that you saw pattern matching and assumed he was 'using regexes' as in common parlance. But pattern matching can (and pretty much has to) be used for a proper parser too, so before you throw out blanket statements like "don't use regexes for parsing HTML" please have a look at what he's actually doing.

(His parser is defective - there are really three modes in *ML: text, tags, and attribute values. You have to parse the value assigned to an attribute separately from the tag and attribute names, mainly because a right angle bracket appearing inside a quoted attribute value doesn't terminate the tag. gmpassos' code doesn't take this into account.)
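
To make the three-mode point concrete, here is a minimal sketch, written in Python only to keep it short and self-contained. It is not gmpassos' code and not how HTML::Parser works; the names NAIVE_TAG, TAG and split_markup are made up for illustration, and it deliberately ignores comments, CDATA, script bodies and other cases a real parser has to handle. It only shows the difference between stopping at the first '>' and treating a quoted attribute value as its own mode.

```python
import re

# Naive approach: everything up to the first '>' is taken to be the tag.
NAIVE_TAG = re.compile(r"<[^>]*>")

# Mode-aware approach: inside a tag, a quoted attribute value is matched
# as one unit, so a '>' inside the quotes cannot end the tag.
TAG = re.compile(
    r"""<                  # start of a tag
        (?:
            "[^"]*"        # double-quoted attribute value
          | '[^']*'        # single-quoted attribute value
          | [^'">]         # any other character except a quote or '>'
        )*
        >                  # the '>' that really ends the tag
    """,
    re.VERBOSE,
)


def split_markup(source):
    """Split source into (kind, chunk) pairs, kind being 'text' or 'tag'."""
    pieces = []
    pos = 0
    for match in TAG.finditer(source):
        if match.start() > pos:
            # Anything between two tags is plain text.
            pieces.append(("text", source[pos:match.start()]))
        pieces.append(("tag", match.group(0)))
        pos = match.end()
    if pos < len(source):
        pieces.append(("text", source[pos:]))
    return pieces


sample = 'before <img src="a.png" alt=">"> after'
naive_match = NAIVE_TAG.search(sample).group(0)
pieces = split_markup(sample)
```

On that sample, naive_match stops at the '>' inside the alt value, while split_markup returns the whole img tag as a single piece with the text on either side. It is still pattern matching; the pattern just knows which mode it is in.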

Makeshifts last the longest.


Re5: Cleanning HTML - New/better module for that - test please! ;-P
by thpfft (Chaplain) on Apr 27, 2003 at 23:36 UTC

    I know. Just teasing. But I did look at the original code quite closely, before I decided it was of the kind that merited what you describe as a blanket response. I came to that conclusion partly because the classic failure to deal with <img ...alt=">"> - that you mention - reveals a lack of acquaintance with the debate. For the record, I think gmpassos' code is rather good, as that sort of solution goes; certainly better than anything I managed before it was forcefully put to me that better mechanisms existed already. I should probably have said that, but I didn't really feel entitled to pass judgement on the quality, just the approach.

    btw, I don't know if you've read the discussion I linked to in the first post. It will put this one in useful context.