in reply to Cleanning HTML - New/better module for that - test please! ;-P

Regex-based HTML processing is generally not regarded as a good idea: it's unreliable, labour-intensive, demanding to maintain and very, very difficult to get right. The vast majority of respectable solutions are based on HTML::Parser, either directly or by way of one of the modules that put a simpler interface on it. Ovid's HTML::TokeParser::Simple is probably the one I'd recommend. My own HTML::TagFilter is simpler, but not as good (and not at all diligently maintained :).

If your goal is just to clean, rather than to digest and process, then you would also do well to try HTML::Tidy, a Perl interface to the venerable but very effective htmltidy library.

I'm afraid you will almost certainly find that this wheel has already been made for you and that only a half-dozen lines of code are required...

Re: Re: Cleanning HTML - New/better module for that - test please! ;-P
by PodMaster (Abbot) on Apr 22, 2003 at 12:55 UTC
    I'd like to point out that japhy wrote one (a regex-based parser).

    YAPE::HTML - Yet Another Parser/Extractor for HTML


    MJD says you can't just make shit up and expect the computer to know what you mean, retardo!
    I run a Win32 PPM repository for perl 5.6x+5.8x. I take requests.
    ** The Third rule of perl club is a statement of fact: pod is sexy.

Re: Re: Cleanning HTML - New/better module for that - test please! ;-P
by gmpassos (Priest) on Apr 22, 2003 at 15:50 UTC
    If you parse the HTML tag by tag, you can do a good job with regexes, and that is what I did: it is not a regex filter applied directly to the full HTML source. It is like a pure-Perl parser that uses regexes to make it faster.

    Since all I want is to clean the HTML quickly, I can't afford to parse it into a full tree. Note that the idea is to filter the output of mod_perl, or any CGI, to make the HTML smaller. This can't be slow or use much memory/CPU, or it will burden the server without any benefit.

    I tested htmltidy (http://tidy.sourceforge.net/) and saw that it's good to fix bugs in the HTML and to apply a style to it, not to clean the code!

    Graciliano M. P.
    "The creativity is the expression of the liberty".

      tidy intro - When editing HTML it's easy to make mistakes. Wouldn't it be nice if there was a simple way to fix these mistakes automatically and tidy up sloppy editing into nicely layed out markup?

      Could you tell me what your definition of cleaning the code is? Could you provide an example?

        HTML Tidy is an excellent little widget for checking that HTML conforms to the W3C HTML spec and fixing errors, as well as cleaning up indentation etc. It has nothing to do with Perl per se. As to examples, there are many. For example, Netscape/Mozilla is very particular about closing table tags. If you forget a </table> or have extra ones then really odd stuff will happen or the page will simply fail to display. To generate your own examples, take some HTML, run it through tidy and RTFO, where O = Output and the rest has the usual meaning.

        cheers

        tachyon

        s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

        As I said, I want to clean the code and make it smaller (this is for the browser, not for humans). The idea is to cut everything that has no visual effect in the browser and to rewrite some parts/tags in fewer bytes, for example by dropping quotes where possible, removing extra spaces, etc.

        Example? Run the code from the main node against the www.cnn.com.br URL and compare the cleaned result with the original.
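
        For comparison, here is a minimal sketch of that kind of size-only cleaning done on top of a tokenizing parser rather than on the raw source. It is written in Python purely for illustration (in Perl the same shape would sit on HTML::Parser handlers) and is not the code from the main node. The names Shrinker, shrink, SAFE_UNQUOTED and VERBATIM are made up for this example, the set of characters treated as safe to leave unquoted is a deliberately conservative assumption, and dropping every comment would also drop things like IE conditional comments.

```python
import re
from html import escape
from html.parser import HTMLParser

# Assumption: values made only of these characters can safely go unquoted.
SAFE_UNQUOTED = re.compile(r"[A-Za-z0-9._:-]+")
# Elements whose content is passed through without whitespace collapsing.
VERBATIM = {"pre", "script", "style", "textarea"}


class Shrinker(HTMLParser):
    """Re-emit HTML token by token, spending fewer bytes where it is safe."""

    def __init__(self):
        # Keep entity and character references as separate events so they
        # can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.verbatim_depth = 0

    def _open_tag(self, tag, attrs):
        parts = ["<" + tag]
        for name, value in attrs:
            if value is None:                        # bare attribute
                parts.append(" " + name)
            elif SAFE_UNQUOTED.fullmatch(value):     # quotes not needed
                parts.append(" %s=%s" % (name, value))
            else:                                    # keep quotes, re-escape
                parts.append(' %s="%s"' % (name, escape(value, quote=True)))
        parts.append(">")
        return "".join(parts)

    def handle_starttag(self, tag, attrs):
        self.out.append(self._open_tag(tag, attrs))
        if tag in VERBATIM:
            self.verbatim_depth += 1

    def handle_startendtag(self, tag, attrs):
        # <br/> style tags are written without the trailing slash.
        self.out.append(self._open_tag(tag, attrs))

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)
        if tag in VERBATIM and self.verbatim_depth:
            self.verbatim_depth -= 1

    def handle_data(self, data):
        if self.verbatim_depth == 0:
            # Collapse whitespace runs and avoid doubling a space that was
            # already emitted just before a dropped comment.
            data = re.sub(r"\s+", " ", data)
            if data.startswith(" ") and self.out and self.out[-1].endswith(" "):
                data = data[1:]
        if data:
            self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_comment(self, data):
        pass  # comments are dropped entirely


def shrink(html_text):
    parser = Shrinker()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

        Calling shrink() on a page returns the smaller markup as a string. Whether a tokenizing pass like this is fast enough to run on every response is a separate question that only a benchmark on real pages can answer.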

        Graciliano M. P.
        "The creativity is the expression of the liberty".

Re^2: Cleanning HTML - New/better module (regexes for html)
by Aristotle (Chancellor) on Apr 26, 2003 at 02:37 UTC
    Suppose you want to write HTML::Parser in pure Perl. (Or is it already?) What would you use for the job? - You guessed it. The opposite of parsing HTML is treating it as an unstructured stream of characters - whether you use pattern matching is orthogonal to the approach taken.

    Makeshifts last the longest.

      It is true, of course, that it would be very difficult to recreate HTML::Parser in pure perl without using any regexes, though it does not follow from there that it is a good idea to recreate HTML::Parser in pure perl.

      It is also true that the factors you describe are orthogonal, but only if you restrict the phrase 'use pattern matching' to its most drily correct application. In more informal usage it is common to talk of 'using regexes' as one way of parsing HTML and 'using the parser' as another, better way. I speak from chastening experience here.

      So, to clarify, you are advising the OP to write his own parser in perl using plenty of regexes, and to restrict himself to only the most exact usage of words and operators? Which doesn't seem very perly, but I'm only a lowly bishop and easily muddled :)

        Whatever your rank is or mine doesn't have anything to do with it.

        I'm not saying anything about any of the OP's points either - yes, he would probably be better off using HTML::Parser. (There are reasons against this too, sometimes. Depends on too many factors to discuss here, I'll just assume you know what I mean.)

        What I was pointing out is that you saw pattern matching and assumed he was 'using regexes' as in common parlance. But pattern matching can (and pretty much has to) be used for a proper parser too, so before you throw out blanket statements like "don't use regexes for parsing HTML" please have a look at what he's actually doing.

        (His parser is defective - there are really three modes in *ML: text, tags, and attribute values. You have to parse the value assigned to an attribute separately from the tag and attribute names, mainly because right angle brackets appearing inside an attribute value don't terminate a tag. gmpassos' code doesn't take this into account.)
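
        To make the attribute-value point concrete, here is a small sketch, in Python purely for illustration, comparing a pattern that ends a tag at the first right angle bracket with one that consumes quoted values as a unit. Both patterns and the sample string are invented for this example and are not taken from gmpassos' code. The quote-aware pattern still ignores comments, CDATA and script content, so it is nowhere near a complete tokenizer.

```python
import re

# Naive: a tag is '<', anything that is not '>', then '>'.
NAIVE_TAG = re.compile(r"<[^>]*>")

# Quote-aware: inside a tag, a quoted attribute value is consumed whole,
# so a '>' between the quotes does not end the tag.
QUOTE_AWARE_TAG = re.compile(r"""<(?:"[^"]*"|'[^']*'|[^'">])*>""")

sample = '<a href="next.html" title="page 2 > page 3">next</a>'

naive = NAIVE_TAG.findall(sample)
aware = QUOTE_AWARE_TAG.findall(sample)

print(naive)  # first "tag" is cut off at the '>' inside the title value
print(aware)  # first tag runs to the real closing '>'
```

        With the naive pattern the rest of the title value leaks into what the caller sees as text, which is exactly the kind of failure that only shows up on somebody else's markup.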

        Makeshifts last the longest.