Cleaning HTML - New/better module for that

I was testing the HTML::Clean module, intending to add a filter flag for the mod_perl output of HPL (another HTML/Perl embedding system). But when I looked at the source to see how the code is cleaned, I saw that the filter can make mistakes with complex HTML. So I decided to write my own filter, one that doesn't change the final result in the browser. I ran some tests with HTML::Clean and my new module, and found that mine filters better: it cleans more and leaves the rendered result unchanged. (I used pages from www.cnn.com.br and www.perl.com, which have styles, JavaScript, etc.)

I don't want to argue about which one is better. HTML::Clean's idea of filtering with direct regex substitutions is actually good, since it uses less memory, but a regex can't know exactly where it is inside the HTML tree. On the other hand, we can't build a filter entirely on a parsed HTML tree, since that would be slow, which is not good for a server. My module sits somewhere between the two approaches: it looks only at the basic things that can be cleaned, nothing very complex, to keep it fast.
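As a rough illustration of such a middle path (a sketch only, not the code in the zip and not HTML::Clean's API): the hypothetical `clean_html` below first sets aside the blocks where whitespace matters (`pre`, `textarea`, `script`, `style`), applies cheap regex substitutions to the rest, then puts the blocks back. It assumes the input contains no NUL bytes, since those are used as placeholders, and it uses only core Perl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper for illustration; the name and behaviour are assumptions.
sub clean_html {
    my ($html) = @_;
    my @saved;

    # 1. Set aside blocks whose whitespace or content must not be touched.
    #    Each block is replaced by a NUL-delimited index into @saved.
    $html =~ s{(<(pre|textarea|script|style)\b[^>]*>.*?</\2\s*>)}
              { push @saved, $1; "\0" . $#saved . "\0" }gise;

    # 2. Drop ordinary comments, but keep conditional comments (<!--[if ...),
    #    which can change what the browser renders.
    $html =~ s/<!--(?!\[).*?-->//gs;

    # 3. Collapse every run of whitespace to one space. Whitespace between
    #    tags is NOT removed entirely, because that can change inline layout.
    #    Known limitation: runs of spaces inside attribute values collapse too.
    $html =~ s/\s+/ /g;

    # 4. Restore the protected blocks unchanged.
    $html =~ s/\0(\d+)\0/$saved[$1]/g;

    return $html;
}

my $in  = "<html>\n  <body>\n    <!-- note -->\n    <p>Hello    world</p>\n"
        . "<pre>  keep\n   this  </pre>\n  </body>\n</html>\n";
my $out = clean_html($in);
printf "%d bytes in, %d bytes out\n", length $in, length $out;
```

The trade-off is the one described above: no tree is built, so it stays cheap, but the regexes only know about the few block types that were explicitly protected.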

I have contacted the author (so far I've just sent an e-mail and am waiting for a reply) about updating HTML::Clean with the code I wrote. But the code is only 2 days old and needs testing. I would like the monks to test it on some web sites and check that the output looks the same in the browser. Any ideas for making the filter better, or any comments, are gladly accepted!

To test, get: http://www.inf.ufsc.br/~gmpassos/htmlclean.zip
It is very small: the test consists of only 2 files, and you don't need to install any modules in your Perl.

Graciliano M. P.
"The creativity is the expression of the liberty".


Re: Cleaning HTML - New/better module for that - test please! ;-P
by thpfft (Chaplain) on Apr 22, 2003 at 12:32 UTC

Regex-based HTML processing is generally not regarded as a good idea: it's unreliable, labour-intensive, demanding to maintain and very very difficult to get right. The vast majority of respectable solutions are based on HTML::Parser, either directly or by way of one of the modules that put a simpler interface on it. Ovid's HTML::TokeParser::Simple is probably the one I'd recommend. My own HTML::TagFilter is simpler, but not as good (and not at all diligently maintained :).

If your goal is just to clean, rather than to digest and process, then you would also do well to try HTML::Tidy, a perl interface to the venerable but very effective htmltidy library.

I'm afraid you will almost certainly find that this wheel has already been made for you and that only a half-dozen lines of code are required...


Re: Re: Cleaning HTML - New/better module for that - test please! ;-P

by PodMaster (Abbot) on Apr 22, 2003 at 12:55 UTC

japhy

YAPE::HTML - Yet Another Parser/Extractor for HTML

MJD says you can't just make shit up and expect the computer to know what you mean, retardo!
I run a Win32 PPM repository for perl 5.6x+5.8x. I take requests.
** The Third rule of perl club is a statement of fact: pod is sexy.


Re: Re: Cleaning HTML - New/better module for that - test please! ;-P

by gmpassos (Priest) on Apr 22, 2003 at 15:50 UTC

Since all I want is to clean HTML quickly, I can't parse the HTML into a full tree. Note that the idea is to filter the output of mod_perl, or any CGI, to make the HTML smaller. This can't be slow or use much memory/CPU, or it will hurt the server and bring no advantage.

I tested htmltidy (http://tidy.sourceforge.net/) and saw that it's good to fix bugs in the HTML and to apply a style to it, not to clean the code!

Graciliano M. P.
"The creativity is the expression of the liberty".


what is your definition of clean code?

by g00n (Hermit) on Apr 23, 2003 at 03:43 UTC

tidy intro - When editing HTML it's easy to make mistakes. Wouldn't it be nice if there was a simple way to fix these mistakes automatically and tidy up sloppy editing into nicely layed out markup?

clean the code


Re: what is your definition of clean code?

by tachyon (Chancellor) on Apr 23, 2003 at 04:35 UTC

Re: what is your definition of clean code?

by gmpassos (Priest) on Apr 23, 2003 at 05:03 UTC

Re^2: Cleaning HTML - New/better module (regexes for html)

by Aristotle (Chancellor) on Apr 26, 2003 at 02:37 UTC

HTML::Parser

Makeshifts last the longest.


Re: Re^2: Cleaning HTML - New/better module for that - test please! ;-P

by thpfft (Chaplain) on Apr 27, 2003 at 18:59 UTC

It is true, of course, that it would be very difficult to recreate HTML::Parser in pure perl without using any regexes, though it does not follow from there that it is a good idea to recreate HTML::Parser in pure perl.

It is also true that the factors you describe are orthogonal, but only if you restrict the phrase 'use pattern matching' to its most drily correct application. In more informal usage it is common to talk of 'using regexes' as one way of parsing HTML and 'using the parser' as another, better way. I speak from chastening experience here.

So, to clarify, you are advising the OP to write his own parser in perl using plenty of regexes, and to restrict himself to only the most exact usage of words and operators? Which doesn't seem very perly, but I'm only a lowly bishop and easily muddled :)


Re^4: Cleaning HTML - New/better module (out of hand dismissal?)

by Aristotle (Chancellor) on Apr 27, 2003 at 19:16 UTC

Re5: Cleaning HTML - New/better module for that - test please! ;-P

by thpfft (Chaplain) on Apr 27, 2003 at 23:36 UTC