Re: Cleanning HTML - New/better module for that

Re: Re: Cleanning HTML - New/better module for that - test please! ;-P
by PodMaster (Abbot) on Apr 22, 2003 at 12:55 UTC

japhy's YAPE::HTML - Yet Another Parser/Extractor for HTML

MJD says you can't just make shit up and expect the computer to know what you mean, retardo!
I run a Win32 PPM repository for perl 5.6x+5.8x. I take requests.
** The Third rule of perl club is a statement of fact: pod is sexy.

Re: Re: Cleanning HTML - New/better module for that - test please! ;-P
by gmpassos (Priest) on Apr 22, 2003 at 15:50 UTC

Since all I want is to clean the HTML quickly, I can't afford to parse it into a full tree. Note that the idea is to filter the output of mod_perl, or of any CGI, to make the HTML smaller. The filter can't be slow or use much memory/CPU, or it will hurt the server and bring no advantage.

I tested htmltidy (http://tidy.sourceforge.net/) and saw that it's good for fixing bugs in the HTML and for applying a style to it, not for cleaning the code!
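
To make "clean" concrete, here is a minimal sketch of the kind of cheap, tree-free filter described above. It is not the code from the main node; `shrink_html` is a hypothetical name, and the sketch assumes the page has no `<pre>`, `<textarea>`, `<script>` or conditional comments whose whitespace or comments matter.

```perl
use strict;
use warnings;

# Hypothetical helper (not the code from the main node): make HTML smaller
# with a few linear passes instead of building a parse tree.
sub shrink_html {
    my ($html) = @_;
    $html =~ s/<!--.*?-->//gs;   # drop comments (non-greedy, may span lines)
    $html =~ s/\s+/ /g;          # collapse every whitespace run to one space
    $html =~ s/^ | $//g;         # trim the single space left at either end
    return $html;
}

print shrink_html("<p>\n   Hello   <b>world</b>\n</p>\n<!-- note -->\n"), "\n";
```

Each substitution is a single pass over the string, so the cost stays proportional to the size of the page, which is the property the filter needs.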

Graciliano M. P.
"The creativity is the expression of the liberty".

what is your definition of clean code?

by g00n (Hermit) on Apr 23, 2003 at 03:43 UTC

tidy intro - When editing HTML it's easy to make mistakes. Wouldn't it be nice if there was a simple way to fix these mistakes automatically and tidy up sloppy editing into nicely layed out markup?

clean the code

Re: what is your definition of clean code?

by tachyon (Chancellor) on Apr 23, 2003 at 04:35 UTC

HTML Tidy is an excellent little widget for checking that HTML conforms to the W3C HTML spec and for fixing errors, as well as cleaning up indentation etc. It has nothing to do with Perl per se. As to examples, there are many. For example, Netscape/Mozilla is very particular about closing table tags. If you forget a </table> or have extra ones, then really odd stuff will happen or the page will simply fail to display. To generate your own examples, take some HTML, run it through tidy and RTFO, where O = Output and the rest has the usual meaning.

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: what is your definition of clean code?

by gmpassos (Priest) on Apr 23, 2003 at 05:03 UTC

Example? Run the code from the main node against the www.cnn.com.br URL and compare the cleaned code with the original.

Graciliano M. P.
"The creativity is the expression of the liberty".

Re^2: Cleanning HTML - New/better module (regexes for html)
by Aristotle (Chancellor) on Apr 26, 2003 at 02:37 UTC

HTML::Parser

Makeshifts last the longest.

Re: Re^2: Cleanning HTML - New/better module for that - test please! ;-P

by thpfft (Chaplain) on Apr 27, 2003 at 18:59 UTC

It is true, of course, that it would be very difficult to recreate HTML::Parser in pure perl without using any regexes, though it does not follow from there that it is a good idea to recreate HTML::Parser in pure perl.

It is also true that the factors you describe are orthogonal, but only if you restrict the phrase 'use pattern matching' to its most drily correct application. In more informal usage it is common to talk of 'using regexes' as one way of parsing html and 'using the parser' as another, better way. I speak from chastening experience here.

So, to clarify, you are advising the OP to write his own parser in perl using plenty of regexes, and to restrict himself to only the most exact usage of words and operators? Which doesn't seem very perly, but I'm only a lowly bishop and easily muddled :)

Re^4: Cleanning HTML - New/better module (out of hand dismissal?)

by Aristotle (Chancellor) on Apr 27, 2003 at 19:16 UTC

Whatever your rank is or mine doesn't have anything to do with it.

I'm not saying anything about any of the OP's points either - yes, he would probably be better off using HTML::Parser. (There are reasons against this too, sometimes. Depends on too many factors to discuss here, I'll just assume you know what I mean.)

What I was pointing out is that you saw pattern matching and assumed he was 'using regexes' as in common parlance. But pattern matching can (and pretty much has to) be used for a proper parser too, so before you throw out blanket statements like "don't use regexes for parsing HTML" please have a look at what he's actually doing.

(His parser is defective - there are really three modes in *ML: text, tags, and attribute values. You have to parse the value assigned to an attribute separately from the tag and attribute names, mainly because right angle brackets appearing inside an attribute value don't terminate a tag. gmpassos' code doesn't take this into account.)
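
A minimal sketch of that three-mode idea, for illustration only: it is neither gmpassos' code nor how HTML::Parser works internally. `tokenize` is a hypothetical name, and the sketch assumes there are no comments, CDATA sections or `<script>` bodies to special-case. Quoted attribute values are matched as units, so a `>` inside one does not end the tag.

```perl
use strict;
use warnings;

# Hypothetical sketch: split HTML into [type, content] tokens.
# Modes: text, tag (name and attribute names), and quoted attribute value.
sub tokenize {
    my ($html) = @_;
    my @tokens;
    pos($html) = 0;
    while (pos($html) < length $html) {
        # Tag mode: quoted values are consumed whole, so '>' inside them is inert.
        if ($html =~ /\G<((?:"[^"]*"|'[^']*'|[^'">])*)>/gc) {
            push @tokens, [ tag => $1 ];
        }
        # Text mode: everything up to the next '<'.
        elsif ($html =~ /\G([^<]+)/gc) {
            push @tokens, [ text => $1 ];
        }
        # A '<' that never forms a complete tag: keep it as literal text.
        else {
            $html =~ /\G(.)/gcs;
            push @tokens, [ text => $1 ];
        }
    }
    return @tokens;
}

for my $token (tokenize(q{<a title="x > y">link</a>})) {
    printf "%-4s %s\n", @$token;
}
```

A naive `<[^>]*>` would cut the first tag off at the `>` inside the title value; the quoted alternatives in the tag pattern are what prevent that.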

Makeshifts last the longest.

Re5: Cleanning HTML - New/better module for that - test please! ;-P

by thpfft (Chaplain) on Apr 27, 2003 at 23:36 UTC