in reply to Re: Cleanning HTML - New/better module for that - test please! ;-P
in thread Cleanning HTML - New/better module for that - test please! ;-P

If you parse the HTML tag by tag, you can do good work with regexes, and that is what I did: it is not a regex filter applied directly to the full HTML source. It is like a pure-Perl parser that uses regexes to make it faster.
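
A minimal sketch of the tag-by-tag idea, not the module's actual code: one anchored regex walks the source and hands back one token at a time instead of rewriting the whole document in a single substitution. The name `tokenize_html` and the token layout are made up for illustration, and the sketch assumes that `<script>`/`<style>` bodies do not need special handling.

```perl
use strict;
use warnings;

# Hypothetical helper: split HTML into comment, tag, and text tokens.
sub tokenize_html {
    my ($html) = @_;
    my @tokens;
    # \G with /gc resumes where the previous match ended, so the source
    # is scanned once, left to right. Each pass matches either a comment,
    # a tag (a ">" inside a quoted attribute value does not end the tag),
    # or a run of text (a stray "<" is treated as text).
    while ( $html =~ /\G(?:(<!--.*?-->)|(<(?:[^>"']|"[^"]*"|'[^']*')*>)|([^<]+|<))/gcs ) {
        if    ( defined $1 ) { push @tokens, [ comment => $1 ] }
        elsif ( defined $2 ) { push @tokens, [ tag     => $2 ] }
        else                 { push @tokens, [ text    => $3 ] }
    }
    return @tokens;
}

# Print each token with its kind.
for my $token ( tokenize_html('<p class="a">Hi <b>there</b></p>') ) {
    printf "%-7s %s\n", @$token;
}
```

Because each token is handled on its own, no tree is built and memory use stays close to the size of the input.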

Since all I want is to clean the HTML quickly, I can't parse it into a full tree. Note that the idea is to filter the output of mod_perl, or any CGI, to make the HTML smaller. This can't be slow or use much memory or CPU, or it would hurt the server and bring no advantage.

I tested HTML Tidy (http://tidy.sourceforge.net/) and saw that it is good for fixing bugs in the HTML and applying a style to it, not for cleaning the code!

Graciliano M. P.
"The creativity is the expression of the liberty".


what is your definition of clean code?
by g00n (Hermit) on Apr 23, 2003 at 03:43 UTC
    tidy intro - When editing HTML it's easy to make mistakes. Wouldn't it be nice if there was a simple way to fix these mistakes automatically and tidy up sloppy editing into nicely layed out markup?

    Could you tell me what your definition of cleaning the code is? Could you provide an example?

      HTML Tidy is an excellent little widget for checking that HTML conforms to the W3C HTML spec and fixing errors, as well as cleaning up indentation etc. It has nothing to do with Perl per se. As for examples, there are many. For example, Netscape/Mozilla is very particular about closing table tags. If you forget a </table> or have extra ones, then really odd stuff will happen or the page will simply fail to display. To generate your own examples, take some HTML, run it through tidy and RTFO, where O = Output and the rest has the usual meaning.

      cheers

      tachyon

      s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      As I said, I want to clean the code and make it smaller (this is for the browser, not for humans). The idea is to cut everything that does not represent anything visual in the browser and to rewrite some parts/tags with fewer bytes, such as dropping quotes where possible, removing extra spaces, etc.
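
      A rough sketch of that kind of byte-saving pass, under stated assumptions and not the module's real code: the names `shrink_tag` and `clean_html` are invented here, the input is assumed to be HTML rather than XHTML, and comments are assumed to carry no meaning (no conditional comments or server-side includes). Quotes are dropped only when the value uses the characters HTML 4 allows unquoted (letters, digits, `-`, `.`, `_`, `:`), and `<script>`, `<style>`, `<pre>` and `<textarea>` elements are copied verbatim.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite one opening tag with fewer bytes.
sub shrink_tag {
    my ($tag) = @_;
    # Only plain opening tags are touched; closing tags, doctypes and
    # self-closing forms such as "<br />" are returned unchanged.
    return $tag unless $tag =~ /^<([A-Za-z][A-Za-z0-9]*)((?:\s.*)?)>\z/s;
    my ( $name, $rest ) = ( $1, $2 );
    my $out = "<$name";
    while ( $rest =~ /\G\s*([^\s=\/>]+)(?:\s*=\s*("[^"]*"|'[^']*'|[^\s>]+))?/gc ) {
        my ( $attr, $value ) = ( $1, $2 );
        $out .= " $attr";
        next unless defined $value;    # attribute without a value
        # Drop the quotes only when the value is safe unquoted in HTML 4.
        $value = $2 if $value =~ /^(["'])([A-Za-z0-9._:-]+)\1\z/;
        $out .= "=$value";
    }
    # Anything left over was not understood: keep the original tag.
    my $tail = substr( $rest, pos($rest) || 0 );
    return $tag if $tail =~ /\S/;
    return "$out>";
}

# Hypothetical helper: clean a whole document, one token at a time.
sub clean_html {
    my ($html) = @_;
    my $out = '';
    while (
        $html =~ m{\G(?:
              (<!--.*?-->)                                 # 1: comment
            | (<(script|style|pre|textarea)\b.*?</\3\s*>)  # 2: element kept verbatim
            | (<(?:[^>"']|"[^"]*"|'[^']*')*>)              # 4: any other tag
            | ([^<]+|<)                                    # 5: text or a stray "<"
        )}gcsix
      )
    {
        if    ( defined $1 ) { next }                       # comments are dropped
        elsif ( defined $2 ) { $out .= $2 }
        elsif ( defined $4 ) { $out .= shrink_tag($4) }
        else  { ( my $text = $5 ) =~ s/\s+/ /g; $out .= $text }
    }
    return $out;
}

print clean_html(qq{<a href="x.html"   title='Go'>  Home  </a>\n});
# prints: <a href=x.html title=Go> Home </a> (followed by a single space)
```

      Runs of whitespace are collapsed to one space rather than removed, since removing the space between two inline elements can change what the browser displays.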

      Example? Run the code from the main node on the www.cnn.com.br URL and compare the cleaned code with the original.

      Graciliano M. P.
      "The creativity is the expression of the liberty".