I was testing the module HTML::Clean in order to add a filter flag for the mod_perl output of HPL (another HTML/Perl embedding system). But when I read the source to see how the code is cleaned, I saw that the filter can make mistakes with complex HTML. So I decided to write my own filter, one that doesn't change the final result in the browser. I ran some tests with HTML::Clean and my new module, and found that mine leaves the rendered result unchanged and cleans more. (I used pages from www.cnn.com.br and www.perl.com, which have styles, JavaScript, etc.)

My point is not to say which one is better. Actually, the HTML::Clean idea of filtering through direct changes with regular expressions is good, since it uses less memory, but it can't know exactly what it is changing inside the HTML tree. On the other hand, we can't base a filter entirely on a parsed HTML tree, since that would be slow, which is not good for a server. My module is something between the two approaches: it tries to look only at the basic things that can be cleaned, without very complex ideas, to keep it fast.
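To make the "between the two approaches" idea concrete, here is a minimal sketch. It is not the code from my module and not how HTML::Clean works; the function name `clean_html_sketch` and the list of protected elements are assumptions chosen for illustration. It stays regex-based (no parsed tree), but it steps over elements whose whitespace matters, so those blocks come out unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration only; not part of HTML::Clean
# or of the module offered for testing in this post.
sub clean_html_sketch {
    my ($html) = @_;

    # Elements whose content must stay byte-for-byte intact
    # (an assumed list; extend it as needed).
    my $protected_re = join '|',
        map { "<$_\\b.*?</$_\\s*>" } qw(pre script style textarea);

    # One pass over the document: a protected block is copied through
    # untouched; any other run of whitespace becomes a single space.
    $html =~ s{($protected_re)|\s+}{ defined $1 ? $1 : ' ' }gise;

    return $html;
}

my $page = "<p>  Hello   world </p>\n\n<pre>  keep\n  this </pre>";
print clean_html_sketch($page), "\n";
```

Because the protected blocks are matched first, the whitespace rule never sees the inside of `<pre>` or `<script>`. A known limitation of this sketch: whitespace inside attribute values (for example a `title` with two spaces) is collapsed as well, so it is only a starting point.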

I have contacted the author (so far I have just sent an e-mail and am waiting for a reply) about updating HTML::Clean with the code I wrote. But the code is only 2 days old and needs testing. I would like the monks to test the code with some web sites and check that the output looks the same in the browser. Any ideas to make the filter better, or other comments, are gladly accepted!

To test, get: http://www.inf.ufsc.br/~gmpassos/htmlclean.zip
It is very small: the test script has only 2 files, and you don't need to install anything (no extra modules) in your Perl.

Graciliano M. P.
"The creativity is the expression of the liberty".


In reply to Cleanning HTML - New/better module for that - test please! ;-P by gmpassos
