Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I come to the gates of the Monks, asking forgiveness and help.

I am trying to use HTML::TreeBuilder to remove ALL HTML tags from a document other than tags that would put spaces or line breaks in. Those should be replaced with <br> or <p>.

Can anyone help me?

20050612 Janitored by Corion: Added formatting

Re: HTML::TreeBuilder Remove all HTML
by revdiablo (Prior) on Jun 12, 2005 at 21:27 UTC

    HTML::TreeBuilder is a fine module for parsing and working with HTML documents, but in this case you might want to look into using HTML::FormatText instead. It seems to do exactly what you're looking for, which will likely save you time and effort.
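A minimal sketch of that route, assuming HTML::TreeBuilder and HTML::FormatText are both installed from CPAN. The helper name html_to_text and the margin values are illustrative choices, not anything given in the thread. Note that the result is plain text, so paragraph and line breaks come out as newlines rather than as <br> or <p> markup.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# Hypothetical helper: render an HTML string as plain text.
sub html_to_text {
    my ($html) = @_;

    # Parse the markup into an HTML::Element tree.
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # Margins are arbitrary illustrative values.
    my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
    my $text      = $formatter->format($tree);

    $tree->delete;    # release the parse tree
    return $text;
}
```

If <br> and <p> are needed in the final output, the newlines in the returned text would still have to be converted back in a separate step.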

Re: HTML::TreeBuilder Remove all HTML
by borisz (Canon) on Jun 12, 2005 at 19:25 UTC
Re: HTML::TreeBuilder Remove all HTML
by GrandFather (Saint) on Jun 12, 2005 at 21:44 UTC
    Or, if you really want to use HTML::TreeBuilder, something like this may do the trick:
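One way this could look, as a hedged sketch rather than the code posted with this reply: walk the parsed tree, keep the text, keep <p>, and emit <br> after elements that normally end a line. The helper names flatten_node and strip_to_breaks, and the particular set of line-breaking tags, are assumptions made for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::Entities qw(encode_entities);

# Assumed set of elements whose end implies a line break; adjust to taste.
my %BREAK_AFTER = map { $_ => 1 } qw(br div li tr h1 h2 h3 h4 h5 h6);

# Elements whose contents should not appear in the output at all.
my %SKIP = map { $_ => 1 } qw(head script style);

# Hypothetical helper: append the flattened children of $node to $$out.
sub flatten_node {
    my ($node, $out) = @_;
    for my $child ($node->content_list) {
        if (!ref $child) {
            # Text nodes are stored decoded, so re-encode markup characters.
            $$out .= encode_entities($child, '<>&');
            next;
        }
        my $tag = $child->tag;

        # Skip unwanted elements and pseudo-elements such as ~comment.
        next if $SKIP{$tag} || $tag =~ /^~/;

        if ($tag eq 'p') {
            $$out .= '<p>';
            flatten_node($child, $out);
            $$out .= '</p>';
        }
        else {
            # Drop the tag itself, keep its contents.
            flatten_node($child, $out);
            $$out .= '<br>' if $BREAK_AFTER{$tag};
        }
    }
    return;
}

# Hypothetical entry point: HTML string in, text with only <br>/<p> out.
sub strip_to_breaks {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $out  = '';
    flatten_node($tree, \$out);
    $tree->delete;    # release the parse tree
    return $out;
}
```

Calling strip_to_breaks($html) on a whole document returns the body text with every tag removed except the <br> and <p> markers.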


    Perl is Huffman encoded by design.
Re: HTML::TreeBuilder Remove all HTML
by TedPride (Priest) on Jun 12, 2005 at 20:21 UTC
    Replace the tags that put spaces or line breaks in with non-tag characters or patterns, then remove all tags with the module specified above, and replace the non-tag characters or patterns with <br> and <p>.
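The three steps above can be sketched in core Perl as follows. This assumes the control characters \x01 to \x03 never occur in the input, and it uses a crude regex as a stand-in for the tag-removal module mentioned in this reply; a real HTML stripper would handle comments, scripts, and ">" inside attribute values far better. The name keep_breaks_only is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only <br> and <p>, drop every other tag.
sub keep_breaks_only {
    my ($html) = @_;

    # Placeholders: assumed never to appear in the input text.
    my ($BR, $P_OPEN, $P_CLOSE) = ("\x01", "\x02", "\x03");

    # Step 1: swap the tags worth keeping for non-tag placeholders.
    $html =~ s{<br\b[^>]*>}{$BR}gi;
    $html =~ s{<p\b[^>]*>}{$P_OPEN}gi;
    $html =~ s{</p\s*>}{$P_CLOSE}gi;

    # Step 2: remove all remaining tags (regex stand-in, not a real parser).
    $html =~ s{<[^>]*>}{}g;

    # Step 3: turn the placeholders back into markup.
    $html =~ s{$BR}{<br>}g;
    $html =~ s{$P_OPEN}{<p>}g;
    $html =~ s{$P_CLOSE}{</p>}g;

    return $html;
}
```

For example, keep_breaks_only('<div><p>Hi <b>there</b><br/>you</p></div>') returns '<p>Hi there<br>you</p>'. Attributes on the kept tags are discarded along the way.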
Re: HTML::TreeBuilder Remove all HTML
by tphyahoo (Vicar) on Jun 13, 2005 at 08:05 UTC
    This is unclear to me. Do you mean remove all tags except for <br>, <p> and &nbsp;? (Where &nbsp; isn't even really a tag!) Or do you mean something else?