http://qs1969.pair.com?node_id=313685

TVSET has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

This might sound like a FAQ to you, but I couldn't find the right answer yet.

My application needs fast and easy conversion between HTML and text. Until now I have only converted formatted text to HTML (with HTML::FromText), which works great for me. Converting from HTML back to text, though, seems to be a problem. I've looked at modules like HTML::Parser and HTML::TokeParser, but they seem too complex and troublesome for what I need.

Any help is greatly appreciated. :)

Update:

First of all, I would like to thank everyone for help.

Secondly, after looking at my problem from a slightly different point of view, it appears that I do not need the HTML-to-text conversion at all. I just need a text version and a flag indicating whether I should pipe that text through HTML::FromText or not. Practice makes perfect, I guess. :)

Re: HTML <=> Text convertion
by Roger (Parson) on Dec 10, 2003 at 11:32 UTC
    You could have a look at HTML::Strip (and optionally Text::Autoformat) on CPAN. I have used both and was very happy with the results.

Re: HTML <=> Text convertion
by Aragorn (Curate) on Dec 10, 2003 at 11:32 UTC
    You can use an external text-browser like lynx to do the hard work for you. Open a pipe to lynx -dump <url> and read the resulting text-rendered page.

    Arjen

        Well, if it's your last resort, you are wasting a huge amount of your time and effort. Lazy Programmers -- and you do aspire to be one -- always use the quickest solution first.

        I prefer w3m -dump over lynx for generating plain text from HTML. It handles tables properly. It runs CGI locally for testing HTML output.

        If you want text that you can reformat easily, use the -cols option. It's your friend for stripping markup.

        --
        bowling trophy thieves, die!
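A minimal sketch of the "open a pipe to a text browser" idea from the two posts above. The helper name `render_with` is made up for illustration, and the sketch assumes lynx or w3m is installed and on the PATH; only the `-dump` and `-cols` options mentioned above are used.

```perl
use strict;
use warnings;

# Run a text-mode browser on a URL or file and return the rendered text.
# @cmd is the browser command plus its options; the target is appended last.
# render_with is a hypothetical helper name, not part of any module.
sub render_with {
    my ($target, @cmd) = @_;
    # List-form pipe open bypasses the shell, so $target is never
    # interpolated into a shell command line.
    open(my $fh, '-|', @cmd, $target)
        or die "Cannot run $cmd[0]: $!";
    local $/;                 # slurp mode: read all output at once
    my $text = <$fh>;
    close($fh) or warn "$cmd[0] exited with status $?";
    return $text;
}

# Typical calls (not run here, since they need the browsers installed):
#   my $text = render_with('http://www.example.com/', 'lynx', '-dump');
#   my $text = render_with('page.html', 'w3m', '-dump', '-cols', '72');
```

The list form of open is used so that a URL containing shell metacharacters cannot be misread as part of a command line.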

Re: HTML <=> Text convertion
by Anonymous Monk on Dec 10, 2003 at 11:14 UTC
    Converting HTML to text is tricky; stripping HTML is easier, after which you can format the text any way you want.
Re: HTML <=> Text convertion
by dragonchild (Archbishop) on Dec 10, 2003 at 14:04 UTC
    This problem sounds like it suffers from a "Lack of Specification". You indicate that you want to convert back and forth between plaintext and HTML. However, there's a reason why there are two formats: they do different things. I ran into this when attempting to design a Document::Template that would handle PDF, Excel, and other formats. PDF and Excel are sufficiently different that it makes no sense, and HTML is even worse. A better question would be "How can I convert an HTML table into fixed-width columns and back again?" This is an easily solvable problem. (I could have a solution in 20 minutes and under 250 characters ... golf, anyone?)

    Now, mentioning PDF brings up another idea: there are HTML => PDF converters and PDF => plaintext converters. There are also plaintext => PDF converters, but no PDF => HTML converters (that I'm aware of). A big problem with converting from XXX => HTML is that HTML is a non-deterministic format. I find it easier to consider HTML a "hinting format" instead of a "defining format" (like PDF). Browsers, to be compliant, are free to implement whatever they want, so long as they implement something. (This is how you have HTML-x compliant browsers for the blind.)

    ------
    We are the carpenters and bricklayers of the Information Age.

    Please remember that I'm crufty and crotchety. All opinions are purely mine and all code is untested, unless otherwise specified.

      This problem sounds like it suffers from a "Lack of Specification".

      Thank you! After thinking a bit more about my problem specification, I realized that I do not need an HTML-to-text conversion. All I need is a text version and a flag that indicates whether any HTML formatting should be done (with the HTML::FromText module). Excellent!

      I have also updated the original post, just in case I am ever tempted to make the same mistake again. :)
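The "text plus a flag" design described here could look roughly like the sketch below. The record layout (`text`, `as_html`) and the names `render_entry` and `escape_only` are invented for illustration; `escape_only` is only a stand-in so the example runs without CPAN modules.

```perl
use strict;
use warnings;

# Stand-in converter that only escapes HTML metacharacters. In the real
# application this would presumably be HTML::FromText's text2html.
sub escape_only {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;     # escape & first so later entities stay intact
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# $entry is a hypothetical record: { text => '...', as_html => 0 or 1 }.
# $to_html is an optional code reference that turns text into HTML.
sub render_entry {
    my ($entry, $to_html) = @_;
    $to_html ||= \&escape_only;
    return $entry->{as_html} ? $to_html->($entry->{text}) : $entry->{text};
}

my $entry = { text => 'Tom & Jerry <live>', as_html => 1 };
print render_entry($entry), "\n";   # prints: Tom &amp; Jerry &lt;live&gt;
```

With HTML::FromText installed, the converter could be passed in as, for example, `sub { text2html($_[0], paras => 1) }`. Only the plain text is ever stored, so nothing has to be converted back from HTML.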

Re: HTML <=> Text convertion
by bart (Canon) on Dec 10, 2003 at 17:46 UTC
Re: HTML <=> Text convertion
by thpfft (Chaplain) on Dec 11, 2003 at 00:10 UTC
    use HTML::TagFilter;

    # An empty allow list means no tags are permitted, so all markup is removed.
    my $tf = HTML::TagFilter->new(
        allow          => {},
        strip_comments => 1,
    );
    my $text = $tf->filter($html);

    Compared to other offspring of the Parser, HTML::TagFilter is rather limited: it only does one thing, but it does it reasonably well. I think so anyway, but I wrote it :)

Re: HTML <=> Text convertion
by smishra (Novice) on Dec 10, 2003 at 15:05 UTC
    I have used HTML::FormatText to go from HTML to Text. It needs HTML::TreeBuilder also.
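A short sketch of the HTML::FormatText route, assuming both CPAN modules are installed. The wrapper name `html_to_text`, the margin values, and the `eval` guard (which returns undef when the modules are missing) are additions for this example.

```perl
use strict;
use warnings;

# Returns the plain-text rendering of $html, or undef when the CPAN
# modules cannot be loaded (the guard exists only for this sketch).
sub html_to_text {
    my ($html) = @_;
    return undef unless eval {
        require HTML::TreeBuilder;
        require HTML::FormatText;
        1;
    };
    my $tree      = HTML::TreeBuilder->new_from_content($html);
    my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
    my $text      = $formatter->format($tree);
    $tree->delete;            # free the parse tree explicitly
    return $text;
}

my $text = html_to_text('<h1>Title</h1><p>Some <b>bold</b> text.</p>');
print defined $text ? $text : "HTML::FormatText is not installed\n";
```

Unlike plain tag stripping, the formatter lays out headings and paragraphs and wraps lines to the right margin.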
Re: HTML <=> Text convertion
by Anonymous Monk on Dec 10, 2003 at 23:49 UTC
    if you just want to strip tags and convert entities:
    s/<.*?>//g; s/&gt;/>/g; s/&lt;/</g; s/&apos;/'/g; s/&quot;/"/g; s/&amp;/&/g;
    may do the trick
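The same substitutions wrapped in a function, as a rough sketch. The name `strip_html` is made up, and the `/s` modifier is an addition so that a tag broken across lines is still removed. It does not handle comments, script or style contents, a `>` inside an attribute value, or numeric entities.

```perl
use strict;
use warnings;

# Strip tags and decode the five basic entities from a string.
sub strip_html {
    my ($s) = @_;
    $s =~ s/<.*?>//gs;      # /s lets a tag span several lines
    $s =~ s/&lt;/</g;
    $s =~ s/&gt;/>/g;
    $s =~ s/&apos;/'/g;
    $s =~ s/&quot;/"/g;
    $s =~ s/&amp;/&/g;      # decode &amp; last so "&amp;lt;" becomes "&lt;"
    return $s;
}

print strip_html('<p>1 &lt; 2 &amp; "x"</p>'), "\n";   # prints: 1 < 2 & "x"
```

Decoding `&amp;` last matters: if it ran first, an escaped entity such as `&amp;lt;` would be decoded twice and end up as `<`.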