Your Mother has asked for the wisdom of the Perl Monks concerning the following question:

I'm embarrassed that I don't know the answer to this, since I work on this stuff all the time. I suspect there is already something that does just what I want, which is to turn HTML into plain text while maintaining some semblance of layout semantics. For example, an HTML list item would become

    * List item

and links would be shown with their href attribute. I have no need to handle tables with this.

So, please forgive me if I'm being a dummy. Is there a script (to look at) or a package for this? I can do it from scratch but would rather not if it's already out there.

Re: Strip HTML, while preserving layout, with core(-ish) modules
by oko1 (Deacon) on Apr 08, 2008 at 22:47 UTC

    Non-Perlish, but I usually use 'lynx -dump <filename>' if I don't care about layout, or 'w3m -dump <filename>' if I do. Unfortunately, the latter does not handle the links in the way you've asked, while the former does.

    
    -- 
    Human history becomes more and more a race between education and catastrophe. -- HG Wells
    

      Thanks! (The first draft of my question contained a reference to lynx, though I haven't played around with it yet.)

Re: Strip HTML, while preserving layout, with core(-ish) modules
by ww (Archbishop) on Apr 08, 2008 at 22:43 UTC

    Since I can't think of anything appropriate that's core, some "core-ish" modules may have to do. If you're on w32 and using ActiveState, try:

    ppm search HTML

    HTML-Content-Extractor (hyphens OK, not "::"), HTML-TagReader, and YAPE-HTML are just a few that may be relevant, but I suspect you'll have to code your own semantic conversions.

    If you're on a *nix-ish OS, search CPAN for "HTML" likewise.
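
One way to "code your own semantic conversions" is a small set of HTML::Parser handlers. What follows is only a rough sketch, not something posted in this thread: `html_to_text` is a made-up helper name, the list of block-level tags and the ` <href>` link style are assumptions, and HTML::Parser is a CPAN module (commonly installed alongside LWP), not core.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper (illustration only): HTML string in, plain text out.
sub html_to_text {
    my ($html) = @_;
    my $out = '';
    my @hrefs;    # href values of the <a> tags that are currently open

    # Assumed set of tags that should be surrounded by blank lines.
    my %block = map { $_ => 1 } qw(p div ul ol h1 h2 h3 h4 h5 h6);

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [ sub {
            my ($tag, $attr) = @_;
            if    ($tag eq 'li') { $out .= "\n    * " }
            elsif ($tag eq 'br') { $out .= "\n" }
            elsif ($tag eq 'a')  { push @hrefs, $attr->{href} }
            elsif ($block{$tag}) { $out .= "\n\n" }
        }, 'tagname, attr' ],
        end_h => [ sub {
            my ($tag) = @_;
            if ($tag eq 'a') {
                # Show the link target right after the link text.
                my $href = pop @hrefs;
                $out .= " <$href>" if defined $href;
            }
            elsif ($block{$tag}) { $out .= "\n\n" }
        }, 'tagname' ],
        text_h => [ sub {
            my ($text) = @_;
            $text =~ s/\s+/ /g;    # collapse runs of whitespace
            # Avoid doubled spaces after a bullet or a line break.
            $text =~ s/^ // if $out eq '' || $out =~ /\s\z/;
            $out .= $text;
        }, 'dtext' ],
    );
    $parser->unbroken_text(1);                   # deliver text runs whole
    $parser->ignore_elements(qw(script style));  # drop non-content elements
    $parser->parse($html);
    $parser->eof;

    $out =~ s/[ \t]+\n/\n/g;     # strip trailing blanks on each line
    $out =~ s/\n{3,}/\n\n/g;     # at most one blank line in a row
    $out =~ s/\A\n+//;           # no leading blank lines
    $out =~ s/\s+\z/\n/;         # exactly one trailing newline
    return $out;
}

my $html = '<p>See <a href="http://example.com/">the docs</a>.</p>'
         . '<ul><li>List item</li><li>Another</li></ul>';
print html_to_text($html);
```

List items come out as indented `* ` lines and each link's text is followed by its href in angle brackets. Nested lists, `<pre>` blocks and tables are ignored here, so this is a starting point rather than a replacement for lynx or w3m.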

Re: Strip HTML, while preserving layout, with core(-ish) modules
by perrin (Chancellor) on Apr 09, 2008 at 13:22 UTC

    I've used a couple of random html2txt Perl scripts I found through Google. They were okay, but the formatting wasn't great. I think w3m will do a better job.