Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Yeah, so I basically want to emulate lynx -dump in Perl. I know about getting a page via LWP and whatnot, but I cannot find a function that turns an HTML page into plain ASCII, just stripping the useless HTML tags and sticking \n's where <br> is, etc. Is there a module that will do this?

Replies are listed 'Best First'.
RE: making like `lynx -dump`
by Anonymous Monk on Mar 13, 2000 at 22:13 UTC
    You can try HTML::Parser or HTML::TokeParser.
      I found a really good one, actually: HTML::FormatText. Works like a charm!

      Yeah Tim, I was doing something like that previously, but my application has gotten a little more critical, and because of network lapses etc. I need to make sure I actually get the file. So I had the choice of either writing a wrapper for lynx or writing a more flexible page-getting program... I went with the latter.

      Thanks for all your help, it's greatly appreciated.
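
For anyone landing here later, this is a minimal sketch of how HTML::FormatText is commonly driven, assuming HTML::TreeBuilder and HTML::FormatText are installed from CPAN. The helper name html_to_text and the margin values are made up for illustration and are not from the posts above.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# Hypothetical helper (name is illustrative): render an HTML string
# as wrapped plain text.
sub html_to_text {
    my ($html) = @_;

    # Parse the markup into a tree; the formatter walks this tree.
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # Margins are arbitrary choices for this sketch.
    my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
    my $text = $formatter->format($tree);

    $tree->delete;    # release the parse tree explicitly
    return $text;
}

print html_to_text('<h1>Title</h1><p>First line<br>second line</p>');
```

The HTML string would normally come from LWP (for example LWP::Simple's get, which returns undef on failure), so a retry loop for network lapses can wrap the fetch without touching the formatting step.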
RE: making like `lynx -dump`
by vroom (His Eminence) on Mar 14, 2000 at 00:07 UTC
Re: making like `lynx -dump`
by Anonymous Monk on Mar 14, 2000 at 01:47 UTC
    Well, you can do this (assuming that you've slurped the entire page into $page):
    $page =~ s{< \s* (?:BR|P) \b [^>]* >}{\n}gix;   # <br> and <p> become newlines
    $page =~ s{< [^>]* >}{}gx;                       # strip every remaining tag
    Then again, as Vroom said, if you have lynx on the system, it's much easier to just use the backquotes.
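
A rough sketch of the shell-out approach, assuming lynx is installed and on the PATH. It uses a list-form pipe open instead of literal backquotes so the URL never passes through a shell. The helper name dump_with_lynx and its optional command override are hypothetical, added only to make the sketch easy to try.

```perl
use strict;
use warnings;

# Hypothetical helper: run an external dumper on a URL and return its output.
# Defaults to `lynx -dump`; extra arguments replace that command.
sub dump_with_lynx {
    my ($url, @cmd) = @_;
    @cmd = ('lynx', '-dump') unless @cmd;

    # List-form pipe open: no shell is involved, so odd characters
    # in the URL cannot be interpreted as shell syntax.
    open(my $fh, '-|', @cmd, $url)
        or die "cannot run $cmd[0]: $!";

    local $/;                 # slurp mode: read all output at once
    my $text = <$fh>;

    close($fh)
        or die "$cmd[0] failed (wait status $?)";
    return $text;
}

# Only runs when a URL is given on the command line.
print dump_with_lynx($ARGV[0]) if @ARGV;
```

This keeps the convenience of letting lynx do the rendering, but it still depends on lynx being present on every machine the script runs on.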
Re: making like `lynx -dump`
by btrott (Parson) on Mar 14, 2000 at 04:48 UTC
    Take a look at the program that runs news.perl.org. The mailing list message is created by some Perl code that produces output quite similar to lynx -dump.

    Specifically, look towards the bottom of the file for the HTML::FormatText::AddRefs package; then look at the get_mail_text routine.