You can try HTML::Parser or HTML::TokeParser.
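To make that suggestion concrete, here is a minimal sketch of pulling the plain text out of a page with HTML::TokeParser (the sample `$html` string is made up for illustration; the module must be installed from CPAN, though it ships with most perl distributions as part of HTML-Parser):

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = '<html><body><h1>Title</h1><p>Hello, world.</p></body></html>';

# new() accepts a reference to a string holding the document
my $parser = HTML::TokeParser->new(\$html);

my @chunks;
while (my $token = $parser->get_token) {
    # token type 'T' is plain text between tags
    push @chunks, $token->[1] if $token->[0] eq 'T';
}
my $text = join ' ', @chunks;
```

Working token by token like this also lets you skip the contents of `<script>` or `<style>` blocks, which a naive tag-stripping regex cannot do.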
I found a really good one, actually: HTML::FormatText. Works like a charm!
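For anyone following along, a minimal sketch of the HTML::FormatText approach (it formats a parse tree, so HTML::TreeBuilder is needed alongside it; the sample `$html` string is made up):

```perl
use strict;
use warnings;
use HTML::TreeBuilder;   # HTML::FormatText wants a parsed tree, not raw HTML
use HTML::FormatText;

my $html = '<html><body><h1>Title</h1><p>Hello, world.</p></body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);
my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
my $text = $formatter->format($tree);
$tree->delete;   # free the parse tree's circular references
```

The output is word-wrapped plain text, much like what lynx -dump produces.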
Yeah Tim, I was doing something like that previously, but my application has gotten a little more critical, and due to network lapses etc. I need to make sure I get the file. So I had the choice of either writing a wrapper for lynx or writing a more flexible page-getting program... I went with the latter.
Thanks for all your help; it's greatly appreciated.
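One way to handle those network lapses is to wrap the fetch in a retry loop. A minimal sketch, assuming you fetch with LWP::UserAgent or similar; `fetch_with_retries` is a hypothetical helper name, not from any module:

```perl
use strict;
use warnings;

# Hypothetical helper: retry a fetch coderef a few times before
# giving up, to ride out transient network failures.
sub fetch_with_retries {
    my ($fetch, $tries) = @_;
    for my $attempt (1 .. $tries) {
        my $content = eval { $fetch->() };
        return $content if defined $content;
        warn "attempt $attempt failed: $@";
        sleep 1 if $attempt < $tries;   # crude pause between attempts
    }
    return undef;   # every attempt failed
}

# With LWP::UserAgent installed, the coderef might look like:
#   my $ua = LWP::UserAgent->new(timeout => 10);
#   my $page = fetch_with_retries(sub {
#       my $res = $ua->get($url);
#       die $res->status_line, "\n" unless $res->is_success;
#       return $res->decoded_content;
#   }, 3);
```

The coderef dies on failure and returns the content on success, so the helper stays independent of whatever HTTP library you settle on.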
If you have lynx on the system you may as well just do
$text=`lynx -dump $url`;
vroom | Tim Vroom | vroom@cs.hope.edu
Well, you can do this (assuming that you've slurped the entire page to $page):
$page =~ s{< \s* BR \b .*? >|< \s* P \b .*? >}{$/}gisx;
$page =~ s{< .*? >}{}gisx;
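A self-contained version of the same idea (the sample page string is made up; the non-greedy matches and the `\b` keep each substitution inside a single tag, so a greedy `.*` can't swallow everything up to the last `>` on the page, and `<pre>` or `<body>` aren't mistaken for `<p>` or `<br>`):

```perl
use strict;
use warnings;

my $page = '<html><body><p>First line.<br>Second line.</p></body></html>';

# Turn <br> and <p> (and their closing forms) into newlines
$page =~ s{< \s* /? \s* (?: br | p ) \b [^>]* >}{$/}gisx;

# Strip every remaining tag; [^>]* stays within one tag
$page =~ s{< [^>]* >}{}gsx;
```

Bear in mind this regex approach still mangles pages with `<script>` blocks or `>` inside attribute values; for anything beyond quick one-offs, a real parser like HTML::TokeParser is safer.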
Then again, as Vroom said, if you have lynx on the system, it's much easier to just use the backquotes.
Take a look at the program that runs news.perl.org
--the mailing list message is created by some Perl code
that produces an output quite similar to lynx -dump.
Specifically, look towards the bottom of the file for
the HTML::FormatText::AddRefs package; then look at the
get_mail_text routine.