http://qs1969.pair.com?node_id=244253

tsumay has asked for the wisdom of the Perl Monks concerning the following question:

Is there any way to save web pages directly as plain text, instead of saving them as HTML and then stripping out the tags? That is what I'm doing right now, and I thought there has to be a more efficient way. Is there?

2006-10-25 Retitled by planetscape, as per Monastery guidelines: one-word (or module-only) titles hinder site navigation

Original title: 'HTMLtoText'

Re: How to save a web page directly to plain text?
by jdporter (Paladin) on Mar 19, 2003 at 06:35 UTC
    There is! One very easy way is to use the lynx text-mode browser to retrieve and save the file, using its -dump option. It not only prints just the text, but also attempts to format it (somewhat crudely) according to the HTML markup. lynx is available for most platforms, but unless you already have it installed, you might not consider this option "easy".
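    From a shell, that is just lynx -dump URL with the output redirected to a file. If you would rather drive it from Perl, here is a minimal sketch. It assumes lynx is on your PATH; the sub names lynx_dump_command and lynx_dump are made up for illustration, and the list form of a piped open needs a reasonably recent Perl on Windows.

```perl
use strict;
use warnings;

# Build the command line for lynx. -dump is the only option used here.
sub lynx_dump_command {
    my ($url) = @_;
    return ('lynx', '-dump', $url);
}

# Run lynx and return everything it printed.
# Dies if lynx cannot be started or exits with an error.
sub lynx_dump {
    my ($url) = @_;
    # List-form piped open: no shell is involved, so the URL needs no quoting.
    open my $lynx, '-|', lynx_dump_command($url)
        or die "Cannot run lynx: $!\n";
    local $/;                 # slurp mode: read all output in one go
    my $text = <$lynx>;
    close $lynx or die "lynx exited with status $?\n";
    return $text;
}

# Only fetch when run as a script with exactly one argument (the URL).
if (@ARGV == 1 && !caller) {
    print lynx_dump($ARGV[0]);
}
```

    Run it with the URL as the only argument and redirect the output wherever you want the text to go.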

    Another way is to use LWP (or LWP::Simple) to retrieve the file, and one of the HTML parsing modules (such as HTML::TreeBuilder) to parse the text out of it. For example:
    my $URL = shift or die "Usage: $0 URL\n";
    use LWP::Simple;
    use HTML::TreeBuilder;
    print HTML::TreeBuilder
        ->new_from_content( get( $URL ) or die "Error getting $URL\n" )
        ->as_trimmed_text;
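    If the goal is a saved text file rather than output on the terminal, the same two modules can be wrapped up roughly like this. It is only a sketch: the sub names html_to_text and save_page_as_text are invented for the example, and writing the file as UTF-8 is an assumption about what you want on disk.

```perl
use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::TreeBuilder;

# Turn an HTML string into plain text with whitespace collapsed.
sub html_to_text {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $text = $tree->as_trimmed_text;
    $tree->delete;            # release the parse tree explicitly
    return $text;
}

# Fetch $url and write its text to $outfile (both chosen by the caller).
sub save_page_as_text {
    my ($url, $outfile) = @_;
    my $html = get($url);
    defined $html or die "Error getting $url\n";
    # UTF-8 output is an assumption; pick whatever encoding you need.
    open my $fh, '>:encoding(UTF-8)', $outfile
        or die "Cannot write $outfile: $!\n";
    print {$fh} html_to_text($html), "\n";
    close $fh or die "Cannot close $outfile: $!\n";
}

# Only fetch when run as a script with two arguments: URL and output file.
if (@ARGV == 2 && !caller) {
    save_page_as_text(@ARGV);
}
```

    Note that as_trimmed_text runs all the text together on one line, so you lose paragraph breaks; that may or may not matter for what you are doing.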

    jdporter
    The 6th Rule of Perl Club is -- There is no Rule #6.

Re: How to save a web page directly to plain text?
by allolex (Curate) on Mar 19, 2003 at 06:42 UTC

    How many pages? All browsers that I have ever used allow saving to plain text format. I frequently use lynx for this sort of thing. For pages that use frames (and offer no alternative), links is even better, and its --dump option works the same way.

    --
    Allolex

Re: How to save a web page directly to plain text?
by drfrog (Deacon) on Mar 19, 2003 at 18:51 UTC
    Depending on how much work you have to do,
    you might want HTML::Parser to step up for you.

    More on it at CPAN, of course.
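
    For what it's worth, a bare-bones text extractor with HTML::Parser might look something like the sketch below. The sub name text_via_parser is made up, and joining chunks with a single space is just one choice: it keeps separate paragraphs apart, but it will also put a space wherever the markup changes in the middle of a word.

```perl
use strict;
use warnings;
use HTML::Parser;

# Collect the text of an HTML string, skipping <script> and <style> content.
sub text_via_parser {
    my ($html) = @_;
    my @chunks;
    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' hands the callback text with entities like &amp; decoded.
        text_h => [ sub { push @chunks, shift }, 'dtext' ],
    );
    $parser->unbroken_text(1);                  # one chunk per run of text
    $parser->ignore_elements(qw(script style)); # drop code and CSS
    $parser->parse($html);
    $parser->eof;                               # flush any buffered text
    my $text = join ' ', @chunks;
    $text =~ s/\s+/ /g;                         # collapse whitespace runs
    $text =~ s/^\s+|\s+$//g;                    # trim both ends
    return $text;
}

# Only run when invoked as a script: reads HTML on STDIN, prints text.
if (!caller && !-t STDIN) {
    local $/;
    my $html = <STDIN>;
    print text_via_parser($html), "\n" if defined $html;
}
```

    It is more typing than the tree-based route, but you get to decide exactly which elements count as text.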