PerlMonks  

Re: How to save a web page directly to plain text?

by jdporter (Paladin)
on Mar 19, 2003 at 06:35 UTC ( [id://244264]=note )


in reply to How to save a web page directly to plain text?

There is! One very easy way is to use the lynx text-mode browser to retrieve and save the file, using its -dump option. Not only does it print just the text, it also attempts to format it (somewhat crudely) according to the HTML markup. lynx is available for most platforms, but unless you already have it installed, you might not consider this option "easy".
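
If you'd rather drive lynx from a Perl script than from the command line, something like the following sketch might do. It assumes lynx is installed and on your PATH; the name lynx_dump is made up for illustration and is not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text of $url as rendered by "lynx -dump".
# Assumes a lynx binary can be found on the PATH.
sub lynx_dump {
    my ($url) = @_;

    # List-form pipe open runs lynx directly, so the URL never passes
    # through a shell and needs no quoting.
    open( my $lynx, '-|', 'lynx', '-dump', $url )
        or die "Cannot run lynx: $!\n";

    # Slurp everything lynx prints.
    my $text = do { local $/; <$lynx> };

    # close() on a pipe returns false if the command exited non-zero.
    close $lynx
        or die "lynx failed for $url (exit status $?)\n";

    return $text;
}
```

Calling print lynx_dump($some_url) would then send the formatted text to STDOUT, from where you can redirect it to a file.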

Another way is to use LWP (or LWP::Simple) to retrieve the file, and one of the HTML parsing modules (such as HTML::TreeBuilder) to extract the text from it. For example:
    my $URL = shift or die "Usage: $0 URL\n";
    use LWP::Simple;
    use HTML::TreeBuilder;
    print HTML::TreeBuilder
        ->new_from_content( get( $URL ) or die "Error getting $URL\n" )
        ->as_trimmed_text;
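
If you want to know why a fetch failed, a slightly longer variant could use LWP::UserAgent in place of LWP::Simple. This is only a sketch: html_to_text and fetch_html are names invented here, and the 30-second timeout is an arbitrary choice.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;

# Hypothetical helper: reduce an HTML string to whitespace-trimmed text.
sub html_to_text {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $text = $tree->as_trimmed_text;
    $tree->delete;    # release the parse tree explicitly
    return $text;
}

# Hypothetical helper: fetch $url, dying with the HTTP status line on failure.
sub fetch_html {
    my ($url) = @_;
    my $ua       = LWP::UserAgent->new( timeout => 30 );    # arbitrary timeout
    my $response = $ua->get($url);
    die "Error getting $url: ", $response->status_line, "\n"
        unless $response->is_success;
    return $response->decoded_content;
}
```

With those two in place, print html_to_text( fetch_html($URL) ), "\n"; does the same job as the one-liner, but reports the HTTP status when something goes wrong.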

jdporter
The 6th Rule of Perl Club is -- There is no Rule #6.
