Obtaining Text from a website

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: Obtaining Text from a website by marto (Cardinal) on Jan 28, 2009 at 15:09 UTC
It is possible, but it depends on what you mean by "some text". For example, WWW::Mechanize's `$mech->content()` method can return a text-only version of the page via `$mech->content( format => 'text' )`, but I suspect you want to parse out some specific information rather than the entire page. There are several modules for parsing HTML, such as HTML::Parser and HTML::TokeParser; see also the thread "HTML::TokeParser help - parsing headlines", or use Super Search to find more examples.
Martin
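As a rough sketch of the "parse out some specific information" route, here is one way HTML::TokeParser could pull headlines out of a page. The sample HTML, the choice of `h2` as the headline tag, and the helper name `extract_headlines` are all assumptions made for illustration; in a real script the string would come from something like `$mech->content()`.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample page; a real script would fetch this instead.
my $html = <<'HTML';
<html><body>
<h2>First headline</h2>
<p>Some body text.</p>
<h2>Second <em>headline</em></h2>
</body></html>
HTML

# Return the text of every <h2> element in the given HTML string.
sub extract_headlines {
    my ($doc) = @_;

    # Passing a scalar reference tells the parser to read from the string.
    my $parser = HTML::TokeParser->new( \$doc )
        or die "Cannot create parser\n";

    my @headlines;
    # get_tag skips forward to the next <h2> start tag (undef at end of input).
    while ( $parser->get_tag('h2') ) {
        # Collect all text up to the closing </h2>, whitespace-trimmed.
        push @headlines, $parser->get_trimmed_text('/h2');
    }
    return @headlines;
}

print "$_\n" for extract_headlines($html);
```

If the headlines on your page live in a different tag, change the two tag names accordingly.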
Re: Obtaining Text from a website by whakka (Hermit) on Jan 28, 2009 at 15:14 UTC
Yep. Look through WWW::Mechanize's documentation, as it's a high-level module that automates web scripting. Also consider learning an HTML parser (such as HTML::TreeBuilder, which uses HTML::Parser to build a tree structure out of the HTML that you can then traverse) if you need something more than a simple snippet you can get with regular expressions. Finally, if you need more assistance, ask a specific question here (although make sure you first try whatever you're asking for help with!) and pick up a book. The Perl Cookbook is pretty popular and has an entire chapter on web automation.
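To give an idea of what traversing the tree looks like, here is a minimal sketch using HTML::TreeBuilder to list the links on a page. The sample HTML and the helper name `extract_links` are hypothetical; only `new_from_content`, `look_down`, `attr`, `as_text` and `delete` come from the module itself.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample page; a real script would fetch this instead.
my $page = <<'HTML';
<html><body>
<p><a href="/docs">Read the <b>docs</b></a></p>
<p><a name="top">Anchor without href</a></p>
<p><a href="http://example.com/faq">FAQ</a></p>
</body></html>
HTML

# Return a list of { href, text } hashes, one per <a> that has an href.
sub extract_links {
    my ($doc) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($doc);

    my @links;
    # look_down walks the tree and returns every element matching the criteria.
    for my $link ( $tree->look_down( _tag => 'a', sub { defined $_[0]->attr('href') } ) ) {
        push @links, { href => $link->attr('href'), text => $link->as_text };
    }

    $tree->delete;    # free the tree explicitly once we are done with it
    return @links;
}

printf "%s -> %s\n", $_->{text}, $_->{href} for extract_links($page);
```

Note that `as_text` flattens nested markup, which is the sort of thing that gets awkward quickly with regular expressions.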
Re: Obtaining Text from a website by CountZero (Bishop) on Jan 28, 2009 at 22:55 UTC
If you want to get down to the basics of obtaining web pages, have a look at LWP::UserAgent.
CountZero
"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
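For reference, a bare-bones fetch with LWP::UserAgent might look something like the sketch below. The helper name `fetch_page`, the timeout and the user-agent string are assumptions chosen for illustration, not anything prescribed by the module.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Fetch a URL and return its body, or die with the HTTP status line.
sub fetch_page {
    my ($url) = @_;

    # Timeout and agent string are arbitrary illustrative values.
    my $ua = LWP::UserAgent->new(
        timeout => 10,
        agent   => 'text-fetch-sketch/0.1',
    );

    my $response = $ua->get($url);
    die "Failed to fetch $url: " . $response->status_line . "\n"
        unless $response->is_success;

    # decoded_content undoes any transfer encoding and applies the charset.
    return $response->decoded_content;
}

# Only fetch when a URL is given on the command line,
# e.g. perl fetch.pl http://example.com/
if (@ARGV) {
    print fetch_page( $ARGV[0] );
}
```

The returned string is raw HTML, so you would still hand it to one of the parsers mentioned elsewhere in this thread to get at the text you want.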