Agyeya has asked for the wisdom of the Perl Monks concerning the following question:

I would like to ask my fellow monks for guidance. I am currently doing a college assignment on screen scraping. Please suggest some books and/or other online resources with information on screen scraping.

Note: I was using WWW::Mechanize, but was unable to find information on how to execute a POST request through it. Also, the pages I want to scrape contain some JavaScript that has to be worked through in order to reach the final page.
This node could serve as a single collection point for links to screen-scraping resources in Perl.

Replies are listed 'Best First'.
Re: Links for Screen Scraping
by Corion (Patriarch) on May 26, 2004 at 09:02 UTC

    WWW::Mechanize is a subclass of LWP::UserAgent, so all LWP::UserAgent methods are available with WWW::Mechanize too, and posting works the same as with LWP::UserAgent. In most cases, though, the POST request is made by using the WWW::Mechanize click method, or the WWW::Mechanize submit method (if there is no button to click). Reading the WWW::Mechanize documentation should prove helpful when looking for interesting methods.
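    For illustration, a minimal sketch of a form-based POST with WWW::Mechanize; the URL, form name, and field names here are all hypothetical:

        use strict;
        use WWW::Mechanize;

        my $mech = WWW::Mechanize->new();
        $mech->get( 'http://example.com/login' );

        # Fill in and submit the form; submit_form() sends the POST for you.
        # 'login' and the field names are made up for this example.
        $mech->submit_form(
            form_name => 'login',
            fields    => {
                user => 'me',
                pass => 'secret',
            },
        );

        print $mech->content();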

    For some more specialized scraping modules, take a look at the WWW::Search:: and Finance::Bank:: module namespaces.
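    As a taste of the WWW::Search:: interface, a minimal sketch based on its documented synopsis (which backends are still functional varies over time):

        use strict;
        use WWW::Search;

        # 'AltaVista' is the backend named in the WWW::Search synopsis;
        # substitute whichever backend you actually have installed.
        my $search = WWW::Search->new( 'AltaVista' );
        $search->native_query( WWW::Search::escape_query( 'perl monks' ) );

        while ( my $result = $search->next_result() ) {
            print $result->url(), "\n";
        }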

    I haven't had any problems using WWW::Mechanize for all my scraping needs, together with HTML::TableExtract to pull data out of HTML tables afterwards.
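    A minimal HTML::TableExtract sketch, following its documented synopsis; the header names here are hypothetical:

        use strict;
        use HTML::TableExtract;

        # Read the already-fetched page from STDIN for this example.
        my $html = do { local $/; <STDIN> };

        # Match the table whose column headers contain these strings
        # (hypothetical headers for this example).
        my $te = HTML::TableExtract->new( headers => [ 'Name', 'Price' ] );
        $te->parse( $html );

        foreach my $ts ( $te->tables ) {
            foreach my $row ( $ts->rows ) {
                print join( ',', @$row ), "\n";
            }
        }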

    Unless you implement a DOM, you will have to interpret the JavaScript on the pages yourself and convert it to Perl code manually.
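    For the simplest case, a JavaScript redirect, a hedged sketch of that manual conversion; the regex assumes a plain window.location assignment, which real pages may not use:

        use strict;
        use WWW::Mechanize;

        my $mech = WWW::Mechanize->new();
        $mech->get( 'http://example.com/start' );    # hypothetical URL

        # Mechanize won't run JavaScript, so if the page redirects with
        #   window.location = "/next-page.html";
        # extract the target ourselves and follow it. Relative URLs are
        # resolved against the current page by $mech->get().
        my ( $next ) = $mech->content() =~ /window\.location\s*=\s*["']([^"']+)["']/;
        $mech->get( $next ) if defined $next;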

      Sure, you can call the inherited post(), but that doesn't give you the benefits of WWW::Mechanize. WWW::Mechanize keeps your current page, so you can follow links, fill out forms, etc. By calling an inherited function, you cut out the valuable middle man.
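      For comparison, a minimal sketch of that raw inherited call; the URL and form field are hypothetical:

          use strict;
          use WWW::Mechanize;

          my $mech = WWW::Mechanize->new();

          # post() here comes from LWP::UserAgent and returns a plain
          # HTTP::Response; as noted above, it bypasses the page state
          # (current page, links, forms) that WWW::Mechanize maintains.
          my $res = $mech->post( 'http://example.com/search', { query => 'perl' } );
          print $res->content() if $res->is_success();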

      The reason WWW::Mechanize doesn't have (or need) a post method is that browsers also don't have functionality for users to type in their POST requests. POST requests are typically done by submitting forms - and WWW::Mechanize has good functionality for that.

      Abigail

Re: Links for Screen Scraping
by eyepopslikeamosquito (Archbishop) on May 26, 2004 at 08:57 UTC
Re: Links for Screen Scraping
by saskaqueer (Friar) on May 26, 2004 at 09:25 UTC

    Take a look at the HTML parsing modules on CPAN. In the past I have successfully used HTML::Parser to extract data, links, etc. It is actually very easy to use once you get the hang of it. Below I have included a simple module that returns all text snippets found within a document. It would not be hard to modify it to use the POST method (a sketch of that change follows the usage example below). As for your JavaScript problem, I am not sure what you mean by "javascript which has to be parsed through, in order to reach the final page".

    package ExtractText;

    use strict;
    use Exporter ();
    use HTML::Parser;
    use LWP::UserAgent;
    use Carp qw( croak );

    our @ISA       = qw( Exporter );
    our @EXPORT_OK = qw( extract_text );

    sub extract_text {
        shift( @_ ) if ( defined $_[0] and $_[0] eq 'ExtractText' );
        my $uri = shift;
        croak( "Single parameter to extract_text() must be a URI to process" )
            unless defined $uri;

        my $ua  = LWP::UserAgent->new();
        my $res = $ua->get( $uri );
        croak( "Fetch of '$uri' failed: ", $res->status_line() )
            unless $res->code() == 200;

        my $parser = HTML::Parser->new(
            text_h => [ \&_parser_text, 'self,dtext,is_cdata' ],
        );
        $parser->parse( $res->content() );
        $parser->eof();

        return @{ $parser->{_extracted} || [] };
    }

    # Text handler: collect each decoded text chunk, skipping CDATA
    # sections such as the contents of script and style elements.
    sub _parser_text {
        my ( $self, $dtext, $is_cdata ) = @_;
        push( @{ $self->{_extracted} }, $dtext ) unless $is_cdata;
    }

    1;

    Example usage:

    #!/usr/bin/perl -w
    use strict;

    use ExtractText qw( extract_text );

    my @text_snippets = extract_text( 'http://www.perlmonks.org' );
    print @text_snippets;
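    As for the POST modification mentioned above, a hedged sketch: accept an optional hashref of form fields in extract_text() and swap the GET for a POST when it is supplied ($form and its contents are hypothetical):

        # In extract_text(), take a second parameter and replace the GET:
        #     my ( $uri, $form ) = @_;
        # $form is a hypothetical hashref of fields, e.g. { query => 'perl' }.
        my $res = defined $form
            ? $ua->post( $uri, $form )
            : $ua->get( $uri );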
Re: Links for Screen Scraping
by Fletch (Bishop) on May 26, 2004 at 13:29 UTC
Re: Links for Screen Scraping
by perrin (Chancellor) on May 26, 2004 at 18:50 UTC
    Others answered your POST question already. As for the JavaScript, you need to write some Perl code that does the equivalent work to figure out which URL to request. This has been discussed at great length on this site, as you will see if you SuperSearch for JavaScript.