Agyeya has asked for the wisdom of the Perl Monks concerning the following question:

I would like to ask my fellow monks for guidance. I am currently doing a college assignment on screen scraping. Please suggest some books and/or other online resources with information on screen scraping.

Note: I was using WWW::Mechanize, but was unable to find information on how to execute a POST request through it. Also, the pages I want to scrape contain some JavaScript that has to be worked through in order to reach the final page.
This node could serve as a single collection point for links to screen-scraping resources in Perl.

Replies are listed 'Best First'.
Re: Links for Screen Scraping
by Corion (Patriarch) on May 26, 2004 at 09:02 UTC

    WWW::Mechanize is a subclass of LWP::UserAgent, so all LWP::UserAgent methods are available with WWW::Mechanize too, and posting works the same as with LWP::UserAgent. In most cases, though, the POST request is made by using the WWW::Mechanize click method, or the WWW::Mechanize submit method (if there is no button to click). Reading the WWW::Mechanize documentation should prove helpful when looking for interesting methods.
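    For illustration, a minimal sketch of a form-based POST with WWW::Mechanize; the URL, form name, and field names here are all hypothetical:

        use strict;
        use WWW::Mechanize;

        my $mech = WWW::Mechanize->new();
        $mech->get( 'http://example.com/login' );

        # Fill in and submit the form; submit_form() sends the POST for you.
        # 'login' and the field names are made up for this example.
        $mech->submit_form(
            form_name => 'login',
            fields    => {
                user => 'me',
                pass => 'secret',
            },
        );

        print $mech->content();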

    For some more specialized scraping modules, take a look at the WWW::Search:: and Finance::Bank:: module namespaces.
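    As a taste of the WWW::Search:: interface, a minimal sketch based on its documented synopsis (which backends are still functional varies over time):

        use strict;
        use WWW::Search;

        # 'AltaVista' is the backend named in the WWW::Search synopsis;
        # substitute whichever backend you actually have installed.
        my $search = WWW::Search->new( 'AltaVista' );
        $search->native_query( WWW::Search::escape_query( 'perl monks' ) );

        while ( my $result = $search->next_result() ) {
            print $result->url(), "\n";
        }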

    I haven't had any problems using WWW::Mechanize for all my scraping needs, together with HTML::TableExtract to pull data out of HTML tables afterwards.
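    A minimal HTML::TableExtract sketch, following its documented synopsis; the header names here are hypothetical:

        use strict;
        use HTML::TableExtract;

        # Read the already-fetched page from STDIN for this example.
        my $html = do { local $/; <STDIN> };

        # Match the table whose column headers contain these strings
        # (hypothetical headers for this example).
        my $te = HTML::TableExtract->new( headers => [ 'Name', 'Price' ] );
        $te->parse( $html );

        foreach my $ts ( $te->tables ) {
            foreach my $row ( $ts->rows ) {
                print join( ',', @$row ), "\n";
            }
        }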

    Unless you implement a DOM, you will have to interpret the JavaScript on the pages yourself and convert it to Perl code manually.
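    For the simplest case, a JavaScript redirect, a hedged sketch of that manual conversion; the regex assumes a plain window.location assignment, which real pages may not use:

        use strict;
        use WWW::Mechanize;

        my $mech = WWW::Mechanize->new();
        $mech->get( 'http://example.com/start' );    # hypothetical URL

        # Mechanize won't run JavaScript, so if the page redirects with
        #   window.location = "/next-page.html";
        # extract the target ourselves and follow it. Relative URLs are
        # resolved against the current page by $mech->get().
        my ( $next ) = $mech->content() =~ /window\.location\s*=\s*["']([^"']+)["']/;
        $mech->get( $next ) if defined $next;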

      Sure, you can call the inherited post(), but that doesn't give you the benefits of WWW::Mechanize. WWW::Mechanize keeps your current page, so you can follow links, fill out forms, etc. By calling an inherited function, you cut out the valuable middle man.
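      For comparison, a minimal sketch of that raw inherited call; the URL and form field are hypothetical:

          use strict;
          use WWW::Mechanize;

          my $mech = WWW::Mechanize->new();

          # post() here comes from LWP::UserAgent and returns a plain
          # HTTP::Response; as noted above, it bypasses the page state
          # (current page, links, forms) that WWW::Mechanize maintains.
          my $res = $mech->post( 'http://example.com/search', { query => 'perl' } );
          print $res->content() if $res->is_success();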

      The reason WWW::Mechanize doesn't have (or need) a post method is that browsers also don't have functionality for users to type in their POST requests. POST requests are typically done by submitting forms - and WWW::Mechanize has good functionality for that.

      Abigail

Re: Links for Screen Scraping
by eyepopslikeamosquito (Archbishop) on May 26, 2004 at 08:57 UTC
Re: Links for Screen Scraping
by saskaqueer (Friar) on May 26, 2004 at 09:25 UTC

    Take a look at the HTML parsing modules on CPAN. In the past I have successfully used HTML::Parser to extract data, links, etc. It is actually very easy to use once you get the hang of it. Below I have included a simple module that returns all text snippets found within a document. It would not be hard to modify it to use the POST method (a sketch of that change follows the usage example below). As for your JavaScript problem, I am not sure what you mean by "javascript which has to be parsed through, in order to reach the final page".

    package ExtractText;

    use strict;
    use Exporter ();
    use HTML::Parser;
    use LWP::UserAgent;
    use Carp qw( croak );

    our @ISA       = qw( Exporter );
    our @EXPORT_OK = qw( extract_text );

    sub extract_text {
        shift( @_ ) if ( defined $_[0] and $_[0] eq 'ExtractText' );
        my $uri = shift;
        croak( "Single parameter to extract_text() must be a URI to process" )
            unless defined $uri;

        my $ua  = LWP::UserAgent->new();
        my $res = $ua->get( $uri );
        croak( "Fetch of '$uri' failed: ", $res->status_line() )
            unless $res->code() == 200;

        my $parser = HTML::Parser->new(
            text_h => [ \&_parser_text, 'self,dtext,is_cdata' ],
        );
        $parser->parse( $res->content() );
        $parser->eof();

        return @{ $parser->{_extracted} || [] };
    }

    # Text handler: collect each decoded text chunk, skipping CDATA
    # sections such as the contents of script and style elements.
    sub _parser_text {
        my ( $self, $dtext, $is_cdata ) = @_;
        push( @{ $self->{_extracted} }, $dtext ) unless $is_cdata;
    }

    1;

    Example usage:

    #!/usr/bin/perl -w
    use strict;

    use ExtractText qw( extract_text );

    my @text_snippets = extract_text( 'http://www.perlmonks.org' );
    print @text_snippets;
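    As for the POST modification mentioned above, a hedged sketch: accept an optional hashref of form fields in extract_text() and swap the GET for a POST when it is supplied ($form and its contents are hypothetical):

        # In extract_text(), take a second parameter and replace the GET:
        #     my ( $uri, $form ) = @_;
        # $form is a hypothetical hashref of fields, e.g. { query => 'perl' }.
        my $res = defined $form
            ? $ua->post( $uri, $form )
            : $ua->get( $uri );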
Re: Links for Screen Scraping
by Fletch (Bishop) on May 26, 2004 at 13:29 UTC
Re: Links for Screen Scraping
by perrin (Chancellor) on May 26, 2004 at 18:50 UTC
    Others answered your POST question already. As for the JavaScript, you need to write some Perl code that does the equivalent work to figure out which URL to request. This has been discussed at great length on this site, as you will see if you SuperSearch for JavaScript.