PerlMonks  

To mechanize WWW::Mechanize: a scraping language?

by johnnywang (Priest)
on Aug 25, 2004 at 19:33 UTC

Before I heard about WWW::Mechanize, LWP was my favorite module set. I did lots of website scraping with it, mostly for fun (e.g., reading Yahoo Finance stock message boards in the bubble years, getting stats on eBay, etc.). Now I use WWW::Mechanize, which, although a subclass of LWP::UserAgent, is much easier to use. I use it mainly for testing web applications with Test::More and Test::DatabaseRow, and it works great.

In my LWP days, I always wished for a way to describe a scraping session in a file and have a general Perl script execute that description, rather than writing code for each case. I never pursued that. Recently I started thinking about it again, now armed with WWW::Mechanize.

What I'm trying to do is describe a sequence of scraping steps as, for example:

    <mechanize>
      <get url="http://perlmonks.org/index.pl?node=login" output="login.html"/>
      <submit form_name="login" user="" passwd="" button="login" output="index.html"/>
      <get url="http://www.perlmonks.org/index.pl?node=Newest Nodes" output="newest.html"/>
    </mechanize>

Then a driver program parses this and takes the appropriate actions. The advantage is at least to avoid coding, and also to allow a non-Perl programmer or a non-programmer to do scraping. The following is a very preliminary start (e.g., many commands are hardcoded); I'm posting it here first to see whether something like this already exists, and to seek your advice and comments. For example, XML doesn't seem to be the right language here, since scraping is not usually hierarchical; I'm using XML just to avoid doing my own parsing.

My simple driver program is as follows:

    use strict;
    use warnings;
    use WWW::Mechanize;
    use XML::SAX;
    use CmdHandler;

    # stop here if no input file was given
    die "Need to pass an input file name\n" unless @ARGV;

    my $agent  = WWW::Mechanize->new;
    my $parser = XML::SAX::ParserFactory->parser(
        Handler => CmdHandler->new($agent),
    );
    $parser->parse_uri($ARGV[0]);
    exit(0);

where CmdHandler.pm is as follows:

    package CmdHandler;

    use strict;
    use warnings;
    use base qw(XML::SAX::Base);

    sub new {
        my $class = shift;
        my $self  = $class->SUPER::new();
        $self->_init(@_);
        return $self;
    }

    sub _init {
        my ($self, $agent) = @_;
        $self->{agent} = $agent;
    }

    sub start_element {
        my ($self, $el) = @_;
        my $name = $el->{Name};
        print "Processing start_element:$name\n";
        return if $name eq "mechanize";

        # put all attributes in a hash, is there a better way?
        my %params = map { $_->{Name} => $_->{Value} } values %{ $el->{Attributes} };

        # "output" is for this driver, not for WWW::Mechanize
        my $file  = delete $params{output};
        my $agent = $self->{agent};

        # well, some ugly hardcoded if-else, a better way?
        if ($name eq "get") {
            $agent->get($params{url});
        }
        elsif ($name eq 'submit') {
            # submit_form() (not submit()) accepts form_name/button/fields;
            # the remaining attributes are treated as form fields
            my $form_name = delete $params{form_name};
            my $button    = delete $params{button};
            $agent->submit_form(
                form_name => $form_name,
                button    => $button,
                fields    => \%params,
            );
        }
        elsif ($name eq 'back') {
            $agent->back();
        }
        elsif ($name eq 'follow_link') {
            # pass only the criteria that were given; url_regex must be a regex
            my %link = map { $_ => $params{$_} } grep { defined $params{$_} } qw(n text);
            $link{url_regex} = qr/$params{url_regex}/ if defined $params{url_regex};
            $agent->follow_link(%link);
        }
        else {
            print "Hey, don't know what you mean, maybe in the next version.\n";
        }

        # maybe we want to print out to a file?
        if ($file) {
            if ($file eq "stdout") {
                print $agent->content();
            }
            elsif ($file ne "none") {
                # only write if the file could actually be opened
                if (open(my $out, '>', $file)) {
                    print {$out} $agent->content();
                    close($out);
                }
                else {
                    warn "Can't open $file for writing: $!\n";
                }
            }
        }
        return $self->SUPER::start_element($el);
    }

    1;
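On the "ugly hardcoded if-else, a better way?" question in the handler: one common Perl idiom is a dispatch table, a hash that maps each command name to a code reference. The following is only a minimal sketch of that idea, not part of the handler above. FakeAgent is a hypothetical stand-in that merely records calls so the example runs without a network, and run_command and %dispatch are illustrative names; a real driver would presumably pass a WWW::Mechanize object instead.

```perl
use strict;
use warnings;

# Hypothetical stand-in for WWW::Mechanize: it only records which
# methods were called, so the sketch runs offline.
package FakeAgent;
sub new         { my $class = shift; return bless { calls => [] }, $class }
sub get         { my $self = shift; push @{ $self->{calls} }, [ 'get', @_ ]; return }
sub back        { my $self = shift; push @{ $self->{calls} }, [ 'back', @_ ]; return }
sub submit_form { my $self = shift; push @{ $self->{calls} }, [ 'submit_form', @_ ]; return }

package main;

# Each command name maps to a code ref taking ($agent, \%params).
my %dispatch = (
    get => sub {
        my ($agent, $p) = @_;
        $agent->get($p->{url});
    },
    back => sub {
        my ($agent) = @_;
        $agent->back();
    },
    submit => sub {
        my ($agent, $p) = @_;
        my %fields    = %$p;                      # copy, so the caller's hash is untouched
        my $form_name = delete $fields{form_name};
        my $button    = delete $fields{button};
        $agent->submit_form(
            form_name => $form_name,
            button    => $button,
            fields    => \%fields,
        );
    },
);

# Look the command up; return false for unknown commands instead of dying.
sub run_command {
    my ($agent, $name, $params) = @_;
    my $handler = $dispatch{$name} or return 0;
    $handler->($agent, $params);
    return 1;
}

my $agent = FakeAgent->new;
run_command($agent, 'get',    { url => 'http://perlmonks.org/index.pl?node=login' });
run_command($agent, 'submit', { form_name => 'login', button => 'login', user => '', passwd => '' });
run_command($agent, 'frobnicate', {}) or print "unknown command: frobnicate\n";

# Show the method names the agent actually received.
print join(' ', map { $_->[0] } @{ $agent->{calls} }), "\n";
```

With this shape, supporting a new command means adding one entry to the hash rather than another elsif branch, and the lookup itself tells you whether a command is known.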

Replies are listed 'Best First'.
Re: To mechanize WWW::Mechanize: a scraping language?
by Corion (Patriarch) on Aug 25, 2004 at 19:54 UTC

    In addition to perrin's recommendation of webchat, I can offer my own module, WWW::Mechanize::Shell, which is a more or less useful generator for WWW::Mechanize scripts. People also use WWW::Mechanize::Shell scripts as standalone scripts, and simple automated extraction from tables and batch downloads are easily possible with it.

    What WWW::Mechanize::Shell doesn't do, and never will do, is loops and control structures: for anything more complicated, you should use WWW::Mechanize driven through Perl, instead of WWW::Mechanize driven through some ad-hoc shell language.

Re: To mechanize WWW::Mechanize: a scraping language?
by chromatic (Archbishop) on Aug 25, 2004 at 19:47 UTC
    The advantage is at least to avoid coding

    No, it isn't. People still have to learn the syntax and semantics of a domain-specific language to make this work.

    There may be good reasons to make a domain-specific language and there may be good reasons to use a DSL instead of Perl, but don't fool yourself into thinking that using a DSL means that people can avoid programming or that you can easily avoid syntactical and semantic mistakes from non-programmers.

    If you do pursue this line of thinking, please consider that XML is a terrible thing to inflict upon humans.

Re: To mechanize WWW::Mechanize: a scraping language?
by Jenda (Abbot) on Aug 25, 2004 at 22:22 UTC

    Why create a new language? Let's use Perl!

    You may create a module that will work as a wrapper around WWW::Mechanize. It will create the object and do all the other preliminary stuff, provide you with helper functions, and through AUTOLOAD allow you to call the $agent's methods as functions. So you end up with something like

        use WWW::Mechanize::Simple;

        get url => "http://perlmonks.org/index.pl?node=login",
            output => "login.html";
        submit form_name => "login", user => "", passwd => "",
            button => "login", output => "index.html";
        get url => "http://www.perlmonks.org/index.pl?node=Newest Nodes",
            output => "newest.html";

    Ain't that simple enough? And it should not be that hard to implement: you just need to initialize the object, create a few functions and use AUTOLOAD to pass the others to the $agent as methods. And that'll be it.
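    The AUTOLOAD forwarding described here could look roughly like the sketch below. It is only an illustration under stated assumptions: FakeAgent is a hypothetical stand-in that records calls so the example runs offline (a real wrapper would presumably create a WWW::Mechanize object), and everything lives in one file instead of a separate module.

```perl
use strict;
use warnings;

# Hypothetical stand-in agent; it only logs what it was asked to do.
package FakeAgent;
sub new  { my $class = shift; return bless { log => [] }, $class }
sub get  { my ($self, $url) = @_; push @{ $self->{log} }, "get $url"; return 1 }
sub back { my ($self) = @_; push @{ $self->{log} }, "back"; return 1 }

package main;

our $AUTOLOAD;
my $agent = FakeAgent->new;

# Any undefined function called in this package lands here and is
# forwarded to the agent as a method call of the same name.
sub AUTOLOAD {
    my $method = $AUTOLOAD;
    $method =~ s/.*:://;                  # strip the package prefix
    return if $method eq 'DESTROY';
    die "The agent has no method '$method'\n" unless $agent->can($method);
    return $agent->$method(@_);
}

# Plain function calls; neither get() nor back() is defined in main.
get('http://perlmonks.org/index.pl?node=login');
back();

print "$_\n" for @{ $agent->{log} };
```

    Note that the calls use parentheses here. Calling the functions without them, as in the example above, would additionally need the names declared before use, for instance by having the wrapper module export them.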

    Jenda
    Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.
       -- Rick Osborne

      This is exactly what I did when our DBA came to me and asked me for a load tester, to determine which parameter changes were best, given a specified set of SQL statements and the order they should run in.

      ------
      We are the carpenters and bricklayers of the Information Age.

      Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

      I shouldn't have to say this, but any code, unless otherwise stated, is untested

Re: To mechanize WWW::Mechanize: a scraping language?
by perrin (Chancellor) on Aug 25, 2004 at 19:43 UTC
Re: To mechanize WWW::Mechanize: a scraping language?
by zby (Vicar) on Aug 26, 2004 at 07:38 UTC
    As an alternative to writing the scripts by hand, you might look at HTTP::Recorder, a tool for recording interactive web sessions as scripts (the language used is Perl itself). There is an article on perl.com about it: Web Testing with HTTP::Recorder by Linda Julien.
Re: To mechanize WWW::Mechanize: a scraping language?
by mojotoad (Monsignor) on Aug 25, 2004 at 22:43 UTC
    You might want to check out Compaq's WebL language. Though implemented in Java, the source is available. Perhaps a perl translator between WebL and WWW::Mechanize?

    Matt

      I just had a look at it, and I must say:
        Yuck!!!
      It looks a bit like the worst of Pascal and ECMAScript rolled into one.

      Why not stay with Perl for doing things like this? You will probably end up needing a full language to do anything serious, and then you will have to reinvent Perl!

        I agree that it's not pretty -- perl is better suited for most, if not all, of the tasks it accomplishes.

        What I had in mind was leveraging existing WebL scripts, or translating them. Also, if you look at the problem space that WebL addresses, it serves as a good source of signposts for the sort of constructs you'd want for a general-purpose web-weasel language.

        Matt

Node Type: perlmeditation [id://385785]
Approved by Arunbear
Front-paged by grinder