Before I heard about WWW::Mechanize, LWP was my favorite module set. I did lots of website scraping with it, mostly for fun (e.g., reading Yahoo Finance stock message boards in the bubble years, getting stats on eBay, etc.). Now I use WWW::Mechanize, which, although a subclass of LWP::UserAgent, is much easier to use. I use it mainly for testing web applications with Test::More and Test::DatabaseRow, and it works great.

In my LWP days, I always wished for a way to describe a scraping session in a file and have a general Perl script execute that description, rather than coding each case by hand. I never pursued that. Recently I started thinking about it again, now armed with WWW::Mechanize.

What I'm trying to do is describe a sequence of scraping steps as, for example:

    <mechanize>
      <get url="http://perlmonks.org/index.pl?node=login" output="login.html"/>
      <submit form_name="login" user="" passwd="" button="login" output="index.html"/>
      <get url="http://www.perlmonks.org/index.pl?node=Newest Nodes" output="newest.html"/>
    </mechanize>

The idea is then to have a driver program parse this and take the appropriate actions. The advantage is at least to avoid coding for each case, and also to let a non-Perl user or a non-programmer do scraping. The following is a very preliminary start (e.g., many commands are hardcoded). I'm posting it here first to see whether something like this already exists, and to seek your advice and comments. For example, XML doesn't seem to be the right language here, since scraping is not usually hierarchical; I'm using XML just to avoid doing my own parsing.
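
Since the steps are really a flat list, a line-oriented description might be a lighter alternative to XML. What follows is only a sketch under assumptions: the one-step-per-line "command key=value ..." format and the parse_steps name are hypothetical and not part of WWW::Mechanize. Quoting is delegated to the core Text::ParseWords module, so the sketch still avoids hand-written parsing.

```perl
use strict;
use warnings;
use Text::ParseWords qw(shellwords);

# Hypothetical format: one step per line, e.g.
#   get url="http://example.com/" output=home.html
# Returns a list of [ $command, \%params ] pairs.
sub parse_steps {
    my ($text) = @_;
    my @steps;
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*(?:#|$)/;    # skip comments and blank lines
        # shellwords() splits on whitespace and honours quotes
        my ($command, @pairs) = shellwords($line);
        my %params;
        for my $pair (@pairs) {
            # split on the first "=" only, so URLs may contain "="
            my ($key, $value) = split /=/, $pair, 2;
            $params{$key} = defined $value ? $value : '';
        }
        push @steps, [ $command, \%params ];
    }
    return @steps;
}

my @steps = parse_steps(<<'END');
# log in, then fetch the newest nodes
get url="http://perlmonks.org/index.pl?node=login" output=login.html
submit form_name=login user="" passwd="" button=login output=index.html
get url="http://www.perlmonks.org/index.pl?node=Newest Nodes" output=newest.html
END

# Show what was parsed; a real driver would call the agent here instead.
for my $step (@steps) {
    my ($command, $params) = @$step;
    print join(' ', $command, map { "$_=$params->{$_}" } sort keys %$params), "\n";
}
```

Each parsed step carries the same information as one XML element above (a command name plus a hash of attributes), so the same handler code could consume either format.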

My simple driver program is as follows:

    use strict;
    use warnings;
    use WWW::Mechanize;
    use XML::SAX;
    use CmdHandler;

    # stop here if no description file was given
    die "Need to pass an input file name\n" unless @ARGV;

    my $agent  = WWW::Mechanize->new;
    my $parser = XML::SAX::ParserFactory->parser(
        Handler => CmdHandler->new($agent),
    );
    $parser->parse_uri($ARGV[0]);
    exit(0);

CmdHandler.pm is as follows:

    package CmdHandler;

    use strict;
    use warnings;
    use base qw(XML::SAX::Base);

    sub new {
        my $class = shift;
        my $self  = $class->SUPER::new();
        $self->_init(@_);
        return $self;
    }

    sub _init {
        my ($self, $agent) = @_;
        $self->{agent} = $agent;
    }

    sub start_element {
        my ($self, $el) = @_;
        my $name = $el->{Name};
        print "Processing start_element:$name\n";
        return if $name eq "mechanize";

        # put all attributes in a hash, is there a better way?
        my %params = ();
        foreach my $k (values %{ $el->{Attributes} }) {
            $params{ $k->{Name} } = $k->{Value};
        }

        # well, some ugly hardcoded if-else, a better way?
        my $agent = $self->{agent};
        if ($name eq "get") {
            $agent->get($params{url});
        } elsif ($name eq 'submit') {
            # submit_form() is the method that accepts form_name/fields/button;
            # pass only real form fields, not the control attributes
            my %fields = %params;
            delete @fields{qw(form_name button output)};
            $agent->submit_form(
                form_name => $params{form_name},
                button    => $params{button},
                fields    => \%fields,
            );
        } elsif ($name eq 'back') {
            $agent->back();
        } elsif ($name eq 'follow_link') {
            # pass only the criteria that were given; url_regex must be a compiled regex
            my %link = map { ($_ => $params{$_}) } grep { defined $params{$_} } qw(n text);
            $link{url_regex} = qr/$params{url_regex}/ if defined $params{url_regex};
            $agent->follow_link(%link);
        } else {
            print "Hey, don't know what you mean, maybe in the next version.\n";
        }

        # maybe we want to print out to a file?
        my $file = $params{output};
        if ($file) {
            if ($file eq "stdout") {
                print $agent->content();
            } elsif ($file eq "none") {
                # explicitly no output
            } elsif (open(my $out, '>', $file)) {
                print {$out} $agent->content();
                close($out);
            } else {
                warn "Can't open $file for writing: $!\n";
            }
        }
        return $self->SUPER::start_element($el);
    }

    1;
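
On the "ugly hardcoded if-else" question in start_element: one common Perl idiom is a dispatch table, a hash mapping each element name to a code reference. The following is a minimal sketch, not a drop-in replacement. The names %dispatch, run_command and RecordingAgent are hypothetical, and RecordingAgent is only a stand-in that records calls so the example runs without WWW::Mechanize or a network connection.

```perl
use strict;
use warnings;

# Map each element name to a handler taking ($agent, \%params).
# How the attributes are forwarded is an assumption of this sketch.
my %dispatch = (
    get         => sub { my ($agent, $p) = @_; $agent->get($p->{url}) },
    back        => sub { my ($agent)     = @_; $agent->back },
    follow_link => sub {
        my ($agent, $p) = @_;
        # forward only the attributes that were actually supplied
        $agent->follow_link(map { ($_ => $p->{$_}) } grep { defined $p->{$_} } qw(n text));
    },
);

# Look up and run one command; returns 1 if handled, 0 if unknown.
sub run_command {
    my ($agent, $name, $params) = @_;
    my $handler = $dispatch{$name};
    unless ($handler) {
        warn "Unknown command: $name\n";
        return 0;
    }
    $handler->($agent, $params);
    return 1;
}

# Hypothetical stand-in for WWW::Mechanize: it only records method calls.
{
    package RecordingAgent;
    sub new         { return bless { calls => [] }, shift }
    sub get         { my $s = shift; push @{ $s->{calls} }, [ 'get',         @_ ] }
    sub back        { my $s = shift; push @{ $s->{calls} }, [ 'back',        @_ ] }
    sub follow_link { my $s = shift; push @{ $s->{calls} }, [ 'follow_link', @_ ] }
}

my $agent = RecordingAgent->new;
run_command($agent, 'get',         { url  => 'http://example.com/' });
run_command($agent, 'follow_link', { text => 'next', n => 2 });
run_command($agent, 'back',        {});
run_command($agent, 'frobnicate',  {});    # unknown: warns and returns 0

# List the calls the agent received, one per line.
print join(' ', @$_), "\n" for @{ $agent->{calls} };
```

With this layout, supporting a new command means adding one entry to the hash rather than another elsif branch, and start_element would shrink to building %params and calling run_command.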

In reply to To mechanize WWW::Mechanize: a scraping language? by johnnywang
