WWW::Mechanize Basics

by PerlSufi (Friar)
on Jun 06, 2013 at 19:49 UTC ( #1037506=perltutorial )

Hello Monks, I wanted to write a basic how-to on using WWW::Mechanize, as suggested in Tutorial Quest. I will give a basic overview of how to log in to a website. One DON'T that I will state right off the bat to save future frustration: WWW::Mechanize DOES NOT SUPPORT JAVASCRIPT.

One of my first tasks at my job was to write a crawler that logged into a website and downloaded some account information. I will show that portion here. Some other tools make working with Mechanize much easier: Firebug (or some other web page inspector) and Live HTTP Headers. For this project, I really only needed Firebug. You need an inspector to find the names and values of the parts of the page you are trying to access.

You can also set the user agent string to one of several browser aliases. In this example I did not set it, but you can do so with $mech->agent_alias($alias);.
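As a small illustration of the agent_alias call mentioned above, here is a minimal sketch that lists the alias names the installed WWW::Mechanize knows and applies one of them. Taking the first alias from the list is an arbitrary choice for the example, not something the original crawler did.

```perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();

# List the alias names this installed version of WWW::Mechanize knows.
my @aliases = WWW::Mechanize->known_agent_aliases();
print "Known aliases: ", join( ', ', @aliases ), "\n";

# Arbitrary choice for illustration: any name from the list works.
my $alias = $aliases[0];
$mech->agent_alias($alias);

# Show the User-Agent string that will be sent with each request.
print "User-Agent: ", $mech->agent, "\n";
```

Some sites serve different pages depending on the User-Agent, so it can be worth trying an alias when a page looks different in your browser than it does to Mechanize.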
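use strict;
use warnings;
use WWW::Mechanize;

# autocheck is on by default and makes failed requests die;
# turn it off so the success() check below can report failures.
my $mech = WWW::Mechanize->new( autocheck => 0 );
my $url  = "";

$mech->get($url);
$mech->follow_link( url => '' );

if ( $mech->success() ) {
    print "Successful Connection\n";
}
else {
    print "Not a successful connection\n";
}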
You will notice that I used an if statement to verify that the request succeeded. The $mech->success method is very useful for checking whether the last request went through OK. From what I have learned so far, it is good practice to give yourself some kind of verification that what you did worked. You can also look at the returned page directly by putting:
print $mech->content;
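Building on the success check above, here is a hedged sketch that reports a little more detail about each request. The helper name describe_fetch is made up for this example, and a local file that does not exist stands in for a broken URL so the sketch can run without a network connection.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# autocheck => 0 so a failed request returns instead of dying.
my $mech = WWW::Mechanize->new( autocheck => 0 );

# Hypothetical helper: fetch a URL and summarize how it went.
sub describe_fetch {
    my ( $mech, $url ) = @_;
    $mech->get($url);
    return $mech->success
        ? sprintf( "OK %d, %d bytes", $mech->status, length( $mech->content ) )
        : sprintf( "FAILED: %s", $mech->res->status_line );
}

# A local file that does not exist stands in for a broken link here.
print describe_fetch( $mech, 'file:///no/such/page.html' ), "\n";
```

Printing the status line on failure tells you whether you hit a missing page, a redirect problem, or a server error, which a bare true/false check does not.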
The $mech->dump_* methods are very useful for debugging or for finding out more about the page you last accessed. Use them frequently. There are dump_forms, dump_text, dump_links, and others.

The next thing I had to do was enter the username and password, plus the start and end dates for the report I wanted to receive. I did it with the following:
# This block of code is intended to fill in the required forms.
# $start_date and $end_date are set elsewhere in the script.
$mech->get("");
my $usr = "username";
my $pw  = "password";

$mech->form_number(1);
$mech->field( "capsn", $usr );
$mech->form_number(2);
$mech->field( "capsp", $pw );
$mech->form_number(3);
$mech->field( "startdate", $start_date );
$mech->form_number(4);
$mech->field( "enddate", $end_date );

# click() submits the currently selected form.
$mech->click();
Here I had to inspect the page with Firebug to find the name of each field (the quoted strings in my script) and set each one to the variable I had declared. The click method did not need the button name specified, though you may have to supply it sometimes. Yes, this site used SSL, and no, I did not need to do anything special to log in to it this time. However, I have had to crawl another website using SSL where I did need to do something special. This is what I had to do:
use WWW::Mechanize;
use IO::Socket::SSL qw();

# Turn off certificate and hostname verification for this client.
my $mech = WWW::Mechanize->new(
    ssl_opts => {
        SSL_verify_mode => IO::Socket::SSL::SSL_VERIFY_NONE,
        verify_hostname => 0,
    },
);
With these options, SSL certificate verification is turned off. The start and end dates in the form example actually took a little more work, using a different module, DateTime. I can get into that later.

Newbies to this module should keep in mind that Mechanize DOES NOT interpret JavaScript. The only way around this that I have found so far is to use Live HTTP Headers to inspect the HTTP requests your browser makes as you navigate through the site. Where there is a GET, use $mech->get($url); where there is a POST, use $mech->post($url). I have successfully navigated a JavaScript-heavy web page using this method, but it is extremely tedious. If you have a CHOICE, use WWW::Mechanize::Firefox, WWW::Selenium, or some other module that interprets JavaScript.
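To make the replay approach above a bit more concrete, here is a rough sketch. It assumes a header capture showed one GET followed by one POST; the helper name, the paths, and the form fields and values are all placeholders I made up, not taken from a real site. The requests only run when you pass a base URL on the command line.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Hypothetical replay of two requests seen in a header capture.
# The paths and form fields are placeholders, not from a real site.
sub replay_captured_requests {
    my ( $mech, $base ) = @_;
    my @log;

    # The capture showed: GET <base>/report
    $mech->get("$base/report");
    push @log, [ 'GET', "$base/report", $mech->status ];

    # The capture showed: POST <base>/report/data with two form fields
    $mech->post(
        "$base/report/data",
        { startdate => '2013-06-01', enddate => '2013-06-06' },
    );
    push @log, [ 'POST', "$base/report/data", $mech->status ];

    return @log;
}

# Run only when a base URL is given on the command line.
if (@ARGV) {
    my $mech = WWW::Mechanize->new( autocheck => 0 );
    printf "%-4s %s -> %s\n", @$_
        for replay_captured_requests( $mech, $ARGV[0] );
}
```

Logging the status after every step helps when a long chain of replayed requests breaks somewhere in the middle.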

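If you want to try the dump_* methods and form filling described earlier without a live site, the following sketch may help. The login page here is made up (only the field names capsn and capsp are borrowed from my script), and it is loaded from a temporary local file. The last step builds the request the form would send but does not send it.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use URI::file;
use WWW::Mechanize;

# A made-up login page standing in for the real site.
my $html = <<'HTML';
<html><body>
<form method="post" action="http://example.com/login">
  <input type="text" name="capsn">
  <input type="password" name="capsp">
  <input type="submit" name="go" value="Log in">
</form>
<a href="http://example.com/help">Help</a>
</body></html>
HTML

# Write the page to a temporary .html file so Mechanize can load it
# through a file:// URL without touching the network.
my ( $fh, $path ) = tempfile( SUFFIX => '.html', UNLINK => 1 );
print {$fh} $html;
close $fh;

my $mech = WWW::Mechanize->new();
$mech->get( URI::file->new_abs($path)->as_string );

# Print every form with its field names, then every link on the page.
$mech->dump_forms();
$mech->dump_links();

# Select the first form and fill in the two fields by name.
$mech->form_number(1);
$mech->field( capsn => 'username' );
$mech->field( capsp => 'password' );

# Build the request the form would send, without actually sending it.
my $request = $mech->current_form->click;
print $request->method, ' ', $request->uri, "\n", $request->content, "\n";
```

Looking at the built request is a handy way to compare what Mechanize would submit against what Firebug shows your browser submitting.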