bauer1sc has asked for the wisdom of the Perl Monks concerning the following question:

I'm new to Perl and trying to create a web spider to pull specific data. The first problem I am having is with selecting all the options of a select box. For the time being I just selected the first entry, and I am able to pull the page contents. I then stripped the HTML tags off, but I can't seem to figure out how to split off the information I don't need from the page and get rid of the white space. Here is my code:
use strict;
use WWW::Mechanize;
use Time::Local;
use POSIX 'strftime';
use HTML::Strip;
#use Whitespace;

my($url)       = 'http://www.blankurl.biz';
my($pageCheck) = "";
my $mech       = WWW::Mechanize->new(autocheck => 1);
my $hs         = HTML::Strip->new();

$mech->agent_alias('Linux Mozilla');
$mech->get($url) or die "Page $url can't be reached";
print "Made it past the url test";

my($form) = $mech->forms();
$mech->field("lstLeedRating", "5, 8");

$pageCheck = $mech->click_button(name => "btnSearch");
if ($pageCheck->is_success) {
    print $pageCheck->content;
}
else {
    print STDERR $pageCheck->status_line, "\n";
    die "Page with those fields not found!";
}

my $page       = $mech->content;
my $clean_text = $hs->parse($page);
$hs->eof;
print $clean_text;

open(FH, ">test.txt");
close(FH);
I greatly appreciate any and all help anyone may have to offer. Thanks again!
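For reference, from reading the WWW::Mechanize docs I suspect the multi-select needs something along the lines of the snippet below, but I have not been able to get it working yet (the form number and the use of select() here are guesses on my part):

my $form = $mech->form_number(1);                       # guessing the search form is the first one
my $list = $form->find_input('lstLeedRating');          # the multi-select box
my @all  = grep { defined } $list->possible_values();   # every <option> value
$mech->select('lstLeedRating', \@all);                  # select() takes an array ref for multi-selects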

Replies are listed 'Best First'.
Re: Web Spider problem
by rpanman (Scribe) on Jul 11, 2007 at 15:51 UTC
    Hi bauer1sc

    When I run the code above I cannot access www.blankurl.biz; it does not appear to exist. I think it will be difficult to assess the problems you are having with the form without being able to visit the web page...

    As far as the stripping of HTML goes, it works fine on Google using the code below:
    use strict;
    use WWW::Mechanize;
    use HTML::Strip;

    my($url) = 'http://www.google.co.uk';
    my $mech = WWW::Mechanize->new(autocheck => 1);
    my $hs   = HTML::Strip->new();

    $mech->agent_alias('Linux Mozilla');
    $mech->get($url) or die "Page $url can't be reached";
    print "Made it past the url test";

    my $page       = $mech->content;
    my $clean_text = $hs->parse( $page );
    $hs->eof;
    print $clean_text;
    The output is:
    Made it past the url test iGoogle | Sign in Web Images News Maps New! Products Groups Scholar more » Advanced Search Preferences Language Tools Search: the web pages from the UK Advertising Programmes - Business Solutions - About Google - Go to Google.com ©2007 Google
    That is the page contents with all the HTML stripped out. What are you expecting?
      I took out the URL; I didn't know whether it was a problem or not. Below is the code and the output. I fixed the problem I was having with selecting all the options in the select box.
      use strict;
      use WWW::Mechanize;
      use Time::Local;
      use POSIX 'strftime';
      use HTML::Strip;
      #use Whitespace;

      my($url) = 'http://www.usgbc.org/LEED/Project/RegisteredProjectList.aspx?CMSPageID=243&CategoryID=19&';
      my($pageCheck)  = "";
      my $mech        = WWW::Mechanize->new(autocheck => 1);
      my $hs          = HTML::Strip->new();
      my($linkName)   = 'dgRegProjList$_ctl29$_ctl';
      my($linkNumber) = 1;

      $mech->agent_alias('Linux Mozilla');
      $mech->get($url) or die "Page $url can't be reached";
      print "Made it past the url test";

      my($form) = $mech->forms();
      my $menu  = $form->find_input("lstLeedRating");
      print for $menu->possible_values();

      $pageCheck = $mech->click_button(name => "btnSearch");
      if ($pageCheck->is_success) {
          print $pageCheck->content;
      }
      else {
          print STDERR $pageCheck->status_line, "\n";
          die "Page with those fields not found!";
      }

      my $page       = $mech->content;
      my $clean_text = $hs->parse($page);
      $hs->eof;
      print $clean_text;

      for ($linkNumber = 2; $linkNumber <= 187; $linkNumber++) {
          $pageCheck = $mech->click_button(name => 'dgRegProjList$_ctl29$_ctl' . $linkNumber);
          if ($pageCheck->is_success) {
              print $pageCheck->content;
          }
          else {
              print STDERR $pageCheck->status_line, "\n";
              die "Page with those fields not found!";
          }
          my $page       = $mech->content;
          my $clean_text = $hs->parse($page);
          $hs->eof;
          print $clean_text;
      }
      I get results
      entire page
      #Then the part I want....
      Project Name Owner City State Country LEED Rating System
      #the businesses listed under that, and I was hoping to get rid of the white space.
      But the output I'm looking for is just the businesses, with their info separated by commas, and at the end of each page an advance to the next page. With the loop that I thought would work I get "No clickable input with name dgRegProjList$_ctl29_ctl2 ...", which is the name of the link on the page. Thanks again for your help.
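      For the white space part, my best guess so far is a regex clean-up on the text that comes back from HTML::Strip, something like the rough sketch below, though I do not know if that is the right approach:

      my $clean_text = $hs->parse($page);
      $hs->eof;

      $clean_text =~ s/^[ \t]+|[ \t]+$//mg;   # trim leading/trailing blanks on each line
      $clean_text =~ s/[ \t]+/ /g;            # squeeze runs of spaces and tabs
      $clean_text =~ s/\n{2,}/\n/g;           # drop empty lines
      print $clean_text;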
        Okay, it looks like the reason you cannot click the link is that it is handled by Javascript. WWW::Mechanize does not interpret Javascript, so you will have to do this yourself.

        There is an (oft-referred-to) thread on this which may be of some help: How to handle javascript.
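        If you do want to try it without a full Javascript engine: link buttons on ASP.NET pages usually just call __doPostBack(target, argument), which fills in two hidden fields and resubmits the form, and you can often mimic that by hand. This is an untested sketch, assuming the page really does use __doPostBack and already carries the __EVENTTARGET / __EVENTARGUMENT hidden fields (check the page source first):

        # mimic ASP.NET's __doPostBack() without Javascript
        sub asp_postback {
            my ($mech, $target, $argument) = @_;
            $argument = '' unless defined $argument;

            my $form = $mech->form_number(1);    # the single server-side form
            for my $pair (['__EVENTTARGET', $target], ['__EVENTARGUMENT', $argument]) {
                my ($name, $value) = @$pair;
                my $input = $form->find_input($name)
                    or die "No hidden field '$name' on this page";
                $input->readonly(0);             # hidden fields are read-only by default
                $input->value($value);
            }
            return $mech->submit();
        }

        # e.g. for the pager link mentioned above (the exact target string is
        # whatever the page's Javascript passes to __doPostBack):
        # my $response = asp_postback($mech, 'dgRegProjList$_ctl29$_ctl2');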

        Also, just to make you aware, you may want to look at the Terms and Conditions on the site. I had a quick peek and noticed the following:
        Unless otherwise specified hereunder, You may not sell, rent, modify, reproduce, display, distribute, redistribute, republicize, retransmit, participate in the transfer or sale, create derivative works, or in any way exploit or otherwise use the Site Content, in whole or in part, in any way without the respective owner's prior written consent.
        Pretty heavy huh?

        Check out the Ethics of Webbots for some perlmonks opinions.