in reply to Web Spider problem

Hi bauer1sc

When I run the code above I cannot access www.blankurl.biz; it does not appear to exist. I think it will be difficult to assess the problems you are having with the form without being able to visit the web page...

As far as the stripping of HTML goes, it works fine on Google using the code below:
    use strict;
    use WWW::Mechanize;
    use HTML::Strip;

    my($url) = 'http://www.google.co.uk';
    my $mech = WWW::Mechanize->new(autocheck =>1);
    my $hs = HTML::Strip->new();

    $mech->agent_alias('Linux Mozilla');
    $mech->get($url) or die "Page $url can't be reached";
    print "Made it past the url test";

    my $page = $mech->content;
    my $clean_text = $hs->parse( $page );
    $hs->eof;
    print $clean_text;
The output is:
    Made it past the url test iGoogle | Sign in Web Images News Maps New! Products Groups Scholar more » Advanced Search Preferences Language Tools Search: the web pages from the UK Advertising Programmes - Business Solutions - About Google - Go to Google.com ©2007 Google
Which is the page contents with all the HTML stripped out. What are you expecting?

Re^2: Web Spider problem
by bauer1sc (Initiate) on Jul 11, 2007 at 16:48 UTC
    I took out the URL because I didn't know if it was a problem or not. Below are the code and the output. I fixed the problem I was having with selecting all the objects in the object box.
        use strict;
        use WWW::Mechanize;
        use Time::Local;
        use POSIX 'strftime';
        use HTML::Strip;
        #use Whitespace;

        my($url) = 'http://www.usgbc.org/LEED/Project/RegisteredProjectList.aspx?CMSPageID=243&CategoryID=19&';
        my($pageCheck) = "";
        my $mech = WWW::Mechanize->new(autocheck =>1);
        my $hs = HTML::Strip->new();
        my($linkName) = 'dgRegProjList$_ctl29$_ctl';
        my($linkNumber) = 1;

        $mech->agent_alias('Linux Mozilla');
        $mech->get($url) or die "Page $url can't be reached";
        print "Made it past the url test";

        my($form) = $mech->forms();
        my $menu = $form->find_input("lstLeedRating");
        print for $menu->possible_values();

        $pageCheck = $mech->click_button(name => "btnSearch");
        if($pageCheck->is_success){
            print $pageCheck->content;
        } else {
            print STDERR $pageCheck->status_line, "\n";
            die "Page with those fields not found!";
        }

        my $page = $mech->content;
        my $clean_text = $hs->parse( $page );
        $hs->eof;
        print $clean_text;

        for($linkNumber= 2; $linkNumber <= 187; $linkNumber++) {
            $pageCheck = $mech->click_button(name => 'dgRegProjList$_ctl29$_ctl'.$linkNumber);
            if($pageCheck->is_success){
                print $pageCheck->content;
            } else {
                print STDERR $pageCheck->status_line, "\n";
                die "Page with those fields not found!";
            }
            my $page = $mech->content;
            my $clean_text = $hs->parse( $page );
            $hs->eof;
            print $clean_text;
        }
    I get results
        entire page
        #Then the part I want....
        Project Name Owner City State Country LEED Rating System
        #The businesses are listed under this, and I was hoping to get rid of the white space.
    But the output I'm looking for is each business with its info separated by commas, advancing to the next page at the end of each page. With a loop that I thought would work, I get "No clickable input with name dgRegProjList$_ctl29_ctl2 ...", which is the name of the link on the page. Thanks again for your help.
      Okay, it looks like the reason you cannot click the link is that it is handled by JavaScript. WWW::Mechanize does not interpret JavaScript, so you will have to do this yourself.
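
A control name like dgRegProjList$_ctl29$_ctl2 on an .aspx page usually means an ASP.NET postback link, i.e. an href of the form javascript:__doPostBack('target','argument'). I have not checked the page source, so treat the following as a sketch under that assumption. The browser-side script copies the two values into the hidden fields __EVENTTARGET and __EVENTARGUMENT and submits the form, which is something you can do by hand. The helper names parse_postback and do_postback are made up for illustration, and the choice of form number 1 is a guess. The block only exercises the parsing step, so it runs without a network connection.

```perl
use strict;
use warnings;

# Pull the target and argument out of an ASP.NET-style link such as
#   javascript:__doPostBack('dgRegProjList$_ctl29$_ctl2','')
# Returns an empty list if the href does not look like a postback link.
sub parse_postback {
    my ($href) = @_;
    return unless defined $href;
    if ($href =~ /__doPostBack\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)/) {
        return ($1, $2);
    }
    return;
}

# Hypothetical helper: do by hand what the page's JavaScript would do.
# $mech is a WWW::Mechanize object that has already fetched the page.
sub do_postback {
    my ($mech, $target, $argument) = @_;
    $mech->form_number(1);    # assumption: the ASP.NET form is the first form
    # Depending on the HTML::Form version, setting hidden fields may warn
    # that the input is read-only; the value should still be set.
    $mech->field('__EVENTTARGET',   $target);
    $mech->field('__EVENTARGUMENT', $argument);
    return $mech->submit();
}

# Demonstrate the parsing step on a sample href (no network needed).
my ($target, $argument) =
    parse_postback(q{javascript:__doPostBack('dgRegProjList$_ctl29$_ctl2','')});
print "target=$target argument='$argument'\n";
```

In a real script you would call do_postback($mech, $target, $argument) where the loop currently calls click_button, taking $target from the link's href in the fetched page rather than building the name by hand.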

      There is an (oft-referred-to) thread on this which may be of some help: How to handle javascript.
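
On the white space and the comma-separated output: stripping the tags from the whole page throws away the table structure, so it is probably easier to pull the cells out of the results table first (HTML::TableExtract from CPAN is one option) and tidy each cell afterwards. Here is a minimal sketch of the tidying and joining step only, using core Perl. The names tidy and csv_line are made up for illustration, and the sample row is invented, not real data from the site.

```perl
use strict;
use warnings;

# Trim a cell and squeeze internal runs of whitespace down to one space.
sub tidy {
    my ($text) = @_;
    $text = '' unless defined $text;
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    return $text;
}

# Join cells into one CSV line, quoting any cell that contains a comma
# or a double quote (embedded quotes are doubled).
sub csv_line {
    my @cells = map { tidy($_) } @_;
    for my $cell (@cells) {
        if ($cell =~ /[",]/) {
            $cell =~ s/"/""/g;
            $cell = qq{"$cell"};
        }
    }
    return join(',', @cells);
}

# Invented sample row, standing in for one row of the results table.
print csv_line("  Example   Project\n", 'Acme, Inc.', ' Springfield ', 'IL'), "\n";
```

For the sample row this prints Example Project,"Acme, Inc.",Springfield,IL. If the data gets messier than this, a dedicated CSV module would be a safer choice than hand-rolled quoting.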

      Also, just to make you aware, you may want to look at the Terms and Conditions on the site. I had a quick peek and noticed the following:
      Unless otherwise specified hereunder, You may not sell, rent, modify, reproduce, display, distribute, redistribute, republicize, retransmit, participate in the transfer or sale, create derivative works, or in any way exploit or otherwise use the Site Content, in whole or in part, in any way without the respective owner's prior written consent.
      Pretty heavy, huh?

      Check out the Ethics of Webbots for some PerlMonks opinions.