Re: Web Spider problem

Hi bauer1sc

When I run the code above I cannot access www.blankurl.biz; it does not appear to exist. It will be difficult to assess the problems you are having with the form without being able to visit the web page...

As far as the stripping of HTML goes, it works fine on Google using the code below:

use strict;
use warnings;
use WWW::Mechanize;
use HTML::Strip;

my($url) = 'http://www.google.co.uk';
# autocheck => 1 already makes get() die on an HTTP error
my $mech = WWW::Mechanize->new(autocheck => 1);
my $hs = HTML::Strip->new();

$mech->agent_alias('Linux Mozilla');

$mech->get($url) or die "Page $url can't be reached";
print "Made it past the url test\n";

# Strip the markup and print the remaining text
my $page = $mech->content;
my $clean_text = $hs->parse( $page );
$hs->eof;
print $clean_text;

The output is:

Made it past the url test

iGoogle | Sign in Web      Images      News      Maps New!      Products      Groups      Scholar      more »      Advanced Search    Preferences    Language Tools Search:  the web  pages from the UK Advertising Programmes - Business Solutions - About Google - Go to Google.com ©2007 Google

That is the page content with all the HTML stripped out. What were you expecting?

Re^2: Web Spider problem by bauer1sc (Initiate) on Jul 11, 2007 at 16:48 UTC
I took out the URL because I didn't know whether it was a problem or not. Below is the code and the output. I fixed the problem I was having with selecting all the objects in the object box.

use strict;
use WWW::Mechanize;
use Time::Local;
use POSIX 'strftime';
use HTML::Strip;
#use Whitespace;

my($url) = 'http://www.usgbc.org/LEED/Project/RegisteredProjectList.aspx?CMSPageID=243&CategoryID=19&';
my($pageCheck) = "";
my $mech = WWW::Mechanize->new(autocheck =>1);
my $hs = HTML::Strip->new();
my($linkName) = 'dgRegProjList$_ctl29$_ctl';
my($linkNumber) = 1;

$mech->agent_alias('Linux Mozilla');

$mech->get($url) or die "Page $url can't be reached";
print "Made it past the url test";
my($form) = $mech->forms();
my $menu = $form->find_input("lstLeedRating");
print for $menu->possible_values();
$pageCheck = $mech->click_button(name => "btnSearch");
if($pageCheck->is_success){
    print $pageCheck->content;
}
else {
    print STDERR $pageCheck->status_line, "\n";
    die "Page with those fields not found!";
}
my $page = $mech->content;
my $clean_text = $hs->parse( $page );
$hs->eof;
print $clean_text;

for($linkNumber= 2; $linkNumber <= 187; $linkNumber++)
{
    $pageCheck = $mech->click_button(name => 'dgRegProjList$_ctl29$_ctl'.$linkNumber);
    if($pageCheck->is_success){
        print $pageCheck->content;
    }
    else {
        print STDERR $pageCheck->status_line, "\n";
        die "Page with those fields not found!";
    }
    my $page = $mech->content;
    my $clean_text = $hs->parse( $page );
    $hs->eof;
    print $clean_text;
}

I get results:

entire page
#Then the part I want....
Project Name Owner City State Country LEED Rating System
#businesses listed under this, and I was hoping to get rid of the white space.

But the output I'm looking for is the businesses, with their info separated by commas, and at the end of each page the script should advance to the next page. With a loop that I thought would work, I get `No clickable input with name dgRegProjList$_ctl29_ctl2 ... which is the name of the link on the page.`

Thanks again for your help.
Re^3: Web Spider problem by rpanman (Scribe) on Jul 11, 2007 at 17:43 UTC
Okay, it looks like the reason you cannot click the link is that it is handled by JavaScript. WWW::Mechanize does not interpret JavaScript, so you will have to do this yourself. There is an (oft referred to) thread on this which may be of some help: How to handle javascript.

Also, just to make you aware, you may want to look at the Terms and Conditions on the site. I had a quick peek and noticed the following:

`Unless otherwise specified hereunder, You may not sell, rent, modify, reproduce, display, distribute, redistribute, republicize, retransmit, participate in the transfer or sale, create derivative works, or in any way exploit or otherwise use the Site Content, in whole or in part, in any way without the respective owner's prior written consent.`

Pretty heavy, huh? Check out the Ethics of Webbots for some perlmonks opinions.
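
For what it's worth, "doing it yourself" on an ASP.NET page usually means imitating the `__doPostBack(target, argument)` JavaScript helper, which fills the hidden `__EVENTTARGET` and `__EVENTARGUMENT` fields and then submits the form. I have not confirmed that this particular site works that way, so treat the following as a minimal offline sketch under that assumption. The HTML snippet, the example.com base URL and the `postback_request` helper are made up for illustration; only the link name comes from the post above.

```perl
use strict;
use warnings;
use HTML::Form;

# Hypothetical stand-in for the real page: an ASP.NET-style form with the
# hidden fields that __doPostBack() normally fills in before submitting.
my $html = <<'HTML';
<form method="post" action="RegisteredProjectList.aspx">
  <input type="hidden" name="__EVENTTARGET" value="">
  <input type="hidden" name="__EVENTARGUMENT" value="">
  <input type="hidden" name="__VIEWSTATE" value="dummy">
</form>
HTML

# Illustrative helper (not part of any module): build the request that
# clicking a __doPostBack link would send, without sending it.
sub postback_request {
    my ($form, $target, $argument) = @_;
    for my $name ('__EVENTTARGET', '__EVENTARGUMENT') {
        my $input = $form->find_input($name)
            or die "Form has no $name field";
        $input->readonly(0);    # hidden inputs start out read-only
    }
    $form->value('__EVENTTARGET',   $target);
    $form->value('__EVENTARGUMENT', defined $argument ? $argument : '');
    return $form->make_request;    # an HTTP::Request object
}

my ($form)  = HTML::Form->parse($html, 'http://www.example.com/');
my $request = postback_request($form, 'dgRegProjList$_ctl29$_ctl2');

print $request->method, ' ', $request->uri, "\n";
print $request->content, "\n";
```

With a live WWW::Mechanize object the same idea would be to select the form, set the two fields with `$mech->field(...)` and call `$mech->submit` instead of `click_button`, since the paging link is not a form button. Whether the server accepts the request also depends on the other hidden fields the page sends back, such as `__VIEWSTATE`.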