Re^2: Web Spider problem

I took out the URL because I didn't know whether posting it was a problem. Below is the code and the output. I fixed the problem I was having with selecting all the objects in the object box.

use strict;
use WWW::Mechanize;
use Time::Local;
use POSIX 'strftime';
use HTML::Strip;
#use Whitespace;

my($url) = 'http://www.usgbc.org/LEED/Project/RegisteredProjectList.aspx?CMSPageID=243&CategoryID=19&';
my($pageCheck) = "";
my $mech = WWW::Mechanize->new(autocheck =>1);
my $hs = HTML::Strip->new();
my($linkName) = 'dgRegProjList$_ctl29$_ctl';
my($linkNumber) = 1;

$mech->agent_alias('Linux Mozilla');

$mech->get($url) or die "Page $url can't be reached";
print "Made it past the url test";

my($form) = $mech->forms();
my $menu = $form->find_input("lstLeedRating");
print for $menu->possible_values();

$pageCheck = $mech->click_button(name => "btnSearch");

    if($pageCheck->is_success){
             print $pageCheck->content;
             }
        else {
             print STDERR $pageCheck->status_line, "\n";
             die "Page with those fields not found!";
             }

        my $page = $mech->content;

    my $clean_text = $hs->parse( $page );

        $hs->eof;

        print $clean_text;



for($linkNumber= 2; $linkNumber <= 187; $linkNumber++)
{
        $pageCheck = $mech->click_button(name => 'dgRegProjList$_ctl29$_ctl'.$linkNumber);

    if($pageCheck->is_success){
             print $pageCheck->content;
             }
        else {
             print STDERR $pageCheck->status_line, "\n";
             die "Page with those fields not found!";
             }
        my $page = $mech->content;

    my $clean_text = $hs->parse( $page );

        $hs->eof;

        print $clean_text;

}

I get results

entire page
#Then the part I Want....
Project Name Owner City State Country LEED Rating System
#businesses listed underneath, and I was hoping to get rid of the white space.

The output I'm looking for is each business with its info separated by commas, and at the end of each page the script should advance to the next page. With the loop that I thought would work, I get "No clickable input with name dgRegProjList$_ctl29_ctl2 ...", which is the name of the link on the page. Thanks again for your help.
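The comma-separated output described above could be sketched roughly as follows. This is only an illustration: `row_to_csv` is a hypothetical helper, the sample values are made up, and it assumes the cells of each table row have already been pulled out of the page as a list of strings.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one table row (a list of cell strings)
# into a single comma-separated line.
sub row_to_csv {
    my @cells = @_;                 # copy, so the caller's data is untouched
    for my $cell (@cells) {
        $cell = '' unless defined $cell;
        $cell =~ s/\s+/ /g;         # collapse runs of whitespace to one space
        $cell =~ s/^ | $//g;        # trim a leading or trailing space
        if ($cell =~ /[",]/) {      # quote cells containing commas or quotes
            $cell =~ s/"/""/g;
            $cell = qq{"$cell"};
        }
    }
    return join ',', @cells;
}

# Made-up sample row in the column order shown in the output above.
print row_to_csv("  Sample   Project\n", 'Sample Owner, Inc.',
                 'Sample City', 'XX', 'US', 'Sample Rating'), "\n";
```

For anything beyond simple cases, a CPAN module such as Text::CSV is probably a safer choice than hand-rolled quoting.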


Replies are listed 'Best First'.
Re^3: Web Spider problem by rpanman (Scribe) on Jul 11, 2007 at 17:43 UTC
Okay, it looks like the reason you cannot click the link is that it is handled by JavaScript. WWW::Mechanize does not interpret JavaScript, so you will have to work out what the script does and reproduce it yourself. There is an (oft-referred-to) thread on this which may be of some help: How to handle javascript.

Also, just to make you aware, you may want to look at the Terms and Conditions on the site. I had a quick peek and noticed the following:

"Unless otherwise specified hereunder, You may not sell, rent, modify, reproduce, display, distribute, redistribute, republicize, retransmit, participate in the transfer or sale, create derivative works, or in any way exploit or otherwise use the Site Content, in whole or in part, in any way without the respective owner's prior written consent."

Pretty heavy, huh? Check out the Ethics of Webbots for some PerlMonks opinions.
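As a rough sketch of what "doing it yourself" might involve: the control name and the .aspx URL suggest an ASP.NET page, where such links usually call `__doPostBack(target, argument)`. That is an assumption here, not something checked against the site. The hypothetical helper `parse_postback` below only extracts the two arguments from such a link, and the sample href is constructed from the control name in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the two arguments of an ASP.NET-style
# postback link, e.g.
#   javascript:__doPostBack('dgRegProjList$_ctl29$_ctl2','')
# Returns (target, argument), or nothing if the href is not a postback.
sub parse_postback {
    my ($href) = @_;
    return unless defined $href;
    if ($href =~ /__doPostBack\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)/) {
        return ($1, $2);
    }
    return;
}

# Assumed href, built from the control name mentioned in the question.
my $href = q{javascript:__doPostBack('dgRegProjList$_ctl29$_ctl2','')};
my ($target, $argument) = parse_postback($href);
print "target=$target argument=$argument\n";
```

If the page really works this way, the usual approach is to put these two values into the form's hidden `__EVENTTARGET` and `__EVENTARGUMENT` fields with `$mech->field(...)` and then call `$mech->submit()` instead of `click_button`. Whether this particular site accepts that is untested.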
