Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to archive the wiring harness info for each motorcycle type listed at http://www.datatool.co.uk/bikes1.asp.

My problem is that I can't get the indexing of the arrays to cycle through each model of a certain manufacturer of a motorcycle.

First I store each motorcycle manufacturer in an array; that part works.

Then, for each manufacturer, I go to that manufacturer's page and store the <option value=X> of each motorcycle, where X identifies a particular model.

Then I go to that model's page and call HTML::TokeParser->get_text('/table') to store the table data.

The problem is that I can't index through each model/manufacturer pair correctly to reach each wiring harness page. What's wrong with my code? I know I'm close.

In the end, I want to store all the wiring harness info in a CSV file.

Here's my code:
#!/usr/bin/perl -w
use LWP::UserAgent;
use HTML::TokeParser;
use Data::Dumper;
my $url = 'http://www.datatool.co.uk/bikes1.asp';
my $url2 = 'http://www.datatool.co.uk/bikes2.asp';
my $browser = LWP::UserAgent->new();
my $response = $browser->get($url);
die "Error getting $url: ", $resp->status_line
    unless $response->is_success;
die "It's not HTML, it's ", $resp->content-type
    unless $response->content_type eq 'text/html';
my $html = $response->content;
open(DAT,'>',"c:\\cas.txt") || die("Cannot Open File");
my $stream = HTML::TokeParser->new( \$html )
    || die "Couldn't read HTML string: $!";
my @manufact;
my @models;
while ( my $token = $stream->get_token ) {
    if ($token->[0] eq 'S' and $token->[1] eq 'option' and $token->[2]{'value'} ne ''){
        push(@manufact, $token->[2]{'value'});
    }
}
#print Dumper @manufact;
#sleep 5;
my $i=0;
foreach(@manufact){
    $response = $browser->post(
        $url,
        [
            'Manufacturer' => "$manufact[$i]",
            'btnSearch' => 'Search for matching Models'
        ]
    );
    die "Error getting $url: ", $resp->status_line
        unless $response->is_success;
    die "It's not HTML, it's ", $resp->content-type
        unless $response->content_type eq 'text/html';
    $html = $response->content;
    # print $html;
    $stream = HTML::TokeParser->new( \$html )
        || die "Couldn't read HTML string: $!";
    while ( $token = $stream->get_token ) {
        if ($token->[0] eq 'S' and $token->[1] eq 'option' and $token->[2]{'value'} ne '' and $token->[2]{'value'} lt 'A') {
            push(@models, $token->[2]{'value'});
        }
    }
    my $x=0;
    while (@models){
        $response = $browser->post(
            $url2,
            [
                'Manufacturer' => "$manufact[$i]",
                'Model' => "$models[$x]",
                'btnSearch' => 'Search for matching Models'
            ]
        );
        die "Error getting $url2: ", $resp->status_line
            unless $response->is_success;
        die "It's not HTML, it's ", $resp->content-type
            unless $response->content_type eq 'text/html';
        $x++;
        $html = $response->content;
        $stream = HTML::TokeParser->new( \$html )
            || die "Couldn't read HTML string: $!";
        my $text = $stream->get_text('/table');
        print $text;
        print DAT $text;
    }
}
close(DAT);

Replies are listed 'Best First'.
Re: Trying To Archive Web Info With My Buggy Code??
by davidrw (Prior) on Aug 05, 2005 at 19:07 UTC
    First, make sure you use strict; in your code. Adding that and running perl -c immediately flags $resp->content-type; it should be $response->content_type. Likewise, $resp->status_line should be $response->status_line, and the second while/$token loop needs a 'my' because the first loop's 'my $token' is scoped to that loop only.
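    To illustrate the point about strict, here is a minimal sketch that compiles a small snippet through a string eval and captures the compile error. The snippet is a hypothetical stand-in for the script in the question, not the script itself; only the misspelled variable name $resp is taken from it.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the original script: $response is declared,
# but the code then refers to $resp, which was never declared.
my $snippet = q{
    use strict;
    my $response = 'ok';
    my $line = $resp;
    1;
};

# A string eval compiles the snippet at run time; a compile error
# makes eval return undef and leaves the message in $@.
my $ok    = eval $snippet;
my $error = $@;

print "strict rejected the snippet:\n$error" unless $ok;
```

    Without strict, the same typo would silently create an empty global $resp, and the mistake would only surface when a method was called on it at run time.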

    Also, using WWW::Mechanize might simplify things for you ...

    /me goes off to look at actual code now to try to answer primary question ..
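    On the primary question, reading the posted code suggests three things that break the indexing: $i is never incremented, so every request uses $manufact[0]; while (@models) never ends, because nothing is ever removed from @models; and @models is declared once outside the loop, so models from earlier manufacturers pile up. A minimal sketch of the loop shape that avoids all three, using a hypothetical %models_for hash with made-up values in place of the scraped option values:

```perl
use strict;
use warnings;

# Hypothetical data standing in for the <option> values scraped from the site.
my %models_for = (
    HONDA  => [ 101, 102 ],
    SUZUKI => [ 201 ],
);

my @pairs;
foreach my $manufacturer (sort keys %models_for) {
    # Declared inside the outer loop, so the list starts fresh
    # for every manufacturer instead of accumulating.
    my @models = @{ $models_for{$manufacturer} };

    # foreach visits each element once and then stops;
    # no index counter is needed.
    foreach my $model (@models) {
        # In the real script, this is where one POST per pair would go.
        push @pairs, "$manufacturer/$model";
    }
}

print "$_\n" for @pairs;
```

    Using named loop variables instead of $manufact[$i] and $models[$x] removes the counters entirely, which is the same shape the rewrite below ends up with.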
      Here's a rewrite with WWW::Mechanize:
      #!/usr/bin/perl
      use strict;
      use warnings;
      use WWW::Mechanize;
      use HTML::TokeParser;    # needed for the table-parsing step below
      my $url = 'http://www.datatool.co.uk/bikes1.asp';
      my $url2 = 'http://www.datatool.co.uk/bikes2.asp';
      open(DAT,'>',"c:\\cas.txt") || die("Cannot Open File");
      my $mech = WWW::Mechanize->new();
      $mech->get($url);
      $mech->form_number(1);
      # Drop the empty placeholder option
      my @manufacturers = grep $_, $mech->current_form->find_input( "Manufacturer" )->value_names;
      foreach my $manufacturer ( @manufacturers ){
        $mech->get($url);
        $mech->submit_form(
          form_number => 1,
          fields => { Manufacturer => $manufacturer },
        );
        $mech->form_number(1);
        my @models = grep $_ !~ /Select Model/, $mech->current_form->find_input( "Model" )->value_names;
        foreach my $model ( @models ){
          # Work on a clone so $mech stays on the manufacturer's page
          my $mech2 = $mech->clone;
          $mech2->submit_form(
            form_number => 1,
            fields => { Model => $model },
          );
          my $html = $mech2->content;
          ######################################################
          ######## This part is left as exercise to reader #####
          my $stream = HTML::TokeParser->new( \$html )
            || die "Couldn't read HTML string: $!";
          my $text = $stream->get_text('/table');
          print DAT $text;
          ######################################################
        }
      }
      close DAT;
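      Since the stated goal is a CSV file, the part left as an exercise could end by writing one row per model. The following is only a sketch of the quoting step in plain Perl; csv_field and csv_row are hypothetical helper names, and the sample values are made up. The Text::CSV module from CPAN handles the edge cases more thoroughly, if installing it is an option.

```perl
use strict;
use warnings;

# Quote a single field using the common CSV convention: wrap it in
# double quotes if it contains a comma, a quote, or a line break,
# and double any embedded quotes.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;
    if ($value =~ /[",\r\n]/) {
        $value =~ s/"/""/g;
        return qq{"$value"};
    }
    return $value;
}

# Join already-extracted cell values into one CSV line.
sub csv_row {
    return join(',', map { csv_field($_) } @_) . "\n";
}

# Made-up example row: manufacturer, model, then two table cells.
print csv_row('HONDA', 'CBR600', 'Red/White', 'Ignition, switched');
```

      In the rewrite above, a call like print DAT csv_row($manufacturer, $model, @cells) would replace print DAT $text, assuming @cells holds the table cells pulled out of the page.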