Trying To Archive Web Info With My Buggy Code??

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to archive the wiring harness info for each motorcycle model listed at http://www.datatool.co.uk/bikes1.asp.

My problem is that I can't get the array indexing to cycle through each model of a given manufacturer.

First I store each manufacturer in an array. That part works.

Then, for each manufacturer, I go to that manufacturer's page and store the <option value=X> for each motorcycle, where X identifies a particular model.

Then I go to that model's page and call HTML::TokeParser's get_text('/table') to store the table data.

The problem is that I can't index through each model/manufacturer pair correctly to reach each wiring harness page. What's wrong with my code? I know I'm close.

The end result I want is all the wiring harness info stored in a CSV file.

Here's my code:

#!/usr/bin/perl -w

use LWP::UserAgent;
use HTML::TokeParser;
use Data::Dumper;

my $url = 'http://www.datatool.co.uk/bikes1.asp';
my $url2 = 'http://www.datatool.co.uk/bikes2.asp';

my $browser = LWP::UserAgent->new();

my $response = $browser->get($url);
die "Error getting $url: ", $resp->status_line
  unless $response->is_success;
die "It's not HTML, it's ", $resp->content-type
  unless $response->content_type eq 'text/html';

my $html = $response->content;

open(DAT,'>',"c:\\cas.txt") || die("Cannot Open File");

my $stream = HTML::TokeParser->new( \$html )
  || die "Couldn't read HTML string: $!";

my @manufact;
my @models;

while ( my $token = $stream->get_token ) {

 if ($token->[0] eq 'S' and $token->[1] eq 'option' and $token->[2]{'value'} ne ''){
  push(@manufact, $token->[2]{'value'});
 }
}
#print Dumper @manufact;
#sleep 5;

my $i=0;
foreach(@manufact){
  $response = $browser->post(
    $url,
    [
     'Manufacturer' => "$manufact[$i]",
     'btnSearch' => 'Search for matching Models'
    ]
    );
  die "Error getting $url: ", $resp->status_line
    unless $response->is_success;
  die "It's not HTML, it's ", $resp->content-type
    unless $response->content_type eq 'text/html';


  $html = $response->content;
#  print $html;

  $stream = HTML::TokeParser->new( \$html )
      || die "Couldn't read HTML string: $!";

  while ( $token = $stream->get_token ) {
    if ($token->[0] eq 'S' and $token->[1] eq 'option' and $token->[2]{'value'} ne '' and $token->[2]{'value'} lt 'A') {
    push(@models, $token->[2]{'value'});
    }
  }

  my $x=0;
  while (@models){
    $response = $browser->post(
    $url2,
    [
      'Manufacturer' => "$manufact[$i]",
      'Model' => "$models[$x]",
      'btnSearch' => 'Search for matching Models'
    ]
    );
    die "Error getting $url2: ", $resp->status_line
      unless $response->is_success;
    die "It's not HTML, it's ", $resp->content-type
      unless $response->content_type eq 'text/html';

    $x++;
    $html = $response->content;
    $stream = HTML::TokeParser->new( \$html )
      || die "Couldn't read HTML string: $!";

    my $text = $stream->get_text('/table');
    print $text;
    print DAT $text;
    }
  }

close(DAT);

Replies are listed 'Best First'.
Re: Trying To Archive Web Info With My Buggy Code?? by davidrw (Prior) on Aug 05, 2005 at 19:07 UTC
First, make sure you `use strict;` in your code. Adding that and running `perl -c` immediately flags `$resp->content-type`: it should be `$response->content_type`. Likewise, `$resp->status_line` should be `$response->status_line`, and one of the while/$token loops needs a `my`. Also, WWW::Mechanize might simplify things for you ... /me goes off to look at the actual code now to try to answer the primary question ..
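To illustrate the point about `use strict;`, here is a minimal sketch of how the `$resp` typo gets caught at compile time. The helper name `strict_compile_error` is made up for this example; it just compiles a snippet inside a string `eval` so the error can be inspected without running any of the LWP code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: compile (but never run) a snippet under strict and
# return the compile-time error message, or '' if it compiles cleanly.
sub strict_compile_error {
    my ($snippet) = @_;
    # Wrapping the snippet in an anonymous sub means it is only compiled.
    my $ok = eval "use strict; my \$check = sub { $snippet }; 1";
    return $ok ? '' : $@;
}

# The typo from the original post: $resp was never declared.
my $buggy = 'my $response; die "Error: ", $resp->status_line;';
# The corrected line uses the variable that actually exists.
my $fixed = 'my $response; die "Error: ", $response->status_line;';

print "buggy: ", strict_compile_error($buggy) || "compiles\n";
print "fixed: ", strict_compile_error($fixed) || "compiles\n";
```

Without strict, the misspelled `$resp` is silently treated as an undefined package variable, so the mistake only surfaces when one of the `die` branches actually fires.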
Re^2: Trying To Archive Web Info With My Buggy Code?? by davidrw (Prior) on Aug 05, 2005 at 19:38 UTC
Here's a rewrite with WWW::Mechanize:

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
use HTML::TokeParser;   # used below to pull the text out of the table

my $url  = 'http://www.datatool.co.uk/bikes1.asp';
my $url2 = 'http://www.datatool.co.uk/bikes2.asp';

open(DAT,'>',"c:\\cas.txt") || die("Cannot Open File");

my $mech = WWW::Mechanize->new();
$mech->get($url);

# Collect the manufacturer names from the first form, skipping the empty entry
$mech->form(1);
my @manufacturers = grep $_, $mech->current_form->find_input( "Manufacturer" )->value_names;

foreach my $manufacturer ( @manufacturers ){
  # Reload the start page and submit the form for this manufacturer
  $mech->get($url);
  $mech->submit_form(
    form_number => 1,
    fields => { Manufacturer => $manufacturer },
  );

  # The resulting page lists this manufacturer's models
  $mech->form(1);
  my @models = grep $_ !~ /Select Model/, $mech->current_form->find_input( "Model" )->value_names;

  foreach my $model ( @models ){
    # Work on a clone so $mech stays on the model-selection page
    my $mech2 = $mech->clone;
    $mech2->submit_form(
      form_number => 1,
      fields => { Model => $model },
    );
    my $html = $mech2->content;
    ######################################################
    ######## This part is left as exercise to reader #####
    my $stream = HTML::TokeParser->new( \$html ) || die "Couldn't read HTML string: $!";
    my $text = $stream->get_text('/table');
    print DAT $text;
    ######################################################
  }
}
close DAT;
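For the part left as an exercise, one possible approach is to walk the table cell by cell instead of grabbing all the text at once, so each row can become one CSV line. This is only a sketch under assumptions: the function names `csv_field` and `table_to_csv_lines` are made up, the sample HTML is invented rather than taken from the real site, only the first table is read, and every cell is assumed to be closed by `</td>` or `</th>`. A CPAN module such as Text::CSV would handle the quoting more thoroughly.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Quote one field in the usual CSV style: wrap it in double quotes when it
# contains a comma, quote or line break, doubling any embedded quotes.
sub csv_field {
    my ($field) = @_;
    $field = '' unless defined $field;
    if ($field =~ /[",\r\n]/) {
        $field =~ s/"/""/g;
        return qq{"$field"};
    }
    return $field;
}

# Return one CSV line per row of the first <table> found in $html.
sub table_to_csv_lines {
    my ($html) = @_;
    my $stream = HTML::TokeParser->new(\$html)
        or die "Couldn't read HTML string: $!";
    my @lines;
    $stream->get_tag('table') or return @lines;    # skip to the first table

    my @cells;
    while (my $tag = $stream->get_tag('tr', 'td', 'th', '/table')) {
        my $name = $tag->[0];
        if ($name eq 'td' or $name eq 'th') {
            # Text up to the matching end tag, with whitespace collapsed
            push @cells, $stream->get_trimmed_text("/$name");
            next;
        }
        # A new <tr> or the closing </table> ends the row collected so far
        push @lines, join(',', map { csv_field($_) } @cells) if @cells;
        @cells = ();
        last if $name eq '/table';
    }
    # Flush a trailing row if the table was never closed
    push @lines, join(',', map { csv_field($_) } @cells) if @cells;
    return @lines;
}

# Invented sample markup, only to show the shape of the output
my $sample = '<table><tr><th>Wire</th><th>Colour</th></tr>'
           . '<tr><td>Ignition</td><td>Red, white trace</td></tr></table>';
print "$_\n" for table_to_csv_lines($sample);
```

In the rewrite above, the `print DAT $text;` line could then become something like `print DAT "$_\n" for table_to_csv_lines($html);`, possibly with the manufacturer and model prepended to each line.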