knewter has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I'm trying to write a screen scraper to pull a list of course offerings from a website via a form action. The first step works beautifully, but the subsequent steps all claim that the subjects have no associated courses. Clearly I've goofed something up, but I'm having trouble fixing it.
#!/usr/bin/perl -w
use strict;
$|++;

use File::Basename;
use WWW::Mechanize 0.48;
use HTML::TokeParser;
use HTML::TreeBuilder;

my $mech = WWW::Mechanize->new();

# Get the starting page
$mech->get( "http://oasis.auburn.edu/ia-bin/tsrvweb.exe?&WID=W&tserve_tip_write=||WID&ConfigName=rcrssecthp1-l&ReqNum=1&TransactionSource=H&tserve_trans_config=rcrssecthp1-l.cfg&tserve_host_code=HostZero&tserve_tiphost_code=TipZero" );
$mech->success or die $mech->response->status_line;

# Select the Term form, fill the fields, and submit
$mech->form_number( 1 );
$mech->field( Term => "2005S" );
$mech->click();
$mech->success or die "post failed: ", $mech->response->status_line;

my $parser  = HTML::TreeBuilder->new_from_content( \$mech->{content} );
my @options = $parser->look_down( _tag => 'option' );

foreach my $option (@options) {
    my $o = $option->attr('value');

    $mech->form_number( 1 );
    $mech->field( Subject => $o );
    $mech->click();
    $mech->success or die "post failed: ", $mech->response->status_line;

    my $oparser = HTML::TreeBuilder->new_from_content( \$mech->{content} );
    my @courses = $oparser->look_down( _tag => 'option' );

    my $mech2 = $mech;
    foreach my $course (@courses) {
        my $c = $course->attr('value');
        $mech2->form_number( 1 );
        $mech2->field( CourseID => $c );
        $mech2->click;
        $mech2->success or die "post failed: ", $mech2->response->status_line;
        print "\t", $c, "\n";
        $mech2->back();
    }

    print $o, "\n";
    $mech->back();
}

Partial output below:

   ACCT-2110
   ACCT-2210
   ACCT-2910
ACCT
ADED
AERO
...

It SHOULD have one of those indented course blocks above each of the subject codes (ACCT, ADED, AERO), but it does not. This is the problem.

P.S. - my apologies for having initially placed this on my scratchpad. I am a newbie, and I had seen someone request that of someone else earlier, so I thought it was generally acceptable. Admissions of foolishness should be implied...

update!!
Right, so it looks like the problem is a session timeout now. Any hints on keeping an https session alive longer when using WWW::Mechanize?
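One possible workaround (an untested sketch, assuming the session is expiring on the server between requests) would be to re-fetch the starting page and re-select the Term for each subject, rather than relying on back():

# Untested sketch: re-establish the session for each Subject instead of
# using back(), so a server-side session timeout can't break later steps.
# Uses the same starting URL, form, and field names as the code above.
sub fresh_term_results {
    my ($mech, $start_url, $term) = @_;
    $mech->get($start_url);
    $mech->success or die $mech->response->status_line;
    $mech->form_number(1);
    $mech->field( Term => $term );
    $mech->click();
    $mech->success or die "post failed: ", $mech->response->status_line;
    return HTML::TreeBuilder->new_from_content( \$mech->{content} );
}

# Then, inside the Subject loop, call fresh_term_results() again before
# submitting each Subject, instead of calling $mech->back().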

Replies are listed 'Best First'.
Re: some screenscraping help...
by tphyahoo (Vicar) on Jan 26, 2005 at 08:49 UTC
    For timing out with mechanize....

    From WWW::Mechanize

    "WWW::Mechanize is a proper subclass of LWP::UserAgent and you can also use any of LWP::UserAgent's methods."

    From LWP::UserAgent:

    $ua->timeout(10);
    i.e., that sets the timeout to 10 seconds, I think.
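    For instance (a minimal, untested sketch; WWW::Mechanize passes extra constructor options through to LWP::UserAgent, so the timeout can also be set at creation time):

        use WWW::Mechanize;

        # Client-side request timeout in seconds; this controls how long we
        # wait for the server to respond, not how long the server keeps the
        # session alive.
        my $mech = WWW::Mechanize->new( timeout => 30 );

        # ...or change it later on an existing object:
        $mech->timeout(30);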

    Hope this helps!

    Thomas.

Re: some screenscraping help...
by talexb (Chancellor) on Jan 25, 2005 at 20:51 UTC
      The code's all up on my scratchpad ..

    Yes, well, that does us no good in three months' time when someone wants to search for work that people have done on screen scraping, comes across this node, goes to your scratchpad and discovers not screen-scraping code but .. URLs for poetry by Leonard Nimoy .. or something else.

    It's best if you actually post some code here for us to check out.

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

    Update: Thanks!

      Many thanks for this. A couple of others have already pointed this out privately as well (and now that I've updated the entry, your response to it suffers the same fate), and I've fixed it...