Dear Monks,
I'm trying to write a screen scraper that pulls a list of course offerings from a website by submitting its forms. The first step works beautifully, but the subsequent steps all claim that the subjects have no associated courses. Clearly I've goofed something up, but I'm having trouble fixing it.
#!/usr/bin/perl -w
use strict;
$|++;

use File::Basename;
use WWW::Mechanize 0.48;
use HTML::TokeParser;
use HTML::TreeBuilder;

my $mech = WWW::Mechanize->new();

# Get the starting page
$mech->get( "http://oasis.auburn.edu/ia-bin/tsrvweb.exe?&WID=W&tserve_tip_write=||WID&ConfigName=rcrssecthp1-l&ReqNum=1&TransactionSource=H&tserve_trans_config=rcrssecthp1-l.cfg&tserve_host_code=HostZero&tserve_tiphost_code=TipZero" );
$mech->success or die $mech->response->status_line;

# Select the Term form, fill the fields, and submit
$mech->form_number( 1 );
$mech->field( Term => "2005S" );
$mech->click();
$mech->success or die "post failed: ", $mech->response->status_line;

my $parser = HTML::TreeBuilder->new_from_content( \$mech->{content} );
my @options = $parser->look_down( _tag => 'option' );

foreach my $option (@options) {
    my $o = $option->attr('value') ;
    $mech->form_number( 1 );
    $mech->field( Subject => $o );
    $mech->click();
    $mech->success or die "post failed: ", $mech->response->status_line;

    my $oparser = HTML::TreeBuilder->new_from_content( \$mech->{content} );
    my @courses = $oparser->look_down( _tag => 'option' );
    my $mech2 = $mech;

    foreach my $course (@courses) {
        my $c = $course->attr('value');
        $mech2->form_number( 1 );
        $mech2->field( CourseID => $c );
        $mech2->click;
        $mech2->success or die "post failed: ", $mech2->response->status_line;
        print "\t", $c, "\n";
        $mech2->back();
    }

    print $o, "\n";
    $mech->back();
}

Partial output below:

   ACCT-2110
   ACCT-2210
   ACCT-2910
ACCT
ADED
AERO
...

Each of the main blocks (ACCT, ADED, AERO) SHOULD have an indented block of courses above it, like the one above ACCT. The later ones do not. This is the problem.

P.S. My apologies for having initially placed this on my scratchpad. I am a newbie, and I had seen someone ask another poster to do that earlier, so I thought it was generally acceptable. Admissions of foolishness should be implied...

Update:
Right, so it looks like the problem is a session timeout now. Any hints on keeping an HTTPS session alive longer when using WWW::Mechanize?
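
A side note on the script above, separate from the timeout: `my $mech2 = $mech;` copies only the reference, so `$mech2` and `$mech` are the same WWW::Mechanize object and share one page history. That may well be harmless here, but it means `$mech2->back()` also moves `$mech`. The sketch below illustrates the aliasing with a hypothetical `TinyBrowser` class (an assumption made up for this example, not part of WWW::Mechanize) that only tracks a history stack. It shows Perl reference semantics and is not offered as a fix for the missing courses.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# TinyBrowser is a hypothetical stand-in for WWW::Mechanize: it only
# keeps a stack of visited page names so the aliasing is easy to see.
package TinyBrowser;

sub new     { my ($class) = @_; return bless { history => [] }, $class }
sub get     { my ($self, $page) = @_; push @{ $self->{history} }, $page; return $self }
sub back    { my ($self) = @_; pop @{ $self->{history} }; return $self }
sub current { my ($self) = @_; return $self->{history}[-1] }

package main;

my $mech  = TinyBrowser->new;
my $mech2 = $mech;            # copies the reference, not the object

$mech->get('term page');
$mech2->get('subject page');  # lands in the same history stack

# Both names report the same current page and are the same reference.
print "mech:  ", $mech->current,  "\n";
print "mech2: ", $mech2->current, "\n";
print "same object\n" if $mech == $mech2;

$mech2->back;                 # also moves $mech back
print "after back: ", $mech->current, "\n";
```

If two independent browsers were really wanted, each would need its own `WWW::Mechanize->new()` and its own trip through the Term and Subject forms.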

In reply to some screenscraping help... by knewter
