Hi

I want to download a lot of PDFs (the daily report from the US Congress for multiple years). The search form has some nasty JavaScript in it, so I had to use WWW::Mechanize::Firefox to navigate it.

The script works fine 90% of the time, but every now and then it fails to download a document and crashes. I think the problem lies either with the MozRepl plugin for Firefox or with the Adobe Acrobat plugin, which is used whenever I click on a link to a PDF.

Oh, and I'm very new to Perl, so be gentle.

Anyway, here's my code:

#!/usr/bin/perl
use strict;
use WWW::Mechanize::Firefox;

my $id       = "DELETED";
my $password = "DELETED";

# Activate agent
my $mech = WWW::Mechanize::Firefox->new(
    activate => 1,
    bufsize  => 1000_000_000,
);

# Define set of search terms
my $year  = "1951";
my @month = qw/01 02 03 04 05 06 07 08 09 10 11 12/;
my @day   = qw/01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16
               17 18 19 20 21 22 23 24 25 26 27 28 29 30 31/;

# A true value tells reload() not to use a cached page.
# (This variable was not declared in the code as posted; 1 is assumed.)
my $bypass_cache = 1;

foreach my $month (@month) {
    foreach my $day (@day) {
        my $search = "CR-$year-$month$day";
        print "Searching for document: $search\n";

        # Go to Lexis Nexis
        $mech->get("http://web.lexis-nexis.com.ezp-prod1.hul.harvard.edu/congcomp/form/cong/s_pubbasic.html?_m=62485f04b0083ffbe44503686c0779a2&wchp=dGLzVtb-zSkSA&_md5=9885e06fb7c73a073134e39a0198b6b7");
        my $html1 = $mech->content;

        if ($html1 =~ /\bHarvard University PIN Login\b/) {
            # Login page was served instead of the search form: log in
            $mech->form_number(1);
            $mech->field("__authen_id",       $id);
            $mech->field("__authen_password", $password);
            $mech->submit();
            $mech->follow_link(n => 6);    # follows link to content
        }
        else {
            $mech->reload($bypass_cache);

            # Fill in search form
            $mech->form_number(1);
            $mech->field("thes1", $search);
            $mech->click({ xpath => '/html/body/table/tbody/tr/td[2]/div/div/form/div[2]/div/div[2]/p[2]/a' });

            # Check whether there are any results
            my $html = $mech->content;
            if ($html =~ /\bNo Documents Found\b/) {
                print "CR-$year-$month$day not found\n\n";
            }
            else {
                # If there are results, negotiate the way to the PDF file
                $mech->follow_link(n => 9);
                $mech->follow_link(n => 10);
                $mech->follow_link(n => 18, synchronize => 0);

                # Download PDF to disk
                my $file     = $mech->uri();
                my $filename = "CR$year$month$day.pdf";
                $mech->get($file, ':content_file' => $filename, synchronize => 0);
                print "CR-$year-$month$day downloaded\n\n";
                #sleep 2;
            }
        }
    }
}

These are the sorts of error messages I'm getting:

"An established connection was aborted by the software in your host machine"

"No result yet from repl at C:/strawberry/perl/site/lib/MozRepl/RemoteObject.pm line 708" (although this one doesn't always make it crash).

"Pattern match read eof at C:/strawberry/perl/site/lib/MozRepl/Client.pm line 186"

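Since the failures are intermittent, one possible workaround is to trap the error with eval and retry the step instead of letting the script die. The following is only a minimal sketch under that assumption, and it does not address whatever is going wrong inside MozRepl. The helper name with_retries and its arguments are hypothetical (they are not part of WWW::Mechanize::Firefox), and the failing action here is a stand-in for the real download step.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of WWW::Mechanize::Firefox): run a code
# reference, retrying when it dies. Returns 1 on success, 0 if every
# attempt failed.
sub with_retries {
    my ($attempts, $delay, $action) = @_;
    for my $try (1 .. $attempts) {
        # eval traps the die; the trailing 1 marks a successful run
        my $ok = eval { $action->(); 1 };
        return 1 if $ok;
        my $err = $@ || "unknown error\n";
        warn "Attempt $try of $attempts failed: $err";
        sleep $delay if $delay && $try < $attempts;
    }
    return 0;
}

# Stand-in for the real download step: dies twice, then succeeds.
my $calls = 0;
my $ok = with_retries(3, 0, sub {
    $calls++;
    die "No result yet from repl\n" if $calls < 3;
});
print $ok ? "succeeded after $calls tries\n" : "gave up\n";
```

In the script above, the action passed in could be the existing download call, e.g. sub { $mech->get($file, ':content_file' => $filename, synchronize => 0) }, with a delay of a few seconds between attempts. When with_retries returns 0, the loop could log the date and move on to the next one rather than crashing.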

In reply to WWW::Mechanize::Firefox Stability Issues when downloading many pdfs by oli_latham
