oli_latham has asked for the wisdom of the Perl Monks concerning the following question:
Hi
I want to download a lot of PDFs (the daily report from the US Congress for multiple years). The search form has some nasty JavaScript in it, so I had to use WWW::Mechanize::Firefox to navigate it.
The script works fine 90% of the time, but every now and then it fails to download a document and crashes. I think the problem lies either with the MozRepl plugin for Firefox or with the Adobe Acrobat plugin, which is used whenever I click on a link with an attached PDF.
Oh, and I'm very new to Perl, so be gentle.
Anyway, here's my code:

    #!/usr/bin/perl
    use strict;
    use WWW::Mechanize::Firefox;

    my $id       = "DELETED";
    my $password = "DELETED";

    # Activate agent
    my $mech = WWW::Mechanize::Firefox->new(
        activate => 1,
        bufsize  => 1000_000_000,
    );

    # Passed to reload() below; a true value asks Firefox to bypass its cache
    my $bypass_cache = 1;

    # Define set of search terms
    my $year  = "1951";
    my @month = qw/01 02 03 04 05 06 07 08 09 10 11 12/;
    my @day   = qw/01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16
                   17 18 19 20 21 22 23 24 25 26 27 28 29 30 31/;

    foreach my $month (@month) {
        foreach my $day (@day) {
            my $search = "CR-$year-$month$day";
            print "Searching for document: $search\n";

            # Go to Lexis Nexis
            $mech->get("http://web.lexis-nexis.com.ezp-prod1.hul.harvard.edu/congcomp/form/cong/s_pubbasic.html?_m=62485f04b0083ffbe44503686c0779a2&wchp=dGLzVtb-zSkSA&_md5=9885e06fb7c73a073134e39a0198b6b7");
            my $html1 = $mech->content;

            if ($html1 =~ /\bHarvard University PIN Login\b/) {
                # Log in when the PIN page appears
                $mech->form_number(1);
                $mech->field("__authen_id",       $id);
                $mech->field("__authen_password", $password);
                $mech->submit();
                $mech->follow_link(n => 6);    # follows link to content
            }
            else {
                $mech->reload($bypass_cache);

                # Fill in search form
                $mech->form_number(1);
                $mech->field("thes1", $search);
                $mech->click({ xpath => '/html/body/table/tbody/tr/td[2]/div/div/form/div[2]/div/div[2]/p[2]/a' });

                # Check whether there are any results
                my $html = $mech->content;
                if ($html =~ /\bNo Documents Found\b/) {
                    print "CR-$year-$month$day not found\n\n";
                }
                else {
                    # If results are found, negotiate the way to the PDF file
                    $mech->follow_link(n => 9);
                    $mech->follow_link(n => 10);
                    $mech->follow_link(n => 18, synchronize => 0);

                    # Download PDF to disk
                    my $file     = $mech->uri();
                    my $filename = "CR$year$month$day.pdf";
                    $mech->get($file, ':content_file' => $filename, synchronize => 0);
                    print "CR-$year-$month$day downloaded\n\n";
                    #sleep 2;
                }
            }
        }
    }
These are the sorts of error messages I'm getting:
"An established connection was aborted by the software in your host machine"
"No result yet from repl at C:/strawberry/perl/site/lib/MozRepl/RemoteObject.pm line 708" (although this one doesn't always make it crash).
"Pattern match read eof at C:/strawberry/perl/site/lib/MozRepl/Client.pm line 186"
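Since the failures described above are intermittent, one common way to stop a single bad download from ending the whole run is to wrap the fragile step in `eval` and retry it a few times. The following is only a minimal sketch of that idea: `with_retries` is a hypothetical helper, not part of WWW::Mechanize::Firefox or MozRepl, and the flaky step here is a stand-in that fails twice on purpose. Whether a retry actually recovers from a dropped MozRepl connection is an assumption that would need testing.

```perl
use strict;
use warnings;

# Hypothetical helper (not provided by any module used in the script):
# run a code reference, retrying up to $max_tries times if it dies.
sub with_retries {
    my ($max_tries, $delay, $action) = @_;
    my $last_error;
    for my $attempt (1 .. $max_tries) {
        my $result;
        # eval traps the die so that one failure does not end the script
        my $ok = eval { $result = $action->(); 1 };
        return $result if $ok;
        $last_error = $@ || 'unknown error';
        warn "Attempt $attempt of $max_tries failed: $last_error";
        sleep $delay if $delay && $attempt < $max_tries;
    }
    die "Giving up after $max_tries attempts: $last_error";
}

# Stand-in for a flaky download step: fails twice, then succeeds.
my $calls   = 0;
my $outcome = with_retries(3, 0, sub {
    $calls++;
    die "No result yet from repl\n" if $calls < 3;
    return "downloaded";
});
print "$outcome after $calls calls\n";
```

In the script above, the code reference would hold the navigation and `$mech->get($file, ':content_file' => $filename, synchronize => 0)` call for one date, with a non-zero delay between attempts.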
Re: WWW::Mechanize::Firefox Stability Issues when downloading many pdfs
by Anonymous Monk on Apr 17, 2012 at 02:48 UTC
by tcordes (Novice) on Apr 17, 2012 at 02:50 UTC