oli_latham has asked for the wisdom of the Perl Monks concerning the following question:

Hi

I want to download a lot of PDFs (the daily report from the US congress for multiple years). The search form had some nasty Javascript in it so I had to use WWW::Mechanize::Firefox to navigate it.

The script works fine 90% of the time, but every now and then it fails to download a document and crashes out. I think the problem lies either with the MozRepl plugin for Firefox or with the Adobe Acrobat plugin, which is used whenever I click on a link with an attached PDF.

Oh, and I'm very new to Perl, so be gentle.

Anyway here's my code:
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize::Firefox;

my $id       = "DELETED";
my $password = "DELETED";

# Activate agent
my $mech = WWW::Mechanize::Firefox->new(
    activate => 1,
    bufsize  => 1_000_000_000,
);

# Define set of search terms
my $year   = "1951";
my @months = qw/01 02 03 04 05 06 07 08 09 10 11 12/;
my @days   = qw/01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16
                17 18 19 20 21 22 23 24 25 26 27 28 29 30 31/;

foreach my $month (@months) {
    foreach my $day (@days) {
        my $search = "CR-$year-$month$day";
        print "Searching for document: $search\n";

        # Go to LexisNexis
        $mech->get("http://web.lexis-nexis.com.ezp-prod1.hul.harvard.edu/congcomp/form/cong/s_pubbasic.html?_m=62485f04b0083ffbe44503686c0779a2&wchp=dGLzVtb-zSkSA&_md5=9885e06fb7c73a073134e39a0198b6b7");
        my $html1 = $mech->content;
        if ($html1 =~ /\bHarvard University PIN Login\b/) {
            $mech->form_number(1);
            $mech->field("__authen_id",       $id);
            $mech->field("__authen_password", $password);
            $mech->submit();
            $mech->follow_link(n => 6);    # follow link to content
        }
        else {
            # Force a fresh copy of the page. The original code passed
            # $bypass_cache here, but that variable was never declared,
            # which is a compile error under strict.
            $mech->reload();

            # Fill in search form
            $mech->form_number(1);
            $mech->field("thes1", $search);
            $mech->click({ xpath => '/html/body/table/tbody/tr/td[2]/div/div/form/div[2]/div/div[2]/p[2]/a' });

            # Check whether there are any results
            my $html = $mech->content;
            if ($html =~ /\bNo Documents Found\b/) {
                print "CR-$year-$month$day not found\n\n";
            }
            else {
                # If there are results, negotiate our way to the PDF file
                $mech->follow_link(n => 9);
                $mech->follow_link(n => 10);
                $mech->follow_link(n => 18, synchronize => 0);

                # Download PDF to disk
                my $file     = $mech->uri();
                my $filename = "CR$year$month$day.pdf";
                $mech->get($file, ':content_file' => $filename, synchronize => 0);
                print "CR-$year-$month$day downloaded\n\n";
                # sleep 2;
            }
        }
    }
}
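Since the failures are intermittent, one workaround is to wrap each fragile Mechanize call in a small retry helper rather than letting a single flaky MozRepl call kill the whole run. This is a minimal sketch; the `with_retries` name and its parameters are my own invention, not part of WWW::Mechanize::Firefox:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: run a piece of work, retrying a few times with a
# pause between attempts before giving up. Each attempt is wrapped in
# eval so a die from deep inside MozRepl is caught instead of fatal.
sub with_retries {
    my ($tries, $delay, $work) = @_;
    for my $attempt (1 .. $tries) {
        my $result = eval { $work->() };
        return $result unless $@;
        warn "Attempt $attempt failed: $@";
        sleep $delay if $delay && $attempt < $tries;
    }
    die "Giving up after $tries attempts\n";
}

# Each fragile step in the main loop would then become, e.g.:
# with_retries(3, 5, sub { $mech->follow_link(n => 9) });
```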

The sort of error messages I'm getting:

"An established connection was aborted by the software in your host machine"

"No result yet from repl at C:/strawberry/perl/site/lib/MozRepl/RemoteObject.pm line 708" (although this doesn't always make it crash).

"Pattern match read eof at C:/strawberry/perl/site/lib/MozRepl/Client.pm line 186"


Re: WWW::Mechanize::Firefox Stability Issues when downloading many pdfs
by Anonymous Monk on Apr 17, 2012 at 02:48 UTC

    I see this sometimes too. It's definitely a bug somewhere, and a strange one, since it's non-deterministic: you can set your program to crawl exactly the same pages and run it several times as a test, and the point at which the bug appears will vary. It won't crash out on the same page each time, but usually within 0-5 gets of one particular page.

    One method that often fixes it for me is to completely quit Firefox (checking that no leftover firefox processes remain) and restart it.

    Sometimes even that won't fix it, though. I find that having your program go slower (inserting lots of 5-10 second sleeps) helps with the problem. In my case my scrapers remember "seen" pages, so once I make it past the problematic area the bug usually disappears again for a while.

      Oh, I should also mention, my programs have nothing to do with PDFs so I don't think it's related to the type of content you are getting. I'm just fetching normal HTML pages with a mix of text and images.
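    The two suggestions above (throttling and remembering "seen" pages so a restarted run resumes past the problematic area) can be sketched roughly like this. The `fetch_page` sub and the `crawl` wrapper are stand-ins for illustration, not real WWW::Mechanize::Firefox API; the delay is a parameter so you can tune (or disable) it:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %seen;    # URLs already fetched this run

# Stand-in for your real per-page Mechanize work
sub fetch_page {
    my ($url) = @_;
    print "fetching $url\n";    # real code would call $mech->get($url) etc.
}

# Visit each URL once, pausing $delay seconds between fetches;
# returns how many distinct pages were fetched.
sub crawl {
    my ($delay, @urls) = @_;
    for my $url (@urls) {
        next if $seen{$url};    # skip pages we already have
        fetch_page($url);
        $seen{$url} = 1;
        sleep $delay if $delay; # go slower between gets (e.g. 5-10s)
    }
    return scalar keys %seen;
}
```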