I want to download a lot of PDFs (the daily report from the US Congress for multiple years). The search form has some nasty JavaScript in it, so I had to use WWW::Mechanize::Firefox to navigate it.

The script works fine 90% of the time, but every now and then it fails to download a document and crashes. I think the problem lies either with the MozRepl plugin for Firefox or with the Adobe Acrobat plugin, which is used whenever I click on a link to a PDF.

Oh, and I'm very new to Perl, so be gentle.

Anyway, here's my code:

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize::Firefox;

my $id="DELETED";
my $password="DELETED";

#Activate Agent

my $mech = WWW::Mechanize::Firefox->new(
    activate => 1,
    bufsize => 1_000_000_000,
);

#Define set of search terms
my $year="1951";
my @month=qw/01 02 03 04 05 06 07 08 09 10 11 12/;
my @day=qw/01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31/;
my $month;
foreach $month (@month) {
    my $day;
    foreach $day (@day) {
        my $search="CR-$year-$month$day";
        print "Searching for document: $search\n";
        #Go To Lexis Nexis
        $mech->get("http://web.lexis-nexis.com.ezp-prod1.hul.harvard.edu/congcomp/form/cong/s_pubbasic.html?_m=62485f04b0083ffbe44503686c0779a2&wchp=dGLzVtb-zSkSA&_md5=9885e06fb7c73a073134e39a0198b6b7");
        my $html1=$mech->content;
        if ($html1=~/\bHarvard University PIN Login\b/) {
            $mech->form_number(1);
            $mech->field("__authen_id" ,$id);
            $mech->field("__authen_password" ,$password);
            $mech->submit();
            $mech->follow_link(n=>6); #follows link to content
        } else {
            $mech->reload(1); #a true argument asks the browser to bypass its cache
            #Fill in Search Form
            $mech->form_number(1);
            $mech->field("thes1",$search);
            $mech->click({xpath=>'/html/body/table/tbody/tr/td[2]/div/div/form/div[2]/div/div[2]/p[2]/a'});

            #Check Whether any Results
            my $html=$mech->content;
            if ($html=~/\bNo Documents Found\b/) {
                print "CR-$year-$month$day not found\n\n";
            } else {
                #If find results negotiate way to PDF file
                $mech->follow_link(n=>9);
                $mech->follow_link(n=>10);
                $mech->follow_link(n=>18,synchronize=>0);
                #Download PDF To Disk
                my $file=$mech->uri();
                my $filename="CR$year$month$day.pdf";
                $mech->get($file,':content_file'=>$filename, synchronize=>0);
                print "CR-$year-$month$day downloaded\n\n";
                #sleep 2;
            }
        }
    }
}
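
The nested loops above also try dates that do not exist, such as 31 February, and each of those costs a wasted search. A minimal sketch of building only real calendar dates in the same CR-YYYY-MMDD form follows. The helper names days_in_month and search_terms_for_year are hypothetical, and the sketch assumes the Gregorian leap-year rule and nothing about which days actually have a report.

```perl
use strict;
use warnings;

# Hypothetical helper: number of days in a given month (1..12) of a year.
sub days_in_month {
    my ($year, $month) = @_;
    my @days = (31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31);
    my $leap = ($year % 4 == 0 && $year % 100 != 0) || $year % 400 == 0;
    return $month == 2 && $leap ? 29 : $days[$month - 1];
}

# Hypothetical helper: one search term per real calendar day of $year.
sub search_terms_for_year {
    my ($year) = @_;
    my @terms;
    for my $month (1 .. 12) {
        for my $day (1 .. days_in_month($year, $month)) {
            push @terms, sprintf 'CR-%04d-%02d%02d', $year, $month, $day;
        }
    }
    return @terms;
}

my @terms = search_terms_for_year(1951);
print scalar(@terms), " search terms, from $terms[0] to $terms[-1]\n";
```

Each term could then be fed to the search form in place of the "CR-$year-$month$day" string built inside the loops.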

The sort of error messages I'm getting:

"An established connection was aborted by the software in your host machine"

"No result yet from repl at C:/strawberry/perl/site/lib/MozRepl/RemoteObject.pm line 708" (although this doesn't always make it crash).

"Pattern match read eof at C:/strawberry/perl/site/lib/MozRepl/Client.pm line 186"

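Since these errors arrive as fatal exceptions, one way to stop a single bad document from ending the whole run might be to trap each attempt with eval and retry it a few times. The following is only a sketch under stated assumptions: with_retries is a hypothetical helper, not part of WWW::Mechanize::Firefox or MozRepl, and the anonymous sub is a stand-in for the real navigation and download steps. It does not address whatever causes the MozRepl errors in the first place.

```perl
use strict;
use warnings;

# Hypothetical helper: call $action up to $max_tries times, trapping any
# exception with eval. Returns 1 on the first success, 0 if every try died.
sub with_retries {
    my ($max_tries, $delay, $action) = @_;
    for my $attempt (1 .. $max_tries) {
        my $ok = eval { $action->(); 1 };
        return 1 if $ok;
        my $err = $@ || "unknown error\n";
        warn "Attempt $attempt of $max_tries failed: $err";
        sleep $delay if $delay && $attempt < $max_tries;
    }
    return 0;
}

# Stand-in for the download step: dies twice, then succeeds.
my $calls = 0;
my $done  = with_retries(3, 0, sub {
    $calls++;
    die "No result yet from repl\n" if $calls < 3;
});
print $done ? "succeeded after $calls attempts\n" : "gave up\n";
```

In the real script the anonymous sub would hold the follow_link and get calls for one document, and a return value of 0 could be logged so the missing file can be fetched in a later pass.
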
In reply to WWW::Mechanize::Firefox Stability Issues when downloading many pdfs by oli_latham
