nuts has asked for the wisdom of the Perl Monks concerning the following question:

The following script works very well for grabbing web pages in bulk. It is run from a command prompt: you hand it a file containing a list of URLs and an output directory, and it downloads all the pages and saves them individually as numbered files.

I want to modify it to send 'referer' info for the page the link originally came from. For example, many database websites require that you be referred by the correct page; otherwise you are redirected to a generic error page, so you cannot simply paste the link into a new browser window.

I believe it is possible to insert referer info with Perl, but I am not sure how.

Thanks in advance
Mike
#!/usr/bin/perl
require LWP::UserAgent;
require HTTP::Request;
require HTTP::Response;
use HTTP::Request::Common;

foreach (@ARGV) {
    if ( $_ eq $ARGV[0] ) {
        $inputfile = $_;
    }
    elsif ( $_ eq $ARGV[1] ) {
        $outdir = $ARGV[1];
    }
    else {
        die "Usage: $0 inputfile outdir\n";
    }
}

print "Welcome\n";
print "Opening inputfile... ";
open (LINKFILE,"$inputfile") or die "Couldn't open the inputfile, $!";
@links = <LINKFILE>;
close(LINKFILE);
print "Success!\n";

# unless (-e $outdir){
#     print "Directory doesn't exist... Creating\n";
#     mkdir "$outdir", 755 or die "Couldn't make directory, $!";
# }
if(!opendir (OUTDIR, "$outdir")){
    mkdir "$outdir",0755;    # the mode has to be octal
    print "Output directory created!\n";
}
else{print "Output directory exists!\n";}

print "Changing directory... ";
chdir "$outdir" or die "Couldn't change directory, $!";
print "Success!\n";

# Check to see if we hung up last time
# this doesn't resume, just warns you that it stopped somewhere
# in earlier versions of the program i had problems with the
# program hanging, but I don't know why.
if (-e "spiderlog.txt"){
    open (LOG,"spiderlog.txt");
    @spiderlog = reverse <LOG>;
    close(LOG);
    # chomp returns the number of characters removed, not the string,
    # so assign first and chomp the copy
    chomp($lastline = $spiderlog[0]);
    if ($lastline ne "Done"){
        print "Spider not finished... Last line in log says: $lastline\n";
    }
}

$filenum = 1;
$ua = new LWP::UserAgent;
$ua->agent('ChrisBot/1.0');
print "Start spidering process...\n\n";
$total = @links;
$start = time();
open (LOG,">>spiderlog.txt");
print LOG "Started at: $start\n\n";

foreach $line (@links){
    print "Getting $line";
    $response = $ua->request(GET $line);
    if ($response->is_success) {
        $content = $response->content;
        if ($filenum =~ /\d\d\d\d/) {$filenum = $filenum; }
        elsif ($filenum =~ /\d\d\d/) {$filenum = "0$filenum"; }
        elsif ($filenum =~ /\d\d/) {$filenum = "00$filenum"; }
        else {$filenum = "000$filenum"; }
        open (NEWPAGE,">$filenum.html");
        print NEWPAGE $response->content;
        close (NEWPAGE);
        print "$filenum.html generated\n\n";
        print LOG "$filenum - $line";
        $filenum++;
    }
    else {
        print $response->error_as_HTML;
    }
}

$end = time();
$parse = $end - $start;
$parse = 1 unless($parse);
$lps = int($total/$parse);
print "$total lines in $parse seconds ($lps lines/sec)\n";
print LOG "$total lines in $parse seconds\nFinished at $end\nDone\n";
close (LOG);

print "clumping files... \n";
system "cat *.html > masterfile.htm";
print "Done!\n";
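
For the Referer question itself, here is a minimal sketch, not a tested change to the script. It relies on HTTP::Request::Common's GET accepting header name/value pairs after the URL. Both URLs are placeholders, and how each link gets paired with its referring page (for example an extra column in the input file) is an assumption that the script above does not cover.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Request::Common qw(GET);

# Placeholder URLs for illustration only (assumptions, not from the script).
my $link    = 'http://www.example.com/db/record?id=42';
my $referer = 'http://www.example.com/db/search';

# GET takes the URL followed by header name/value pairs.
my $request = GET $link, Referer => $referer;

# Inspect the request without sending anything over the network.
print $request->as_string;

# In the spider, the request would then be sent as before:
#   $response = $ua->request($request);
```

If every link shares the same referring page, the header value can simply be a constant; otherwise each line of the input file would need to carry its own referer.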

Replies are listed 'Best First'.
Re: Adding 'referer' info to spider script
by valdez (Monsignor) on Aug 08, 2003 at 17:38 UTC
Re: Adding 'referer' info to spider script
by swiftone (Curate) on Aug 08, 2003 at 18:27 UTC
    Here are a few comments on slimming down your code while still keeping it readable (or even improving the readability). Of course, this is all My Not So Humble Opinion, so take it with a grain of salt. Feel free to see this as a vast exercise in Hubris on my part.

    #!/usr/bin/perl
    require LWP::UserAgent;
    require HTTP::Request;
    require HTTP::Response;
    use HTTP::Request::Common;

    First, I'd recommend running perl with the -w (warn) option and adding "use strict;". These can save you hours of debugging, and they encourage good programming habits. At first they may seem a pain, but with a little practice they add no noticeable effort, and you tend to do things the "Right Way" by default. I'd also "use" all those modules rather than "require"ing them. This imports what the module author intended, and if I disagree, I can override the author's defaults. See use for details.
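
    As a small illustration of the difference, here is a sketch that uses the core modules List::Util and File::Basename instead of the LWP modules, so it runs without anything extra installed. The values are made up for the example.

    ```perl
    #!/usr/bin/perl -w
    use strict;

    # "use" loads the module at compile time and imports the names asked for.
    use List::Util qw(max);

    # "require" only loads the module at run time; nothing is imported,
    # so the function has to be called by its full package name.
    require File::Basename;

    my @sizes = (3, 17, 5);
    print "imported:  ", max(@sizes), "\n";
    print "qualified: ", File::Basename::basename('/tmp/0001.html'), "\n";
    ```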
    foreach (@ARGV) {
        if ( $_ eq $ARGV[0] ) {
            $inputfile = $_;
        }
        elsif ( $_ eq $ARGV[1] ) {
            $outdir = $ARGV[1];
        }
        else {
            die "Usage: $0 inputfile outdir\n";
        }
    }

    This is an unusual way of going about it. You copy the first two arguments, and die if there are more. I prefer the more succinct:

    die "Usage: $0 inputfile outdir\n" unless scalar @ARGV == 2;
    #I prefer "scalar @LIST", some prefer $#LIST,
    #but remember the difference
    my ($inputfile, $outdir) = @ARGV;

    This has the advantage of working as intended (well, dying as intended) if only one argument is given.
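
    To see the check reject both too few and too many arguments, here is a sketch that wraps it in a helper so it can be exercised without touching the real @ARGV. The name parse_args and the sample file names are made up for the example.

    ```perl
    #!/usr/bin/perl -w
    use strict;

    # Hypothetical helper: returns (inputfile, outdir) or dies with the usage message.
    sub parse_args {
        my @args = @_;
        # scalar @args is the element count; $#args is the last index (count - 1).
        die "Usage: $0 inputfile outdir\n" unless scalar @args == 2;
        return @args;
    }

    my ($inputfile, $outdir) = parse_args('links.txt', 'pages');
    print "input=$inputfile outdir=$outdir\n";

    # A single argument now dies instead of leaving $outdir undefined.
    my $one_ok = eval { parse_args('links.txt'); 1 };
    print "one argument rejected\n" unless $one_ok;

    # Three arguments are rejected as well.
    my $three_ok = eval { parse_args('links.txt', 'pages', 'extra'); 1 };
    print "three arguments rejected\n" unless $three_ok;
    ```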

    Just one more:

    if ($filenum =~ /\d\d\d\d/) {$filenum = $filenum; }
    elsif ($filenum =~ /\d\d\d/) {$filenum = "0$filenum"; }
    elsif ($filenum =~ /\d\d/) {$filenum = "00$filenum"; }
    else {$filenum = "000$filenum"; }

    How about:
    $filenum = sprintf("%04d", $filenum);
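
    A quick sketch of what that format does for a few sample counters (the numbers are picked just for the example). %04d pads with leading zeros up to four digits and leaves longer numbers alone:

    ```perl
    #!/usr/bin/perl -w
    use strict;

    # %04d pads with leading zeros to at least four digits.
    my @names = map { sprintf("%04d.html", $_) } (1, 42, 999, 12345);
    print "$_\n" for @names;
    ```

    Formatting the name into its own variable, rather than back into $filenum, would also keep the counter a plain number for the ++ at the end of the loop.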