The following script works very well for grabbing web pages in bulk. It is run from a command prompt: you hand it a list of URLs and an output directory, and it downloads all the pages and saves them individually as numbered files.

I want to modify it to send 'referer' info (the HTTP Referer header) for each link. Many database websites require that you arrive from the correct referring page; otherwise you are redirected to a generic error page, so you cannot simply paste the link into a new browser window.

I believe it is possible to insert referer info with Perl, but I am not sure how.

Thanks in advance
Mike
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common qw(GET);

# Expect exactly two arguments: the file of URLs and the output directory.
die "Usage: $0 inputfile outdir\n" unless @ARGV == 2;
my ($inputfile, $outdir) = @ARGV;

print "Welcome\n";
print "Opening inputfile... ";
open(my $linkfile, '<', $inputfile) or die "Couldn't open the inputfile, $!";
my @links = <$linkfile>;
close($linkfile);
print "Success!\n";

# Create the output directory if it does not exist yet.
if (-d $outdir) {
    print "Output directory exists!\n";
}
else {
    mkdir($outdir, 0755) or die "Couldn't make directory, $!";
    print "Output directory created!\n";
}

print "Changing directory... ";
chdir($outdir) or die "Couldn't change directory, $!";
print "Success!\n";

# Check to see if we hung up last time.
# This doesn't resume, it just warns you that it stopped somewhere.
# In earlier versions of the program I had problems with the
# program hanging, but I don't know why.
if (-e "spiderlog.txt") {
    open(my $oldlog, '<', "spiderlog.txt") or die "Couldn't read spiderlog.txt, $!";
    my @spiderlog = reverse <$oldlog>;
    close($oldlog);
    my $lastline = @spiderlog ? $spiderlog[0] : '';
    chomp($lastline);    # chomp returns a count, so chomp the copy in place
    if ($lastline ne "Done") {
        print "Spider not finished...\nLast line in log says: $lastline\n";
    }
}

my $filenum = 1;
my $ua = LWP::UserAgent->new;
$ua->agent('ChrisBot/1.0');

print "Start spidering process...\n\n";
my $total = @links;
my $start = time();
open(my $log, '>>', "spiderlog.txt") or die "Couldn't open spiderlog.txt, $!";
print $log "Started at: $start\n\n";

foreach my $line (@links) {
    print "Getting $line";
    (my $url = $line) =~ s/\s+$//;    # strip the trailing newline from the URL
    my $response = $ua->request(GET $url);
    if ($response->is_success) {
        my $name = sprintf("%04d", $filenum);    # zero-pad to four digits
        open(my $newpage, '>', "$name.html") or die "Couldn't write $name.html, $!";
        print $newpage $response->content;
        close($newpage);
        print "$name.html generated\n\n";
        print $log "$name - $line";
        $filenum++;
    }
    else {
        print $response->error_as_HTML;
    }
}

my $end   = time();
my $parse = ($end - $start) || 1;    # avoid dividing by zero
my $lps   = int($total / $parse);
print "$total lines in $parse seconds ($lps lines/sec)\n";
print $log "$total lines in $parse seconds\nFinished at $end\nDone\n";
close($log);

print "clumping files... \n";
system "cat *.html > masterfile.htm";
print "Done!\n";
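
A minimal sketch of one way the Referer could be sent. The GET function from HTTP::Request::Common accepts header name/value pairs after the URL, so a Referer can be passed along with each request. The input format is an assumption here: the script above only reads URLs, so this sketch supposes each line of the input file may carry the referring page after a tab ("url<TAB>referer"). The helper name build_request and the example.com URLs are placeholders for illustration, not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Request::Common qw(GET);

# Hypothetical helper: build a GET request from one line of the input file.
# Assumed line format (not in the original script): "url<TAB>referer",
# where the referer part is optional.
sub build_request {
    my ($line) = @_;
    $line =~ s/\s+$//;                           # drop the trailing newline
    my ($url, $referer) = split /\t/, $line, 2;

    # GET takes header name/value pairs after the URL.
    return defined $referer && length $referer
        ? GET($url, Referer => $referer)
        : GET($url);
}

# Example with placeholder URLs; nothing is sent over the network here.
my $request = build_request(
    "http://www.example.com/record?id=42\thttp://www.example.com/search\n"
);
print $request->as_string;
```

With a helper like this, the script's `$ua->request(GET ...)` call inside the loop would become `$ua->request(build_request($line))`. If every link shares the same referring page, setting it once on the user agent with `$ua->default_header(Referer => ...)` may be simpler than changing the input file.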

In reply to Adding 'referer' info to spider script by nuts
