cheech has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

The code below scrapes over 100 years' worth of daily temperature data (permission has been given to access the source website and collect this volume of data). The script takes around 8 hours to complete. How can I improve its speed and minimize the total run time?

Thanks,

Dave

#!/usr/bin/perl
# Scrape daily temperature data from the PSU weather station page
use strict;                     # ensures proper variable declarations, etc.
use warnings;                   # allows warnings to be issued
use Time::HiRes qw(sleep);      # fractional sleep; the built-in sleep would truncate 0.1 to 0

#print "Create a new directory: ";
#my $dir = <STDIN>;
#chomp ($dir);
#mkdir "$dir";

# Open files for writing
open DATA,  ">tdata.txt"        or die "open: $!\n";
open STATS, ">yearly_stats.txt" or die "open: $!\n";
print "Scraped data will be written to file 'tdata.txt'\n";

# Create date list
use Date::Simple::D8 (':all');
my $today = Date::Simple::D8->today();
my $start = Date::Simple::D8->new('18960101');
my $end   = Date::Simple::D8->new("$today");
my @dates;
while ( $start < $end ) {
    push @dates, $start;
    $start = $start->next;
}

# Initiate browsing agent
print "Initiating the browsing agent...\n";
use WWW::Mechanize;
my $url  = "http://bub2.meteo.psu.edu/wxstn/wxstn.htm";
my $mech = WWW::Mechanize->new( keep_alive => 1 );
print "Accessing URL...\n";
$mech->get($url);
print "Collecting data...\n";

# Start the scraping
while (@dates) {
    $mech->submit_form(
        form_number => 1,
        fields      => { dtg => $dates[0] },
    );

    # Download the resulting page, text only, and scrape for data
    my $page = $mech->content( format => 'text' );

    # Daily max, min, average
    my @data = ( $page =~ /Temperature\s+:\s+(\d\d)/g );

    # Daily 30-year max normal
    my ($thirtyyrhi) = $page =~ /30-Year Average High Temperature\s+:\s+(\S*)/;
    if ( $thirtyyrhi eq '(N/A)' ) { $thirtyyrhi = "99.99"; }

    # Daily 30-year min normal
    my ($thirtyyrlo) = $page =~ /30-Year Average Low Temperature\s+:\s+(\S*)/;
    if ( $thirtyyrlo eq '(N/A)' ) { $thirtyyrlo = "99.99"; }

    # Assemble the output record
    my $hlahdd = "$dates[0] $data[0] $data[1] $data[2] $thirtyyrhi $thirtyyrlo\n";

    # Print the record to screen and to file
    print "$hlahdd";
    print DATA "$hlahdd";

    # Pause... then go back a page
    sleep .1;
    $mech->back();

    # Remove the date just used
    shift @dates;
}    # Exit the scraping loop

# Close the written files
close DATA;
close STATS;
print "Success!\n";

Replies are listed 'Best First'.
Re: Need to Improve Scraping Speed
by moritz (Cardinal) on Dec 03, 2009 at 17:30 UTC
    There are many possibilities, but the simplest are: don't fetch old data (which probably doesn't change), or ask the site owner to provide an archive which can easily be downloaded.

    If neither of those is possible, you could send multiple requests in parallel. But that should really be a last resort.
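
    For what it's worth, here is a minimal sketch of parallel fetching with Parallel::ForkManager (the module choice, the throttle of four workers, and the per-date output files are assumptions for illustration, not part of the original script):

        use strict;
        use warnings;
        use WWW::Mechanize;
        use Parallel::ForkManager;   # from CPAN

        my $url   = 'http://bub2.meteo.psu.edu/wxstn/wxstn.htm';
        my @dates = @ARGV;           # D8 dates to fetch, e.g. 18960101 18960102 ...

        # Throttle to a handful of workers so the server is not hammered.
        my $pm = Parallel::ForkManager->new(4);

        for my $date (@dates) {
            $pm->start and next;     # fork; the parent moves on to the next date

            # Each child needs its own Mechanize object and its own output file,
            # because forked processes cannot safely share them.
            my $mech = WWW::Mechanize->new();
            $mech->get($url);
            $mech->submit_form( form_number => 1, fields => { dtg => $date } );

            open my $out, '>', "tdata.$date.txt" or die "open: $!\n";
            print {$out} $mech->content( format => 'text' );   # parse as in the main script
            close $out;

            $pm->finish;             # child exits
        }
        $pm->wait_all_children;
        # Concatenate the per-date files into tdata.txt afterwards.

    Forking keeps each request in its own process, so nothing is shared between workers; the cost is several simultaneous hits on the server, which is exactly why it should be a last resort.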

      you could send multiple requests in parallel. But that should really be a last resort.

      Can you point me to some info on parallel requests and why it should be a last resort please?

      Thanks

Re: Need to Improve Scraping Speed
by gmargo (Hermit) on Dec 03, 2009 at 18:18 UTC

    If you know it takes 8 hours, that means you're done. Who cares now how long it took? You saved the data, didn't you?

    Now stick it all in a database and periodically fetch only more recent data.
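
    A rough sketch of that approach with DBI and DBD::SQLite (the database file, table, and column names are made up for illustration):

        use strict;
        use warnings;
        use DBI;   # with DBD::SQLite installed

        my $dbh = DBI->connect( 'dbi:SQLite:dbname=tdata.db', '', '',
                                { RaiseError => 1 } );

        $dbh->do(q{
            CREATE TABLE IF NOT EXISTS daily_temps (
                date      TEXT PRIMARY KEY,   -- YYYYMMDD
                tmax      REAL,
                tmin      REAL,
                tavg      REAL,
                normal_hi REAL,
                normal_lo REAL
            )
        });

        # Only dates newer than the last one stored need to be scraped.
        my ($last) = $dbh->selectrow_array('SELECT MAX(date) FROM daily_temps');
        $last = '18960101' unless defined $last;

        # Inside the scraping loop, insert instead of printing to tdata.txt:
        my $ins = $dbh->prepare(
            'INSERT OR REPLACE INTO daily_temps VALUES (?,?,?,?,?,?)');
        # $ins->execute( $dates[0], @data[0..2], $thirtyyrhi, $thirtyyrlo );

    With the data in the database, a later run starts from MAX(date) instead of 18960101.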

      The code will be bundled with other scripts and sent to other machines to be run from scratch. I'm looking for coding improvements that might speed it up a bit.

      Thanks

        Probably 99% or more of that 8 hours is spent waiting on the server. You can't speed it up by fiddling with the client. Of course you could parallelize some fetches - but that is expensive for your good-will information provider.

        Probably your best bet would be to package up the already-downloaded text file (compress the heck out of it) and ship that off with your code. Then a bare machine will load that file first before updating from the server.
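
        A sketch of that bootstrap step, assuming the bundle is a gzipped tdata.txt whose lines start with the D8 date (the file name and layout are assumptions):

            use strict;
            use warnings;
            use IO::Uncompress::Gunzip qw(gunzip $GunzipError);   # core module
            use Date::Simple::D8;

            # Unpack the bundled historical data first...
            gunzip 'tdata.txt.gz' => 'tdata.txt'
                or die "gunzip failed: $GunzipError\n";

            # ...then find the last date already on disk, so the scraper only
            # has to fetch the days after the bundle was built.
            my $last = '18960101';
            open my $in, '<', 'tdata.txt' or die "open: $!\n";
            while (<$in>) {
                $last = $1 if /^(\d{8})/ and $1 gt $last;
            }
            close $in;

            my $start = Date::Simple::D8->new($last)->next;   # resume scraping here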

        I already gave you that answer after I gave you a program to scrape the data once and only once.

        The solution is to scrape the data once and only once, repackage it, compress it, and host it as a few compressed files on an ftp server.

        CGI is the wrong way to distribute this data, and you shouldn't distribute this scraper program; it's like distributing a denial-of-service tool.

Re: Need to Improve Scraping Speed
by Corion (Patriarch) on Dec 03, 2009 at 17:24 UTC

    You say you've been given permission to scrape the data, but I still wonder how often you plan to scrape the 100 years' worth of temperature data.

      Just once in my program, but I don't know how many other people will use it.