cheech has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

The code below scrapes over 100 years' worth of daily temperature data (permission has been given to access the source website and collect this volume of data). The script takes around 8 hours to complete. How can I improve its speed and minimize the total run time?

Thanks,

Dave

#!/usr/bin/perl
# Scrape daily temperature data from the PSU weather station page
use strict;                     # ensures proper variable declarations, etc.
use warnings;                   # allows warnings to be issued
use Time::HiRes qw(sleep);      # fractional sleep; the built-in sleep would truncate 0.1 to 0

#print "Create a new directory: ";
#my $dir = <STDIN>;
#chomp ($dir);
#mkdir "$dir";

# Open files for writing
open DATA,  ">tdata.txt"        or die "open: $!\n";
open STATS, ">yearly_stats.txt" or die "open: $!\n";
print "Scraped data will be written to file 'tdata.txt'\n";

# Create date list
use Date::Simple::D8 (':all');
my $today = Date::Simple::D8->today();
my $start = Date::Simple::D8->new('18960101');
my $end   = Date::Simple::D8->new("$today");
my @dates;
while ( $start < $end ) {
    push @dates, $start;
    $start = $start->next;
}

# Initiate browsing agent
print "Initiating the browsing agent...\n";
use WWW::Mechanize;
my $url  = "http://bub2.meteo.psu.edu/wxstn/wxstn.htm";
my $mech = WWW::Mechanize->new( keep_alive => 1 );
print "Accessing URL...\n";
$mech->get($url);
print "Collecting data...\n";

# Start the scraping
while (@dates) {
    $mech->submit_form(
        form_number => 1,
        fields      => { dtg => $dates[0] },
    );

    # Download the resulting page, text only, and scrape for data
    my $page = $mech->content( format => 'text' );

    # Daily max, min, average
    my @data = ( $page =~ /Temperature\s+:\s+(\d\d)/g );

    # Daily 30-year max normal
    my ($thirtyyrhi) = $page =~ /30-Year Average High Temperature\s+:\s+(\S*)/;
    if ( $thirtyyrhi eq '(N/A)' ) { $thirtyyrhi = "99.99"; }

    # Daily 30-year min normal
    my ($thirtyyrlo) = $page =~ /30-Year Average Low Temperature\s+:\s+(\S*)/;
    if ( $thirtyyrlo eq '(N/A)' ) { $thirtyyrlo = "99.99"; }

    # Assemble the output record
    my $hlahdd = "$dates[0] $data[0] $data[1] $data[2] $thirtyyrhi $thirtyyrlo\n";

    # Print the record to screen and to file
    print "$hlahdd";
    print DATA "$hlahdd";

    # Pause... then go back a page
    sleep .1;
    $mech->back();

    # Remove the date just used
    shift @dates;
}    # Exit the scraping loop

# Close the written files
close DATA;
close STATS;
print "Success!\n";

Replies are listed 'Best First'.
Re: Need to Improve Scraping Speed
by moritz (Cardinal) on Dec 03, 2009 at 17:30 UTC
    There are many possibilities, but the simplest are: don't fetch old data (which probably doesn't change), or ask the site owner to provide an archive which can easily be downloaded.

    If neither of those is possible, you could send multiple requests in parallel. But that should really be a last resort.
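
    For what it's worth, here is a minimal sketch of parallel fetching with Parallel::ForkManager (the module choice, the throttle of four workers, and the per-date output files are assumptions for illustration, not part of the original script):

        use strict;
        use warnings;
        use WWW::Mechanize;
        use Parallel::ForkManager;   # from CPAN

        my $url   = 'http://bub2.meteo.psu.edu/wxstn/wxstn.htm';
        my @dates = @ARGV;           # D8 dates to fetch, e.g. 18960101 18960102 ...

        # Throttle to a handful of workers so the server is not hammered.
        my $pm = Parallel::ForkManager->new(4);

        for my $date (@dates) {
            $pm->start and next;     # fork; the parent moves on to the next date

            # Each child needs its own Mechanize object and its own output file,
            # because forked processes cannot safely share them.
            my $mech = WWW::Mechanize->new();
            $mech->get($url);
            $mech->submit_form( form_number => 1, fields => { dtg => $date } );

            open my $out, '>', "tdata.$date.txt" or die "open: $!\n";
            print {$out} $mech->content( format => 'text' );   # parse as in the main script
            close $out;

            $pm->finish;             # child exits
        }
        $pm->wait_all_children;
        # Concatenate the per-date files into tdata.txt afterwards.

    Forking keeps each request in its own process, so nothing is shared between workers; the cost is several simultaneous hits on the server, which is exactly why it should be a last resort.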

      you could send multiple requests in parallel. But that should really be a last resort.

      Can you point me to some info on parallel requests and why it should be a last resort please?

      Thanks

Re: Need to Improve Scraping Speed
by gmargo (Hermit) on Dec 03, 2009 at 18:18 UTC

    If you know it takes 8 hours, that means you're done. Who cares now how long it took? You saved the data, didn't you?

    Now stick it all in a database and periodically fetch only more recent data.
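
    A rough sketch of that approach with DBI and DBD::SQLite (the database file, table, and column names are made up for illustration):

        use strict;
        use warnings;
        use DBI;   # with DBD::SQLite installed

        my $dbh = DBI->connect( 'dbi:SQLite:dbname=tdata.db', '', '',
                                { RaiseError => 1 } );

        $dbh->do(q{
            CREATE TABLE IF NOT EXISTS daily_temps (
                date      TEXT PRIMARY KEY,   -- YYYYMMDD
                tmax      REAL,
                tmin      REAL,
                tavg      REAL,
                normal_hi REAL,
                normal_lo REAL
            )
        });

        # Only dates newer than the last one stored need to be scraped.
        my ($last) = $dbh->selectrow_array('SELECT MAX(date) FROM daily_temps');
        $last = '18960101' unless defined $last;

        # Inside the scraping loop, insert instead of printing to tdata.txt:
        my $ins = $dbh->prepare(
            'INSERT OR REPLACE INTO daily_temps VALUES (?,?,?,?,?,?)');
        # $ins->execute( $dates[0], @data[0..2], $thirtyyrhi, $thirtyyrlo );

    With the data in the database, a later run starts from MAX(date) instead of 18960101.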

      The code will be bundled with other scripts and sent to other machines to be run from scratch. I'm looking for coding improvements that might speed it up a bit.

      Thanks

        Probably 99% or more of that 8 hours is spent waiting on the server. You can't speed it up by fiddling with the client. Of course you could parallelize some fetches - but that is expensive for your good-will information provider.

        Probably your best bet would be to package up the already-downloaded text file (compress the heck out of it) and ship that off with your code. Then a bare machine will load that file first before updating from the server.
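
        A sketch of that bootstrap step, assuming the bundle is a gzipped tdata.txt whose lines start with the D8 date (the file name and layout are assumptions):

            use strict;
            use warnings;
            use IO::Uncompress::Gunzip qw(gunzip $GunzipError);   # core module
            use Date::Simple::D8;

            # Unpack the bundled historical data first...
            gunzip 'tdata.txt.gz' => 'tdata.txt'
                or die "gunzip failed: $GunzipError\n";

            # ...then find the last date already on disk, so the scraper only
            # has to fetch the days after the bundle was built.
            my $last = '18960101';
            open my $in, '<', 'tdata.txt' or die "open: $!\n";
            while (<$in>) {
                $last = $1 if /^(\d{8})/ and $1 gt $last;
            }
            close $in;

            my $start = Date::Simple::D8->new($last)->next;   # resume scraping here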

        I already gave you that answer after I gave you a program to scrape the data once and only once.

        The solution is to scrape the data once and only once, repackage it, compress it, and host it as a few compressed files on an ftp server.

        CGI is the wrong way to distribute this data, and you shouldn't distribute this scraper program; it's like distributing a denial-of-service tool.

Re: Need to Improve Scraping Speed
by Corion (Patriarch) on Dec 03, 2009 at 17:24 UTC

    You say you've been given permission to scrape the data, but I still wonder how often you plan to scrape the 100 years' worth of temperature data.

      Just once in my program, but I don't know how many other people will use it.