cheech has asked for the wisdom of the Perl Monks concerning the following question:

The code below submits a form which returns a page of weather data. I need the pages for every date available up to the current date, and the first available date is January 1 of 1896. So I loop over each date, submit the form, write the results page to a file, go back a page, and do it again for the next date.

However, for some reason I keep getting this error once I get to the date 18961110 (November 10 of 1896):

Error GETing http://bub2.meteo.psu.edu/wxstn/wxstn.htm: Can't connect to bub2.meteo.psu.edu:80 (connect: timeout) at C:\Perl\scripts\i.pl line 19
The data for this date is available and I can access it manually by logging on to the site, so it's not that the server doesn't recognize the date.

What could be causing my program to fail at the same date each time? Also, it took about 3 minutes to get through October of the first year, yet I have over 100 years to go through. Is there any way for me to improve this code so that it runs faster?

Thanks a lot for any suggestions or advice!

use strict;
use warnings;

open(my $in, '<', "c:/perl/scripts/dates/dates.txt") or die "open: $!\n";
my @date;
while (<$in>) {
    chomp;
    push(@date, $_);
}
close($in);

use WWW::Mechanize;

foreach my $date (@date) {
    my $mech = WWW::Mechanize->new();
    $mech->get( "http://bub2.meteo.psu.edu/wxstn/wxstn.htm" );
    $mech->form_number(1);
    $mech->field( 'dtg' , $date );
    $mech->click();
    $mech->content();
    my $file = "c:/perl/scripts/dates/$date.txt";
    $mech->save_content($file);
    $mech->back();
}

Replies are listed 'Best First'.
Re: Connection Timeout during form submissions
by Perlbotics (Archbishop) on Jun 20, 2009 at 21:07 UTC

    Hi, did you check the content of dates.txt? Maybe there's a problem with that line, such as stray whitespace somewhere? The following alternative approach (cut, pasted, and modified from lwpcook) worked fine for me...
    Update: It accesses the service directly and does not need to fetch and parse the htm file each time. That should be slightly faster and reduce network traffic (I didn't benchmark it).

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;

    # POST the date directly to the service, skipping the form page
    my $req = HTTP::Request->new(POST => 'http://bub2.met.psu.edu/cgi-win/WXDaily.EXE');
    $req->content_type('application/x-www-form-urlencoded');
    $req->content('dtg=18961110');

    my $res = $ua->request($req);
    print $res->as_string;

    However, I am not sure if leeching approx. 41600 pages is a good idea. Maybe your IP or your user-agent is already on their web admin's blacklist? My advice would be to contact the person responsible for this service and kindly ask for the raw data. Universities usually share such information for research purposes. I don't know what they do if you plan to use this information in a commercial context, though.
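The dates.txt check suggested above can be done mechanically rather than by eye. This is only a minimal sketch: `find_bad_dates` is a hypothetical helper (not part of any module), and it assumes every entry should be exactly eight digits in YYYYMMDD form. Printing the entry between brackets makes trailing spaces or stray carriage returns visible.

```perl
use strict;
use warnings;

# Hypothetical helper: return [line_number, entry] pairs for every entry
# that is not exactly eight digits (YYYYMMDD) once the line ending is removed.
sub find_bad_dates {
    my (@lines) = @_;
    my @bad;
    my $line_no = 0;
    for my $line (@lines) {
        $line_no++;
        (my $entry = $line) =~ s/\r?\n\z//;    # strip LF or CRLF only
        push @bad, [ $line_no, $entry ] unless $entry =~ /\A\d{8}\z/;
    }
    return @bad;
}

# In-memory sample data; in practice pass the lines read from dates.txt.
my @sample = ("18961109\n", "18961110 \n", "18961111\n");
for my $bad (find_bad_dates(@sample)) {
    printf "line %d: [%s]\n", @$bad;
}
```

If this reports nothing for the real file, whitespace in dates.txt can be ruled out as the cause.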

      dates.txt looks fine. No whitespace or incorrect numbers around 18961110.

      As for leeching the site for the files, this is a university site for the college I attend, and I have been instructed to gather this info by my advising instructor. The faculty is aware that such projects are taking place. The real question is why does the program keep failing at 18961110?

        Ok. I will suggest this again, run your program for some dates like August 1, 1921 to December 23, 1922.

        I also think that you should be "polite" regarding the number of hits per second on the other website. The previous poster suggested this and I agree.

        Get your script working on a limited date range. Then expand that date range. Get your data and then "shut up". I would put some "sleep()" into the script and just let it run for a day. The data from 1920 isn't going to change. For your school project the objective shouldn't be: how to get this data as fast as possible, it should just be: how do I get this data?

        I also haven't yet seen any "this is what was sent" (the actual stuff) vs "this is what I received". I haven't seen any boundary test cases based upon what you have heard so far.
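One way to run the script on a limited date range first is to filter the list before the fetch loop. This is a rough sketch, not the original poster's code: `dates_between` is a hypothetical helper, and it assumes all entries are eight-digit YYYYMMDD strings, which compare correctly as plain strings.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only YYYYMMDD strings between $from and $to
# (inclusive). Fixed-width digit strings sort in the same order as the
# dates they encode, so string comparison is enough.
sub dates_between {
    my ($from, $to, @dates) = @_;
    return grep { $_ ge $from && $_ le $to } @dates;
}

# Sample data only; the real list would come from dates.txt.
my @all    = qw(18961109 19210801 19220101 19221223 19221224);
my @subset = dates_between('19210801', '19221223', @all);
print "Would fetch: @subset\n";
```

Printing the subset before any request is made also gives a simple "this is what will be sent" record to compare against what comes back.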

Re: Connection Timeout during form submissions
by Marshall (Canon) on Jun 20, 2009 at 21:01 UTC
    There is a problem using "Unix epoch" time. This is normally the number of seconds before or after Jan 1, 1970 (some Mac OSes vary) and is normally a 32-bit signed integer. A date in 1896 is further from the "epoch" than such an integer can represent.

    So you will have to do something different with @date for date/times before about year 1902.

    Update: Range from epoch is about 68 years.
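The "about 68 years" figure can be checked with a few lines of arithmetic. This sketch only verifies the range of a signed 32-bit seconds counter, assuming an average Gregorian year of 365.2425 days. It does not show whether the script in question ever converts its dates to epoch time.

```perl
use strict;
use warnings;

my $max_seconds      = 2**31;               # magnitude limit of a signed 32-bit counter
my $seconds_per_year = 365.2425 * 86_400;   # average Gregorian year in seconds
my $range_years      = $max_seconds / $seconds_per_year;

printf "Range: about %.1f years either side of 1970\n", $range_years;
printf "Roughly %d to %d\n", 1970 - $range_years, 1970 + $range_years;
```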

      Unless I'm confused about what exactly you're talking about, the dates in @date are already all listed out. So every entry I want to use is already stored for me.
        What I suspect is that your internal date calculation is not getting translated correctly from binary back into text. Maybe that idea is wrong. Could be. But then again, it seems to match up. Print the URL for the dates that are failing.
Re: Connection Timeout during form submissions
by Anonymous Monk on Jun 21, 2009 at 02:48 UTC
    Seems to me like you're being blocked, that'll teach you :)

    2 problems

    • you're creating a new connection for each date iteration and subsequently you're downloading http://bub2.meteo.psu.edu/wxstn/wxstn.htm 41638-1 times too many
    • You're not checking if you've already downloaded a particular date
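The second point can be sketched as a filter that runs before any request is made. `pending_dates` is a hypothetical helper, and the layout it assumes (one `DATE.txt` file per date in a single directory, with a minimum size to treat a file as complete) is an illustration rather than anything the site or the original script requires.

```perl
use strict;
use warnings;

# Hypothetical helper: return the dates that still need fetching, i.e. those
# with no saved file in $dir, or with a file smaller than $min_size bytes
# (which may be a truncated or failed download).
sub pending_dates {
    my ($dir, $min_size, @dates) = @_;
    return grep {
        my $file = "$dir/$_.txt";
        !( -e $file && (-s _) >= $min_size );
    } @dates;
}

# Example: the directory name and the 1000-byte threshold are assumptions.
my @todo = pending_dates('dates', 1000, qw(18961109 18961110));
print "Still to fetch: @todo\n";
```

With this in place, a run that dies partway through can simply be restarted and will resume at the first missing date. The first point is addressed by creating one WWW::Mechanize object before the loop instead of one per date.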
Re: Connection Timeout during form submissions
by Anonymous Monk on Jul 15, 2009 at 14:16 UTC
    #!/usr/bin/perl --
    use strict;
    use warnings;
    use autodie 2.06 qw':all';
    use File::Slurp 9999.13;
    use WWW::Mechanize 1.54;
    {
        chdir "c:/perl/scripts/";
        mkdir 'dates' unless -d 'dates';    # save dates there
        my @date = read_file('dates.txt');
        chomp(@date);

        my $mech = WWW::Mechanize->new( keep_alive => 1 );
        $mech->agent_alias('Windows IE 6');
        $mech->get("http://bub2.meteo.psu.edu/wxstn/wxstn.htm");

        STDOUT->autoflush(1);
        use IO::Handle;

        my $counter = 0;
        foreach my $date (@date) {
            $counter++;
            my $file = "dates/$date.txt";
            print "$date ( $file ) ";
            if ( -e $file and 1000 < -s _ ) {    # fixed size 10,443
                print " SKIPPING\n";
                next;
            }
            eval {
                $mech->submit_form(
                    form_number => 1,
                    fields      => { dtg => $date, }
                );
                $mech->save_content($file);
                1;
            } or print "$@";
            print "\n";
            $mech->back();
            sleep 1 if 0 == $counter % 10;    # sleep every 10th attempt
        }
    }