cheech has asked for the wisdom of the Perl Monks concerning the following question:

The code below is used to create a list of dates from 18960101 to 20090717 using Date::Simple, access and submit a form using Mechanize, and loop over all dates and scrape the resulting data pages for various climatological variables. The data is formatted using Fortran::Format and the array is printed to the screen and to a file.

I know the code is fairly messy, and I know there are many instances where several of my lines could be replaced with a single line or subroutine, etc. Please offer suggestions as to how I can reduce the total number of lines in the script or improve the readability in general. Also, any advice on improving the speed of the program would be appreciated.

Thanks for your input!

use strict; use warnings; # Declare globals my (@dates, @hlahdd, @rain, @snow); # Create date list use Date::Simple::D8 (':all'); my $start = Date::Simple::D8->new('19400101'); my $end = Date::Simple::D8->new('20090718'); while ( $start < $end ) { push @dates, "$start"; chomp(@dates); $start = $start->next; } # Open file for writing open(FH, '>', "c:/perl/scripts/496/hla.txt") or die "open: $!\n"; # Initiate browsing agent use WWW::Mechanize; my $url = "http://bub2.meteo.psu.edu/wxstn/wxstn.htm"; my $mech = WWW::Mechanize->new(keep_alive => 1); $mech->get($url); # Start the loop while (@dates) { # Submit the first form $mech->submit_form( form_number => 1, fields => { dtg => $dates[0], } ); # Download the resulting page, text only, and scrape for d +ata my $page = $mech->content(format=>'text'); my @data = ($page =~ /:\s\s\s\s(\d\d)/g); my @rain = ($page =~ /Rain or Liquid Equivalent\s+:\s+(\S* +)/); # Replace 'TRACE' if ($rain[0] eq 'TRACE') { $rain[0] = '0.00'; } my @snow = ($page =~ /Snow and\/or Ice Pellets\s+:\s+(\S*) +/); # Replace 'TRACE' if ($snow[0] eq 'TRACE') { $snow[0] = '0.00'; } my @depth = ($page =~ /Snow Depth\s+:\s+(\S*)/); # Replace '(N/A)/TRACE/0' if ($depth[0] eq '(N/A)' or $depth[0] eq 'TRACE' or$depth[ +0] eq '0') { $depth[0] = "99"; } my @hdd = ($page =~ /Degree-Days\s+:\s+(\S*)/); # Format the output for Fortran analysis use Fortran::Format; my $fdepth = Fortran::Format->new("I2.1,6X")->write($depth +[0]); chomp $fdepth; my $frain = Fortran::Format->new("F4.2,6X")->write($rain[0 +]); chomp $frain; my $fsnow = Fortran::Format->new("F4.2,6X")->write($snow[0 +]); chomp $fsnow; my $f = Fortran::Format->new("I2.1,6X")->write($hdd[0]); chomp $f; # Assign data to the array @hlahdd = ("$dates[0] $data[0] $data[1] $data[2] $fdepth $ +f $frain $fsnow\n"); # Print the array to screen and to file print "@hlahdd"; print FH "@hlahdd"; # Slow down boi... then go back a page sleep .1; $mech->back(); shift(@dates); } # Exit the loop # Close the written file close(FH);

Replies are listed 'Best First'.
Re: Improve My Code...
by GrandFather (Saint) on Aug 02, 2009 at 01:17 UTC

    1/ discover perltidy and use it.

    2/ Don't make variables needlessly global and don't declare anything before you need to. For example, declare @dates immediately before the while loop where it is populated. @rain and @snow are not used globally at all and in fact are redeclared inside a loop - that is just plain nasty! @hlahdd is used in a very small context, but is declared globally.

    3/ Use lexical file handles and good variable names generally. If FH were called $outFile the various comments describing FH's use would be redundant.

    4/ Avoid redundant interpolation - push @dates, "$start"; is better written push @dates, $start;.

    5/ Only the first element of @rain is used. Replace it with a scalar instead. The assignment becomes: my ($rain) = $page =~ /.../;. Ditto @snow, @depth, @hdd and @hlahdd.

    6/ built in functions such as open, close, push and shift don't require parenthesis and are generally used without them.


    True laziness is hard work
      Thanks for your suggestions, I appreciate it. This all made sense except one thing.. why when I change my pattern-match variables from arrays to scalars (@rain to $rain = $page +~ /.../ for example) do I need to have parentheses around them to get the output I want. Does it have something to do with scalar vs list data? Thanks again!
        Does it have something to do with scalar vs list data?

        Indeed it does. In scalar context (my $var =) a match regex returns true or false. In list context (my ($var) =) a match regex returns the list of captures. You are interested in just the contents of the first capture so:

        my ($var) = $page =~ /...(...).../;

        is the appropriate syntax.

        Update: provide complete code line to illustrate appropriate list context usage.


        True laziness is hard work
Re: Improve My Code...
by graff (Chancellor) on Aug 02, 2009 at 02:49 UTC
    Looking at this part:
    while ( $start < $end ) { push @dates, "$start"; chomp(@dates); $start = $start->next; }
    The "chomp" is completely unnecessary, because the string returned by the Date::Simple object "$start" does not contain any sort of new-line character at the end. And even in cases where "chomp" should be used, calling it repeatedly on the full array each time you add one element to the array is frankly a bit stupid -- either chomp each string before pushing it on the array, or else chomp the array just once after the loop finishes (not inside the loop).

    As for this part, pulling data out of the html page:

    my @data = ($page =~ /:\s\s\s\s(\d\d)/g);
    I wouldn't be so trusting as to assume that the web service you are pounding on so heavily will always use exactly four spaces and two digits for whatever it is that "@data" is supposed to capture, or that the particular values you want in @data will always be the first three strings in the page that match that pattern. You should use a real HTML parsing module, and base your data extraction on more explicit evidence about the content.

    (Update: Having just seen a sample of the page delivered by that Penn State web form, I'll retract the advice about using an HTML parser. Each page of single-day data is just a sequence of fixed-width plain text lines with a '<PRE>' tag at the start of each line. Interesting approach... In any case, the method suggested below would be an improvement: it can apply generally to any set of data fields on the page, and all you have to do is manage a hash of targeted data fields.)

    In addition the GrandFather's point about capturing a single value into a scalar instead of an array, your code would actually make more sense using a hash, and you could eliminate two repetitions of 7 lines of code by looping over hash keys "rain snow depth". (A hash of hashes can also store the regex to be matched and format to be used for each key, as well as the extracted values.) That way, the next time you change something that affects all three fields, you only have to make one edit instead of three. (DRY: Don't Repeat Yourself.)

    Anyway, I agree with merlyn's point that this is an offensive amount of web traffic to get a small amount of data that should be more easily (and reliably) accessible, in a more usable tabular form, by some other means.

Re: Improve My Code...
by merlyn (Sage) on Aug 02, 2009 at 00:39 UTC
    The code below is used to create a list of dates from 18960101 to 20090717 using Date::Simple, access and submit a form using Mechanize
    Just in principle, I'm going to object to this code on moral ground. That's roughly 45000 web hits, going through all the machinery to generate a web page for a non-human entity that will throw away 90% of what's generated. Even if it's your own machine (and if so, why bother with the web interface), it's a big waste of CPU. And if it's not your machine, it's unethical, unless you have written permission from the owners.

    Just because you can scrape something doesn't mean you should. Please learn the difference.

    -- Randal L. Schwartz, Perl hacker

    The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

      According to the site's Statement of Purpose,
      We now have the capability to let the user choose a date of interest and get the weather statistics from that date. The entire record of daily observations is now available. Graphical representation of the data, especially long-term data, is included and is being augmented.

      cheech's use of the site is arguably reasonable. And maybe his hits will help them justify funding for more bandwidth or server horsepower--so he could be doing them a favor.

      Respectfully,
      Matt
        It says "the user choose a date". If I was writing that, I'd have the expectation that humans would be entering a date, not a mechanized attack of all dates.

        True, there's nothing that forbids what he wants to do, but it's also a matter of courtesy. Don't go hammering someone else's machine without permission, and I wouldn't consider that paragraph permission, just description.

        -- Randal L. Schwartz, Perl hacker

        The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

        Maybe they'll think about repackaging the data. Its static, organized by date, and a CGI interface isn't really the best way to deliver it.
      unless you have written permission from the owners.

      Go figure, I do. Read the op again, you didn't get what I was asking for I guess..

      FYI, it was made to be scraped. Please don't crusade at every opportunity.
Re: Improve My Code...
by Gangabass (Vicar) on Aug 02, 2009 at 11:24 UTC

    Also

    while (@dates) { ...... shift(@dates); }

    better to write as:

    foreach my $date (@dates) { ..... }
      I initially used foreach(@array){...} instead of while(@array){ ... shift(@array)} but changed it because I was getting booted from the server after a few hundred iterations, and was planning on restarting the loop from the current iteration. As it turned out, the problem was on my end of the connection, so I never needed to implement that and just never went back to foreach. Why though is it superior to while in this case?
Re: Improve My Code...
by toolic (Bishop) on Aug 02, 2009 at 03:06 UTC
    chomp(@dates);
    <silly>

    Only if they are pitted dates! (Otherwise, it's off to the dentist.)

    Looks like you've got some Perl poetry there :)

Re: Improve My Code...
by spazm (Monk) on Aug 02, 2009 at 22:13 UTC
    Seriously, he has to output to FORTRAN! ( Fortran::Format ). Think what that's doing to him! Show him some perl love!

    I've chosen a data driven design. I've extracted all the field options to one place $fields: defining the regexp to parse the string, possible fixups and the fortran output format.

    I've added Getopt::Long to let the user pick a specific range of dates. All of the configurable information is at the start of the script. I'd eventually pull the parsing into a module with a callback to print the output lines, so that the script just contains the part of user interest.

    WWW::Mechanize::Cached will build a nice local-side cache, which is handy when you realize half-way through a huge scrape run that you need to grab an extra field. I query if the last request came out of cache, and skip sleeping before the next request.

    #!/usr/bin/perl use strict; use warnings; use WWW::Mechanize::Cached; use Date::Simple::D8 (':all'); use Fortran::Format; use Getopt::Long; #configuration: my $start_date = '19400101'; my $last_date = '20090718'; my $filename = "c:/perl/scripts/496/hla.txt"; my $root_url = "http://bub2.meteo.psu.edu/wxstn/wxstn.htm"; my $help = 0; my $result = GetOptions( "start=s" => \$start_date, "end=s" => \$last_date, "help" => \$help, ); my $usage = <<EOS; $0 - Parse $root_url from $start_date to $last_date Options: --start=19400101 --end=20090718 EOS die $usage if ( $help or !$result ); my $fields = { depth => { regexp => qr/Rain or Liquid Equivalent\s+:\s+(\S*)/, format => "I2.1,6X", fix => { TRACE => 99, '(N/A)' => 99, '0' => 99 }, }, rain => { regexp => qr/Snow and\/or Ice Pellets\s+:\s+(\S*)/, format => "F4.2,6X", fix => { TRACE => '0.00' }, }, snow => { regexp => qr/Snow Depth\s+:\s+(\S*)/, format => "F4.2,6X", fix => { TRACE => '0.00' }, }, hdd => { regexp => qr/Degree-Days\s+:\s+(\S*)/, format => "I2.1,6X", } }; sub print_output { my ( $date, $data, $fortran ) = @_; my @data = @$data; my %fortran = %$fortran; my $output = join( " ", $date, @data[ 0 .. 2 ], @fortran{qw( depth hdd rain s +now )} ) . "\n"; print $output; print OUTPUT $output; } #### end configuration # Open file for writing open( OUTPUT, '>', $filename ) or die "open: $!\n"; # Initiate browsing agent my $mech = WWW::Mechanize::Cached->new( keep_alive => 1 ); # Create date list my $date = Date::Simple::D8->new($start_date)->prev; my $end_date = Date::Simple::D8->new($last_date); $mech->get($root_url); while ( $date->next <= $end ) { # Submit the first form my $resp = $mech->submit_form( form_number => 1, fields => { dtg => $date } ); # Download the resulting page, text only, and scrape for data my $page = $mech->content( format => 'text' ); my @data = ( $page =~ /:\s\s\s\s(\d\d)/g ); my %fortran; foreach $field ( keys %$fields ) { my $regexp = $fields->{$field}->{regexp}; my $format = $fields->{$field}->{format}; my $fix = $fields->{$field}->{fix} || {}; #parse page for this field my ($parsed) = $page =~ /$regexp/; #fix field foreach my $key ( keys %fix ) { $parsed = $fix{$key} if $parsed eq $key; } # Format the output for Fortran analysis chomp( my $f = Fortran::Format->new($format)->write($parsed) ) +; $fortran{$field} = $f; } # Prepare output for screen and file print_output( $date, \@data, \%fortran ); sleep .1 unless $mech->is_cached(); $mech->back(); } # Exit the loop # Close the written file close(FH);
      yes/no/maybe-so? What'd ya think?

      Specifically, I don't like how I treat @data differently from the other fields. What is in @data? Maybe all of the $fields should be grabbed as arrays, with a subfield indicating the desired indices?