in reply to Improve My Code...

Seriously, he has to output to FORTRAN (via Fortran::Format)! Think what that's doing to him! Show him some Perl love!

I've chosen a data-driven design. All the per-field options now live in one place, $fields, which defines for each field the regexp that parses it out of the page, any fixups to apply to the parsed value, and the Fortran output format.

I've added Getopt::Long to let the user pick a specific range of dates. All of the configurable information is at the start of the script. Eventually I'd pull the parsing into a module that takes a callback to print the output lines, so that the script contains only the part the user cares about.
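A rough sketch of how that callback idea might look, using only core Perl. The name parse_page, its argument order, and the sample page text are all made up for illustration; this is not an existing module's interface, just one possible shape for it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical parser: applies each field's regexp to the page text,
# applies any fixups, and hands the results to a caller-supplied callback.
sub parse_page {
    my ( $page, $fields, $callback ) = @_;
    my %parsed;
    for my $name ( keys %$fields ) {
        my $spec    = $fields->{$name};
        my $regexp  = $spec->{regexp};
        my $fix     = $spec->{fix} || {};
        my ($value) = $page =~ /$regexp/;
        $value = $fix->{$value} if defined $value && exists $fix->{$value};
        $parsed{$name} = $value;
    }
    return $callback->( \%parsed );
}

# Two field specs in the same style as $fields in the script
my $fields = {
    snow => {
        regexp => qr/Snow Depth\s+:\s+(\S*)/,
        fix    => { TRACE => '0.00' },
    },
    hdd => { regexp => qr/Degree-Days\s+:\s+(\S*)/ },
};

# Made-up sample text standing in for a downloaded page
my $page = "Snow Depth : TRACE\nDegree-Days : 12\n";

# The caller decides what an output line looks like
my $line = parse_page(
    $page, $fields,
    sub {
        my ($row) = @_;
        return join( ' ', map { "$_=$row->{$_}" } sort keys %$row );
    }
);
print "$line\n";
```

With this split, the script would only need to supply $fields and the callback; fetching and parsing could move into the module.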

WWW::Mechanize::Cached builds a local cache of the responses, which is handy when you realize halfway through a huge scrape run that you need to grab an extra field. After each request I check whether the response came out of the cache and, if it did, skip the sleep before the next request.

#!/usr/bin/perl
use strict;
use warnings;

use WWW::Mechanize::Cached;
use Date::Simple::D8 (':all');
use Fortran::Format;
use Getopt::Long;
use Time::HiRes qw(sleep);    # the built-in sleep truncates .1 to 0

#configuration:
my $start_date = '19400101';
my $last_date  = '20090718';
my $filename   = "c:/perl/scripts/496/hla.txt";
my $root_url   = "http://bub2.meteo.psu.edu/wxstn/wxstn.htm";
my $help       = 0;

my $result = GetOptions(
    "start=s" => \$start_date,
    "end=s"   => \$last_date,
    "help"    => \$help,
);

my $usage = <<EOS;
$0 - Parse $root_url from $start_date to $last_date
Options:
  --start=19400101
  --end=20090718
EOS
die $usage if ( $help or !$result );

# One entry per field: how to parse it, how to fix it up, how to format it
my $fields = {
    depth => {
        regexp => qr/Rain or Liquid Equivalent\s+:\s+(\S*)/,
        format => "I2.1,6X",
        fix    => { TRACE => 99, '(N/A)' => 99, '0' => 99 },
    },
    rain => {
        regexp => qr/Snow and\/or Ice Pellets\s+:\s+(\S*)/,
        format => "F4.2,6X",
        fix    => { TRACE => '0.00' },
    },
    snow => {
        regexp => qr/Snow Depth\s+:\s+(\S*)/,
        format => "F4.2,6X",
        fix    => { TRACE => '0.00' },
    },
    hdd => {
        regexp => qr/Degree-Days\s+:\s+(\S*)/,
        format => "I2.1,6X",
    },
};

sub print_output {
    my ( $date, $data, $fortran ) = @_;
    my @data    = @$data;
    my %fortran = %$fortran;
    my $output  = join( " ",
        $date, @data[ 0 .. 2 ], @fortran{qw( depth hdd rain snow )} )
      . "\n";
    print $output;
    print OUTPUT $output;
}
#### end configuration

# Open file for writing
open( OUTPUT, '>', $filename ) or die "open: $!\n";

# Initiate browsing agent
my $mech = WWW::Mechanize::Cached->new( keep_alive => 1 );

# Create date list
my $date     = Date::Simple::D8->new($start_date)->prev;
my $end_date = Date::Simple::D8->new($last_date);

$mech->get($root_url);

# Date::Simple objects are immutable: next() returns a new date,
# so assign it back to $date to advance the loop.
while ( ( $date = $date->next ) <= $end_date ) {

    # Submit the first form
    my $resp = $mech->submit_form(
        form_number => 1,
        fields      => { dtg => $date }
    );

    # Download the resulting page, text only, and scrape for data
    my $page = $mech->content( format => 'text' );
    my @data = ( $page =~ /:\s\s\s\s(\d\d)/g );

    my %fortran;
    foreach my $field ( keys %$fields ) {
        my $regexp = $fields->{$field}->{regexp};
        my $format = $fields->{$field}->{format};
        my $fix    = $fields->{$field}->{fix} || {};

        #parse page for this field
        my ($parsed) = $page =~ /$regexp/;

        #fix field
        $parsed = $fix->{$parsed}
          if defined $parsed && exists $fix->{$parsed};

        # Format the output for Fortran analysis
        chomp( my $f = Fortran::Format->new($format)->write($parsed) );
        $fortran{$field} = $f;
    }

    # Prepare output for screen and file
    print_output( $date, \@data, \%fortran );

    # Only pause when the request actually hit the server
    sleep .1 unless $mech->is_cached();
    $mech->back();
}    # Exit the loop

# Close the written file
close(OUTPUT);

Re^2: Improve My Code...
by spazm (Monk) on Aug 03, 2009 at 06:46 UTC
    yes/no/maybe-so? What'd ya think?

    Specifically, I don't like how I treat @data differently from the other fields. What is in @data? Maybe all of the $fields should be grabbed as arrays, with a subfield indicating the desired indices?
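One way that idea might look, as a rough sketch in core Perl only. The helper extract_field, the indices key, the field name data, and the sample page text are all hypothetical; since it isn't clear what @data actually holds, the labels in the sample are invented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: capture every match for a field, then keep only the
# positions listed in the 'indices' subfield (defaulting to the first match).
sub extract_field {
    my ( $page, $spec ) = @_;
    my $regexp  = $spec->{regexp};
    my @matches = $page =~ /$regexp/g;
    my @indices = @{ $spec->{indices} || [0] };
    return map { $matches[$_] } @indices;
}

my $fields = {
    # Same pattern the script uses for @data, keeping the first three hits
    data => { regexp => qr/:\s{4}(\d\d)/, indices => [ 0 .. 2 ] },

    # A single-valued field is just the one-index case
    hdd => { regexp => qr/Degree-Days\s+:\s+(\S*)/, indices => [0] },
};

# Made-up sample text; the labels are invented for illustration
my $page = "High :    71\nLow :    52\nMean :    62\nOther :    99\n"
  . "Degree-Days : 3\n";

my @data = extract_field( $page, $fields->{data} );
my @hdd  = extract_field( $page, $fields->{hdd} );

print "data: @data\n";
print "hdd: @hdd\n";
```

That way @data stops being a special case: it becomes one more entry in $fields, and the output step can treat every field the same.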