Re: Improve My Code...
by GrandFather (Saint) on Aug 02, 2009 at 01:17 UTC
|
1/ discover perltidy and use it.
2/ Don't make variables needlessly global and don't declare anything before you need to. For example, declare @dates immediately before the while loop where it is populated. @rain and @snow are not used globally at all and in fact are redeclared inside a loop - that is just plain nasty! @hlahdd is used in a very small context, but is declared globally.
3/ Use lexical file handles and good variable names generally. If FH were called $outFile the various comments describing FH's use would be redundant.
4/ Avoid redundant interpolation - push @dates, "$start"; is better written push @dates, $start;.
5/ Only the first element of @rain is used. Replace it with a scalar instead. The assignment becomes: my ($rain) = $page =~ /.../;. Ditto @snow, @depth, @hdd and @hlahdd.
6/ built in functions such as open, close, push and shift don't require parenthesis and are generally used without them.
True laziness is hard work
| [reply] [d/l] [select] |
|
|
Thanks for your suggestions, I appreciate it. This all made sense except one thing.. why when I change my pattern-match variables from arrays to scalars (@rain to $rain = $page +~ /.../ for example) do I need to have parentheses around them to get the output I want. Does it have something to do with scalar vs list data? Thanks again!
| [reply] [d/l] [select] |
|
|
my ($var) = $page =~ /...(...).../;
is the appropriate syntax.
Update: provide complete code line to illustrate appropriate list context usage.
True laziness is hard work
| [reply] [d/l] [select] |
Re: Improve My Code...
by graff (Chancellor) on Aug 02, 2009 at 02:49 UTC
|
while ( $start < $end ) {
push @dates, "$start";
chomp(@dates);
$start = $start->next;
}
The "chomp" is completely unnecessary, because the string returned by the Date::Simple object "$start" does not contain any sort of new-line character at the end. And even in cases where "chomp" should be used, calling it repeatedly on the full array each time you add one element to the array is frankly a bit stupid -- either chomp each string before pushing it on the array, or else chomp the array just once after the loop finishes (not inside the loop).
As for this part, pulling data out of the html page:
my @data = ($page =~ /:\s\s\s\s(\d\d)/g);
I wouldn't be so trusting as to assume that the web service you are pounding on so heavily will always use exactly four spaces and two digits for whatever it is that "@data" is supposed to capture, or that the particular values you want in @data will always be the first three strings in the page that match that pattern. You should use a real HTML parsing module, and base your data extraction on more explicit evidence about the content.
(Update: Having just seen a sample of the page delivered by that Penn State web form, I'll retract the advice about using an HTML parser. Each page of single-day data is just a sequence of fixed-width plain text lines with a '<PRE>' tag at the start of each line. Interesting approach... In any case, the method suggested below would be an improvement: it can apply generally to any set of data fields on the page, and all you have to do is manage a hash of targeted data fields.)
In addition the GrandFather's point about capturing a single value into a scalar instead of an array, your code would actually make more sense using a hash, and you could eliminate two repetitions of 7 lines of code by looping over hash keys "rain snow depth". (A hash of hashes can also store the regex to be matched and format to be used for each key, as well as the extracted values.) That way, the next time you change something that affects all three fields, you only have to make one edit instead of three. (DRY: Don't Repeat Yourself.)
Anyway, I agree with merlyn's point that this is an offensive amount of web traffic to get a small amount of data that should be more easily (and reliably) accessible, in a more usable tabular form, by some other means. | [reply] [d/l] [select] |
Re: Improve My Code...
by merlyn (Sage) on Aug 02, 2009 at 00:39 UTC
|
| [reply] |
|
|
| [reply] |
|
|
It says "the user choose a date". If I was writing that, I'd have the expectation that humans would be entering a date, not a mechanized attack of all dates.
True, there's nothing that forbids what he wants to do, but it's also a matter of courtesy. Don't go hammering someone else's machine without permission, and I wouldn't consider that paragraph permission, just description.
-- Randal L. Schwartz, Perl hacker
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
| [reply] |
|
|
Maybe they'll think about repackaging the data.
Its static, organized by date, and a CGI interface isn't really the best way to deliver it.
| [reply] |
|
|
|
|
|
|
| [reply] |
|
|
| [reply] |
|
|
FYI, it was made to be scraped. Please don't crusade at every opportunity.
| [reply] |
Re: Improve My Code...
by Gangabass (Vicar) on Aug 02, 2009 at 11:24 UTC
|
while (@dates) {
......
shift(@dates);
}
better to write as:
foreach my $date (@dates) {
.....
}
| [reply] [d/l] [select] |
|
|
I initially used foreach(@array){...} instead of while(@array){ ... shift(@array)} but changed it because I was getting booted from the server after a few hundred iterations, and was planning on restarting the loop from the current iteration. As it turned out, the problem was on my end of the connection, so I never needed to implement that and just never went back to foreach. Why though is it superior to while in this case?
| [reply] [d/l] [select] |
Re: Improve My Code...
by toolic (Bishop) on Aug 02, 2009 at 03:06 UTC
|
chomp(@dates);
<silly>
Only if they are pitted dates! (Otherwise, it's off to the dentist.)
Looks like you've got some Perl poetry there :)
| [reply] |
Re: Improve My Code...
by spazm (Monk) on Aug 02, 2009 at 22:13 UTC
|
Seriously, he has to output to FORTRAN! ( Fortran::Format ). Think what that's doing to him! Show him some perl love!
I've chosen a data driven design. I've extracted all the field options to one place $fields: defining the regexp to parse the string, possible fixups and the fortran output format.
I've added Getopt::Long to let the user pick a specific range of dates. All of the configurable information is at the start of the script. I'd eventually pull the parsing into a module with a callback to print the output lines, so that the script just contains the part of user interest.
WWW::Mechanize::Cached will build a nice local-side cache, which is handy when you realize half-way through a huge scrape run that you need to grab an extra field. I query if the last request came out of cache, and skip sleeping before the next request.
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize::Cached;
use Date::Simple::D8 (':all');
use Fortran::Format;
use Getopt::Long;
#configuration:
my $start_date = '19400101';
my $last_date = '20090718';
my $filename = "c:/perl/scripts/496/hla.txt";
my $root_url = "http://bub2.meteo.psu.edu/wxstn/wxstn.htm";
my $help = 0;
my $result = GetOptions(
"start=s" => \$start_date,
"end=s" => \$last_date,
"help" => \$help,
);
my $usage = <<EOS;
$0 - Parse $root_url from $start_date to $last_date
Options:
--start=19400101
--end=20090718
EOS
die $usage if ( $help or !$result );
my $fields = {
depth => {
regexp => qr/Rain or Liquid Equivalent\s+:\s+(\S*)/,
format => "I2.1,6X",
fix => { TRACE => 99, '(N/A)' => 99, '0' => 99 },
},
rain => {
regexp => qr/Snow and\/or Ice Pellets\s+:\s+(\S*)/,
format => "F4.2,6X",
fix => { TRACE => '0.00' },
},
snow => {
regexp => qr/Snow Depth\s+:\s+(\S*)/,
format => "F4.2,6X",
fix => { TRACE => '0.00' },
},
hdd => {
regexp => qr/Degree-Days\s+:\s+(\S*)/,
format => "I2.1,6X",
}
};
sub print_output {
my ( $date, $data, $fortran ) = @_;
my @data = @$data;
my %fortran = %$fortran;
my $output =
join( " ", $date, @data[ 0 .. 2 ], @fortran{qw( depth hdd rain s
+now )} )
. "\n";
print $output;
print OUTPUT $output;
}
#### end configuration
# Open file for writing
open( OUTPUT, '>', $filename ) or die "open: $!\n";
# Initiate browsing agent
my $mech = WWW::Mechanize::Cached->new( keep_alive => 1 );
# Create date list
my $date = Date::Simple::D8->new($start_date)->prev;
my $end_date = Date::Simple::D8->new($last_date);
$mech->get($root_url);
while ( $date->next <= $end ) {
# Submit the first form
my $resp = $mech->submit_form(
form_number => 1,
fields => { dtg => $date }
);
# Download the resulting page, text only, and scrape for data
my $page = $mech->content( format => 'text' );
my @data = ( $page =~ /:\s\s\s\s(\d\d)/g );
my %fortran;
foreach $field ( keys %$fields ) {
my $regexp = $fields->{$field}->{regexp};
my $format = $fields->{$field}->{format};
my $fix = $fields->{$field}->{fix} || {};
#parse page for this field
my ($parsed) = $page =~ /$regexp/;
#fix field
foreach my $key ( keys %fix ) {
$parsed = $fix{$key}
if $parsed eq $key;
}
# Format the output for Fortran analysis
chomp( my $f = Fortran::Format->new($format)->write($parsed) )
+;
$fortran{$field} = $f;
}
# Prepare output for screen and file
print_output( $date, \@data, \%fortran );
sleep .1 unless $mech->is_cached();
$mech->back();
} # Exit the loop
# Close the written file
close(FH);
| [reply] [d/l] [select] |
|
|
| [reply] [d/l] [select] |