no1uno has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to write a program that will create a data file using information retrieved from the internet. Note that I am pretty new to Perl; I am used to programming in C and Java and have decided to use Perl in this case because of its easy-to-use libraries for automated internet interaction. My original code is this:
#!c:\perl\bin -w
use strict;
use warnings;
use LWP 5.8;

my $browser = LWP::UserAgent->new;
$browser->timeout(60);
push @{ $browser->requests_redirectable }, 'POST';

my $cityInfoMainPageURL = 'http://www.ihoz.com/ilist.html';
my $distanceFinderURL = 'http://www.randmcnally.com/rmc/directions/dirGetMileageInput.jsp';

my $cities = $browser->get($cityInfoMainPageURL);
my $distIn = $browser->get($distanceFinderURL);

die ("Can't get $cityInfoMainPageURL -- ", $cities->status_line)
    unless $cities->is_success;
die ("Can't get $distanceFinderURL -- ", $distIn->status_line)
    unless $distIn->is_success;

print ($distIn->base, "\n");
print ($cities->base, "\n");

my $strtCity = 'Miami';
my $strtState = 'FL';
my $destCity = 'Albany';
my $destState = 'NY';

my $cityResponse = $browser->post (
    $distanceFinderURL,
    [
        'txtStartCity' => $strtCity,
        'txtStartState' => $strtState,
        'txtDestCity' => $destCity,
        'txtDestState' => $destState,
    ]
);

die ("error submiting form") unless $cityResponse->is_success;

print ($cityResponse->status_line, "\n");
print ($cityResponse->base, "\n");

$cityResponse->content =~ /Driving Distance:.*([1-9][0-9]*|0) miles/;
print (1); # this used to be "print ($1)" but the warning
           # about the undefined variable was annoying
This was mostly a copy/paste/replace operation using a bit of the Perl tutorial off Perl.org and the first two pages from this site. Here is my program's output:
http://www.randmcnally.com/rmc/directions/dirGetMileageInput.jsp
http://www.ihoz.com/ilist.html
200 OK
http://www.randmcnally.com/rmc/directions/dirGetMileageInput.jsp
1
Obviously, most of the output is for debugging the fact that my post command doesn't seem to be posting. To try to narrow down the source of the problem, I ran an altered version of the sample I found here, and it produced this output:
http://www.altavista.com/
Couldn't find the match-string in the response
Here is my alteration (the original is the last example at the link above):
use strict;
use warnings;
use LWP 5.64;

my $browser = LWP::UserAgent->new;
push @{ $browser->requests_redirectable }, 'POST';

my $word = 'tarragon';
my $url = 'http://www.altavista.com/';

my $response = $browser->post( $url,
    [
        'q' => $word,  # the Altavista query string
        # 'pg' => 'q', 'avkw' => 'tgz', 'kl' => 'XX',
    ]
);
die "$url error!!: ", $response->status_line
    unless $response->is_success;
die "Weird content type at $url -- ", $response->content_type
    unless $response->content_type eq 'text/html';

print ($response->base . "\n");

if( $response->content =~ m{(AltaVista|Alta Vista).* .*found.* .*([0-9,]+).* .*results} ) {
    # The substring will be like "AltaVista found 2,345 results"
    print "$word: $2\n";
}
else {
    print "Couldn't find the match-string in the response\n";
}
Note that the main alterations are for debugging, the switch to a more lenient regex, and an accommodation for the apparent fact that AltaVista has made some changes since the sample was written. Since it appears to me that, so long as the sample is good code, my program should work, I suspect that the problem might not be with the program but with the computer. I have ZoneAlarm, McAfee (or maybe not), and SafeEyes, and am running Windows XP.

Replies are listed 'Best First'.
Re: program fails to get response that should be returned by UserAgent->post
by Cody Pendant (Prior) on Jul 11, 2007 at 05:14 UTC
    Why not print out the content of the page? Just change the last line of your first example to
    print $cityResponse->content;
    and you get the HTML which contains, where you'd expect your information to be, "We're sorry. The page you're trying to access is temporarily unavailable. Please try again later."

    This could just be because they don't want you scraping their data, they want you to look at their ads, or it could be something more fiddly to do with JavaScript or cookies. Anyway, your code seems to be working just fine, only their server doesn't like it.
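A minimal sketch of that point, assuming you want the script to tell "the server answered" apart from "the server gave me what I asked for". The helper name `classify_body` is hypothetical, and it works on a plain string so it runs without network access; in the real script the string would come from `$cityResponse->content`. The sample bodies are made up, apart from the apology text quoted above.

```perl
use strict;
use warnings;

# Hypothetical helper: report what a response body actually contains.
# Returns a two-element list: (status, detail).
sub classify_body {
    my ($html) = @_;

    # A server can answer "200 OK" and still serve an apology page.
    return ( 'unavailable', 'server returned its apology page' )
        if $html =~ /temporarily unavailable/i;

    # Only trust the capture when the match itself succeeded.
    return ( 'ok', $1 )
        if $html =~ /Driving Distance:.*?(\d+) miles/s;

    return ( 'unknown', 'no recognizable marker in the body' );
}

# Made-up sample bodies; the distance figure is invented for illustration.
my %samples = (
    apology => q{<p>We're sorry. The page you're trying to access is temporarily unavailable. Please try again later.</p>},
    result  => q{<b>Driving Distance:</b> 1409 miles},
    other   => q{<html><body>Hello</body></html>},
);

for my $name ( sort keys %samples ) {
    my ( $status, $detail ) = classify_body( $samples{$name} );
    print "$name: $status ($detail)\n";
}
```

Checking the body this way keeps `is_success` for transport-level errors and adds a separate test for what the page says.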

    I think you should check out WWW::Mechanize, and you should note that the article you're basing your code on is five years old.



    Nobody says perl looks like line-noise any more
    kids today don't know what line-noise IS ...
Re: program fails to get response that should be returned by UserAgent->post
by Cody Pendant (Prior) on Jul 11, 2007 at 05:29 UTC
    This seems to work:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;

    my $browser = WWW::Mechanize->new;
    my $distanceFinderURL =
        'http://www.randmcnally.com/rmc/directions/dirGetMileageInput.jsp';

    my $distIn = $browser->get($distanceFinderURL);
    die( "Can't get $distanceFinderURL -- ", $distIn->status_line )
        unless $distIn->is_success;

    # Select the form by its name attribute, then fill in the visible fields.
    $browser->form_name('frmGetDirections');
    $browser->set_fields(
        'txtStartCity' => 'Miami',
        'txtStartState' => "FL",
        'txtDestCity' => 'Albany',
        'txtDestState' => 'NY'
    );
    $browser->submit();

    # Test the match itself; $1 is only meaningful after a successful match.
    if ( $browser->content =~ /Driving Distance:.*?(\d+) miles/ ) {
        print "distance: $1 miles\n";
    }
    else {
        print "couldn't find the 'Driving Distance' string on the page\n";
    }

    The problem might have to do with some hidden fields in the form. WWW::Mechanize takes care of all that stuff behind the scenes, that's the great thing about it.
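To see what "hidden fields" means in practice, here is a small sketch using HTML::Form, the module WWW::Mechanize uses to represent forms. It parses a made-up form offline. The form name and text field names come from this thread; the hidden field `sessionToken`, its value, the action path, and the base URL are all hypothetical, since I haven't reproduced the real page here.

```perl
use strict;
use warnings;
use HTML::Form;

# Made-up form for illustration. "sessionToken" and the action path are
# hypothetical; the real page may use different hidden fields.
my $html = <<'HTML';
<form name="frmGetDirections" method="post" action="/rmc/directions/result.jsp">
  <input type="hidden" name="sessionToken" value="abc123">
  <input type="text" name="txtStartCity">
  <input type="text" name="txtStartState">
  <input type="submit" name="go" value="Go">
</form>
HTML

my $base = 'http://www.example.com/';    # placeholder base URL
my ($form) = HTML::Form->parse( $html, $base );

# List the hidden inputs that a hand-written POST would have left out.
my @hidden = grep { $_->type eq 'hidden' } $form->inputs;
print "hidden: ", $_->name, " = ", $_->value, "\n" for @hidden;

# Fill in the visible fields; hidden ones keep the values the page supplied.
$form->value( txtStartCity  => 'Miami' );
$form->value( txtStartState => 'FL' );

# form() returns the name/value pairs that would be submitted.
my %submitted = $form->form;
print "$_ => $submitted{$_}\n" for sort keys %submitted;
```

A hand-built field list like the one in the original post sends only the four text fields, so anything the server expects from hidden inputs never arrives.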



Re: program fails to get response that should be returned by UserAgent->post
by ww (Archbishop) on Jul 11, 2007 at 14:48 UTC
    Semi-OT, and re your first example code, you wrote, at lines 51-52:
    this used to be "print ($1)" but the warning about the undefined variable was annoying

    Yes, a warning that one doesn't understand can be "annoying", but its underlying purpose is what's important. (It is not relevant here, but there are cases where one must turn off warnings to avoid meaningless annoyances. As a general rule, though, it's not a good plan to turn them off until you're absolutely sure why you're getting the warning and that turning warnings off for that segment is, in fact, the best way to handle the issue.)
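For the cases where switching a warning off really is the right call, a sketch of doing it for one segment only. It relies on the `warnings` pragma being lexically scoped, so `no warnings 'uninitialized'` inside a block affects nothing outside it. The variable names are made up, and warnings are collected in an array here (rather than printed) just so the effect can be counted.

```perl
use strict;
use warnings;

# Collect warnings in an array instead of printing them, so the effect
# of each block can be inspected afterwards.
my @seen;
$SIG{__WARN__} = sub { push @seen, $_[0] };

my $distance;    # deliberately left undefined, like $1 after a failed match

{
    # Warnings are on here: interpolating undef is reported.
    my $text = "distance: $distance";
}
my $with_warnings = @seen;

{
    # Switch off one category, for this block only.
    no warnings 'uninitialized';
    my $text = "distance: $distance";
}
my $after_scoped_off = @seen;

print "warnings with checks on:      $with_warnings\n";
print "new warnings in scoped block: ", $after_scoped_off - $with_warnings, "\n";
print "first warning: $seen[0]";
```

Naming the category and keeping the block small means every other warning stays active for the rest of the program.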

    In this case, your approach of changing the (valid!) variable for a capture, $1, to a literal "1" bears a family-resemblance to turning off warnings: it's an attempt to evade an annoyance without sufficient information about the effect of your change. What that change actually did was deprive you of a crucial clue to the problems.

    After a few other items were changed (partly to accommodate Windows; partly because of my inability to find "LWP 5.8" with ppm; and partly to fix failed declarations like "my $browser...."; the last two caused compile errors when checked with perl -c altavista.pl), the "annoying" warning turned out to be Use of uninitialized value in print at altavista.pl line 53, which is your line 51.

    The greedy .* may be the reason why the capture at your line 49 failed. $1..$9 are special variables for captures. Changing print ($1); to print (1); got rid of a very useful warning by ignoring the attempt to capture data and printing a literal "1" instead, which is not what you were looking for.
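A small sketch of how the greedy and non-greedy forms differ, run against a made-up sample string rather than the real page. Note that on text like this the greedy `.*` does not make the match fail; it leaves only the last digit for the capture group. An undefined `$1` means the match itself failed, which the last part shows using invented apology text.

```perl
use strict;
use warnings;

my $page = 'Driving Distance: 1234 miles';    # made-up sample text

# Greedy .* takes as much as it can, leaving one digit for the group.
my ($greedy) = $page =~ /Driving Distance:.*([1-9][0-9]*|0) miles/;

# Non-greedy .*? stops as early as it can, so the group gets every digit.
my ($lazy) = $page =~ /Driving Distance:.*?([1-9][0-9]*|0) miles/;

print "greedy: $greedy\n";    # prints "greedy: 4"
print "lazy:   $lazy\n";      # prints "lazy:   1234"

# When the text is absent the match fails and nothing is captured.
my $apology = 'The page is temporarily unavailable.';
if ( my ($miles) = $apology =~ /Driving Distance:.*?(\d+) miles/ ) {
    print "distance: $miles miles\n";
}
else {
    print "no distance found; look at the page content instead\n";
}
```

Assigning the match to a list and testing that assignment avoids reading `$1` after a failed match.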

    Bottom line: It's not good practice (and defeats Perl's attempts to help you) to evade "annoying" warnings by fiddling with variables you don't understand or by turning warnings off.