theloanarranger has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I have been working on a program that returns the link count from Google, and I have come across a problem when using simple_request, which is part of LWP::UserAgent. I found an example here (offsite) and used some of the code. Here is my code:
#!/usr/local/bin/perl -w
use strict;
use URI;
use LWP;

foreach my $word (@ARGV) {
    next unless length $word;    # sanity-check
    my ($content, $status, $is_success) = &do_GET($word);
    if (!$is_success) {
        print "Sorry, failed: $status\n";
    } elsif ($content =~ /\sof\sabout\s<b>([\d,]+)<\/b>/) {
        print "$word: $1 matches\n";
    } else {
        print "$word: page not processable\n";
    }
    sleep 2;    #being nice to googles server
}

my $browser;
sub do_GET {
    $browser = LWP::UserAgent->new() unless $browser;
    $browser->agent('Mozilla/5.0');
    my $uri = URI->new('http://www.google.com/search');
    $uri->query_form('q' => $_[0]);
    my $resp = $browser->simple_request (GET $uri);    #THIS IS LINE 34
    return ($resp->content, $resp->status_line, $resp->is_success, $resp) if wantarray;
    return unless $resp->is_success;
    return $resp->content;
}
Here is my error from the script:
TiB17:~/scripts/perl/lwp theloanarranger$ 2-6 friend
Can't locate object method "GET" via package "URI::http" at /Users/theloanarranger/scripts/perl/lwp/2-6 line 34.

Please help,
-Matt

Replies are listed 'Best First'.
Re: screen scraping google
by tachyon (Chancellor) on Jul 15, 2004 at 01:41 UTC

    Just so you know: if you try to scrape Google at high speed you will get throttled. They run some serious firewalling. You may be better off getting an account and using the Web API as suggested. By default this limits you to 1000 requests per day, whereas the standard web interface allows only about 50-100 high-speed requests before you get throttled into oblivion. I forget how long you are in purgatory.

    cheers

    tachyon
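To put the 1000-requests-per-day figure above in perspective, here is a minimal sketch of the arithmetic. The helper name `min_delay` is made up for illustration, and spreading requests evenly over 24 hours is an assumption, not something Google documents.

```perl
use strict;
use warnings;

# Hypothetical helper: seconds to wait between requests so that
# $per_day requests are spread evenly over 24 hours (86,400 seconds).
sub min_delay {
    my ($per_day) = @_;
    return 86_400 / $per_day;
}

printf "%.1f seconds between requests\n", min_delay(1000);
```

Under that assumption, the `sleep 2` in the original script would use up a day's quota in a little over half an hour.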

      I have found that you are further limited if you are behind a proxy server shared by other people who may have also registered and used the Google API.

      I have also noticed that if you perform a simple query (http://www.google.com/search?q=google+api) using LWP, the page returned directs you to the Google terms and conditions. It would appear that they aren't keen on screen scrapers anyway.

        That part is primitive and based on the user-agent string that LWP sends in its headers, identifying itself as libwww-perl/some_version. You will get the expected response if you masquerade as IE, but of course that is immoral, possibly illegal, but as far as I know not fattening.

        use LWP::UserAgent;
        use Data::Dumper;

        my $ua = LWP::UserAgent->new;
        $ua->agent('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; compaq)');
        my $request = HTTP::Request->new( 'GET', 'http://www.google.com/search?q=google+api' );
        my $response = $ua->request( $request );
        print Dumper $response;
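To see the user-agent string in question without sending a request, a minimal sketch like the following prints what LWP would announce by default and what it announces after the override. The exact version number depends on the installed LWP.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# The default string starts with "libwww-perl/" followed by the LWP version.
my $default = $ua->agent;

# Override it, as in the snippet above; nothing is sent over the network here.
$ua->agent('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; compaq)');

print "default: $default\n";
print "now:     ", $ua->agent, "\n";
```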

        If you are interested in the throttling see Throttling Apache or mod_throttle or Google for terms like 'throttle bandwidth shaping'

        cheers

        tachyon

        Rather ironic, since they were built by spiders. :)
Re: screen scraping google
by borisz (Canon) on Jul 15, 2004 at 01:14 UTC
    You need to add
    use HTTP::Request::Common;
    to your code to import the GET function.
    Boris
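The error message suggests that, without the import, perl treated `GET $uri` as a method call on `$uri`, which is why it went looking for GET in URI::http. A minimal sketch of what the import changes, using the search term from the original post and making no network request:

```perl
use strict;
use warnings;
use URI;
use HTTP::Request::Common;    # exports GET (along with HEAD, PUT and POST)

my $uri = URI->new('http://www.google.com/search');
$uri->query_form( q => 'friend' );

# With GET imported, this is an ordinary function call that builds a request object.
my $req = GET $uri;

print ref($req), "\n";       # class of the object GET returned
print $req->method, "\n";    # HTTP method of the request
print $req->uri, "\n";       # URL the request would be sent to
```

The resulting object is what `simple_request` (or `request`) expects as its argument.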
Re: screen scraping google
by beable (Friar) on Jul 15, 2004 at 00:49 UTC
    Hmmm, I get that error too. However, replacing "LINE 34" with this seems to work:

        my $resp = $browser->get($uri);
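For context, here is a sketch of how the original `do_GET` might look with that change. The helper names `search_uri` and `count_from_html` are introduced here for illustration only, and the sample HTML string is made up. One difference worth knowing: `get` follows redirects, while `simple_request` does not.

```perl
use strict;
use warnings;
use URI;
use LWP::UserAgent;

my $browser;

# Build the search URL; query_form takes care of escaping the search term.
sub search_uri {
    my ($word) = @_;
    my $uri = URI->new('http://www.google.com/search');
    $uri->query_form( q => $word );
    return $uri;
}

# Pull the result count out of a page, using the regex from the original post.
sub count_from_html {
    my ($content) = @_;
    return $content =~ /\sof\sabout\s<b>([\d,]+)<\/b>/ ? $1 : undef;
}

# Same return convention as the original do_GET, but using get().
sub do_GET {
    my ($word) = @_;
    $browser ||= LWP::UserAgent->new;
    my $resp = $browser->get( search_uri($word) );
    return ( $resp->content, $resp->status_line, $resp->is_success, $resp ) if wantarray;
    return unless $resp->is_success;
    return $resp->content;
}

# Offline checks only: do_GET is defined but not called, so nothing is fetched.
print search_uri('perl monks'), "\n";
print count_from_html('Results 1 - 10 of about <b>1,234</b> for perl'), "\n";
```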

    Also! May I suggest that you take a look at the WWW::Mechanize module on CPAN?

      Also! May I suggest that you take a look at the WWW::Mechanize module on CPAN?
      Depending on your requirements, you may be better off using the Google Web API via SOAP::Lite. There is an example of this interface in the examples directory of the SOAP::Lite package (in fact, it returns the number of search results, which is what your code is trying to do) and another at http://hacks.oreilly.com/pub/h/170

       

      perl -le "print unpack'N', pack'B32', '00000000000000000000001011101011'"

        Or you could make life even easier and use Net::Google (which abstracts the SOAP::Lite for you).
        my @a=qw(random brilliant braindead); print $a[rand(@a)];