in reply to Re: screen scraping google
in thread screen scraping google

I have found that you are further limited if you are behind a proxy server shared by other people who may have also registered and used the Google API.

I have also noticed that if you perform a simple query (http://www.google.com/search?q=google+api) using LWP, the page returned directs you to the Google terms and conditions. It would appear that they aren't keen on screen scrapers anyway.

Re^3: screen scraping google
by tachyon (Chancellor) on Jul 15, 2004 at 11:01 UTC

    That part is primitive and based on the user-agent string that LWP sends in its headers, identifying itself as libwww-perl/some_version. You will get the expected response if you masquerade as IE, but of course that is immoral, possibly illegal but, as far as I know, not fattening.

    use LWP::UserAgent;
    use Data::Dumper;

    my $ua = LWP::UserAgent->new;
    # Override LWP's default user-agent string with a browser-like one
    $ua->agent('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; compaq)');
    my $request = HTTP::Request->new( 'GET', 'http://www.google.com/search?q=google+api' );
    my $response = $ua->request( $request );
    # Dump the whole response object (status, headers, content) for inspection
    print Dumper $response;

    If you are interested in the throttling, see Throttling Apache or mod_throttle, or Google for terms like 'throttle bandwidth shaping'.

    cheers

    tachyon

Re^3: screen scraping google
by Anonymous Monk on Jul 15, 2004 at 11:50 UTC
    Rather ironic, since they were built by spiders. :)

      Spiders which obey robots.txt, though. All they're asking is that others do the same.

      # http://www.google.com/robots.txt
      User-agent: *
      Disallow: /search
      Disallow: /groups
      Disallow: /images
      Disallow: /catalogs
      Disallow: /catalog_list
      Disallow: /news
      Disallow: /pagead/
      Disallow: /relpage/
      Disallow: /imgres
      Disallow: /keyword/
      Disallow: /u/
      Disallow: /univ/
      Disallow: /cobrand
      Disallow: /custom
      Disallow: /advanced_group_search
      Disallow: /advanced_search
      Disallow: /googlesite
      Disallow: /preferences
      Disallow: /setprefs
      Disallow: /swr
      Disallow: /url
      Disallow: /wml
      Disallow: /hws
      Disallow: /bsd?
      Disallow: /linux?
      Disallow: /mac?
      Disallow: /microsoft?
      Disallow: /unclesam?
      Disallow: /answers/search?q=
      Disallow: /local
      Disallow: /froogle?
      Disallow: /froogle_
      Belden
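
For anyone who wants to see what "obeying robots.txt" amounts to for the /search URL discussed in this thread, here is a minimal sketch using only core Perl. It is a simplification, not a full parser: it assumes only the "User-agent: *" record matters and treats each Disallow value as a plain path prefix. The sub names disallowed_prefixes and path_allowed are hypothetical, made up for this illustration, and the rules are an abbreviated copy of the list quoted above. On CPAN, WWW::RobotRules and LWP::RobotUA handle this more thoroughly.

```perl
use strict;
use warnings;

# Abbreviated copy of the rules quoted above (illustration only).
my $robots_txt = <<'END_ROBOTS';
User-agent: *
Disallow: /search
Disallow: /groups
Disallow: /images
Disallow: /news
END_ROBOTS

# Hypothetical helper: collect the Disallow prefixes from the
# "User-agent: *" record. Other records are ignored (assumption).
sub disallowed_prefixes {
    my ($text) = @_;
    my ( @prefixes, $applies );
    for my $line ( split /\n/, $text ) {
        $line =~ s/#.*//;           # strip comments
        $line =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        if ( $line =~ /^User-agent:\s*(.*)$/i ) {
            $applies = ( $1 eq '*' );
        }
        elsif ( $applies && $line =~ /^Disallow:\s*(\S+)/i ) {
            push @prefixes, $1;
        }
    }
    return @prefixes;
}

# Hypothetical helper: a path is allowed unless it starts with
# one of the disallowed prefixes.
sub path_allowed {
    my ( $path, @prefixes ) = @_;
    for my $prefix (@prefixes) {
        return 0 if index( $path, $prefix ) == 0;
    }
    return 1;
}

my @prefixes = disallowed_prefixes($robots_txt);
for my $path ( '/search?q=google+api', '/intl/en/about.html' ) {
    printf "%-25s %s\n", $path,
        path_allowed( $path, @prefixes ) ? 'allowed' : 'disallowed';
}
```

The second path, /intl/en/about.html, is just an example of a path that none of the listed prefixes match. A well-behaved spider would run a check like this before every request and skip anything reported as disallowed.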