in reply to screen scraping google

Just so you know: if you try to scrape Google at high speed you will get throttled. They run some serious firewalling. You may be better off getting an account and using the Web API as suggested. By default this limits you to 1000 requests per day, but on the standard web interface you are limited to about 50-100 high-speed requests before you get throttled into oblivion. I forget how long you are in purgatory.
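To put the 1000-requests-per-day figure in perspective, here is a rough sketch of pacing requests so a daily quota is spread evenly. The helper names seconds_between_requests and make_pacer are made up for illustration, and nothing here reflects how Google actually measures or enforces its limits.

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep time);   # core module; allows fractional seconds

# Hypothetical helper: seconds to wait between requests so that a daily
# quota is spread evenly over 24 hours.
sub seconds_between_requests {
    my ($requests_per_day) = @_;
    die "quota must be positive\n" unless $requests_per_day > 0;
    return 86_400 / $requests_per_day;
}

# Hypothetical helper: returns a closure. Call it before each request and
# it sleeps just long enough to keep at least $interval seconds between calls.
sub make_pacer {
    my ($interval) = @_;
    my $last;    # time of the previous call, undef until first use
    return sub {
        if ( defined $last ) {
            my $wait = $interval - ( time - $last );
            sleep $wait if $wait > 0;
        }
        $last = time;
        return;
    };
}

my $interval = seconds_between_requests(1000);
printf "%.1f seconds between requests\n", $interval;
```

In a real script you would create the pacer once, e.g. my $pace = make_pacer($interval), and call $pace->() before each request.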

cheers

tachyon

Re^2: screen scraping google
by inman (Curate) on Jul 15, 2004 at 08:21 UTC
    I have found that you are further limited if you are behind a proxy server shared by other people who may have also registered and used the Google API.

    I have also noticed that if you perform a simple query (http://www.google.com/search?q=google+api) using LWP, the page returned directs you to the Google terms and conditions. It would appear that they aren't keen on screen scrapers anyway.

      That part is primitive and based on the user-agent string that LWP sends in its headers, identifying itself as lwp/some_version. You will get the expected response if you masquerade as IE, but of course that is immoral, possibly illegal, but as far as I know not fattening.

      use strict;
      use warnings;
      use LWP::UserAgent;
      use HTTP::Request;
      use Data::Dumper;

      my $ua = LWP::UserAgent->new;
      # Replace LWP's default agent string with a browser-like one
      $ua->agent('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; compaq)');
      my $request  = HTTP::Request->new( 'GET', 'http://www.google.com/search?q=google+api' );
      my $response = $ua->request( $request );
      print Dumper $response;

      If you are interested in the throttling, see Throttling Apache or mod_throttle, or Google for terms like 'throttle bandwidth shaping'.

      cheers

      tachyon

      Rather ironic, since they were built by spiders. :)

        Spiders which obey robots.txt, though. All they're asking is that others do the same.

        # http://www.google.com/robots.txt
        User-agent: *
        Disallow: /search
        Disallow: /groups
        Disallow: /images
        Disallow: /catalogs
        Disallow: /catalog_list
        Disallow: /news
        Disallow: /pagead/
        Disallow: /relpage/
        Disallow: /imgres
        Disallow: /keyword/
        Disallow: /u/
        Disallow: /univ/
        Disallow: /cobrand
        Disallow: /custom
        Disallow: /advanced_group_search
        Disallow: /advanced_search
        Disallow: /googlesite
        Disallow: /preferences
        Disallow: /setprefs
        Disallow: /swr
        Disallow: /url
        Disallow: /wml
        Disallow: /hws
        Disallow: /bsd?
        Disallow: /linux?
        Disallow: /mac?
        Disallow: /microsoft?
        Disallow: /unclesam?
        Disallow: /answers/search?q=
        Disallow: /local
        Disallow: /froogle?
        Disallow: /froogle_
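For anyone wondering how a script would honour rules like the ones listed above, here is a minimal core-Perl sketch that does plain prefix matching on the "User-agent: *" group only. The helper names disallowed_prefixes and path_allowed and the second test path are made up for illustration, and the robots.txt text is a short excerpt of the listing. In real code the WWW::RobotRules or LWP::RobotUA modules are probably a better choice than rolling your own.

```perl
use strict;
use warnings;

# Collect the Disallow prefixes from the group that applies to every
# agent (User-agent: *). Simplification: other agent groups are ignored.
sub disallowed_prefixes {
    my ($robots_txt) = @_;
    my ( @prefixes, $applies );
    for my $line ( split /\r?\n/, $robots_txt ) {
        $line =~ s/#.*//;           # strip comments
        $line =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        next unless length $line;
        if ( $line =~ /^User-agent:\s*(.*)$/i ) {
            $applies = ( $1 eq '*' );
        }
        elsif ( $applies && $line =~ /^Disallow:\s*(.*)$/i ) {
            push @prefixes, $1 if length $1;    # empty Disallow allows everything
        }
    }
    return @prefixes;
}

# A path is disallowed if it starts with any of the collected prefixes.
sub path_allowed {
    my ( $path, @prefixes ) = @_;
    for my $prefix (@prefixes) {
        return 0 if index( $path, $prefix ) == 0;
    }
    return 1;
}

my $robots_txt = <<'END';
# excerpt from http://www.google.com/robots.txt
User-agent: *
Disallow: /search
Disallow: /groups
Disallow: /images
Disallow: /news
END

my @prefixes = disallowed_prefixes($robots_txt);
for my $path ( '/search?q=google+api', '/intl/en/about.html' ) {
    printf "%-25s %s\n", $path,
        path_allowed( $path, @prefixes ) ? 'allowed' : 'disallowed';
}
```

With this excerpt, the /search query from earlier in the thread comes out as disallowed, which is the whole point of the listing.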
        Belden