in reply to Automating a search using perl

Well, you can use LWP::Simple to get the page and HTML::LinkExtor to pull out the links, but http://www.google.com/ has been objecting to people doing this too much, as it can be a waste of resources on their servers. What you can do instead is use the new SOAP interface they have developed. ActiveState has released a module for using that service. See the above link for details.
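
For what it's worth, here's a minimal sketch of the scraping approach for a site that permits it (the URL is a placeholder - substitute a page whose terms allow automated fetching):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use HTML::LinkExtor;

    my $url  = 'http://www.example.com/';    # placeholder URL
    my $html = get($url) or die "Couldn't fetch $url\n";

    my @links;
    my $parser = HTML::LinkExtor->new(
        sub {
            my ( $tag, %attr ) = @_;
            push @links, $attr{href} if $tag eq 'a' and $attr{href};
        },
        $url,    # base URL, so relative links come back absolute
    );
    $parser->parse($html);
    $parser->eof;

    print "$_\n" for @links;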

Cheers,
Ovid

Join the Perlmonks Setiathome Group or just click on the link and check out our stats.

Replies are listed 'Best First'.
Re: Re: Automating a search using perl
by Baz (Friar) on Apr 29, 2002 at 19:57 UTC
    Thanks, I was actually just using that as an example - but just as a matter of interest, how would Google know if someone was automating searches to its engine? Does it check for the IPs that visit Google the most?

      Checking IPs is an iffy solution at best. Since HTTP is essentially stateless, they can't prove that any one individual is necessarily guilty of anything. However, they recently requested that a module be removed from the CPAN for violating their terms of service. I think that's fair, and as a matter of being a good netizen, it's appropriate to respect their restrictions.

      Here's the relevant section from their terms of service (which is why you should use their SOAP interface - it's really easy; there's a rough sketch after the quoted terms).


      No Automated Querying

      You may not send automated queries of any sort to Google's system without express permission in advance from Google. Note that "sending automated queries" includes, among other things:

      • using any software which sends queries to Google to determine how a website or webpage "ranks" on Google for various queries;
      • "meta-searching" Google; and
      • performing "offline" searches on Google.

      Please do not write to Google to request permission to "meta-search" Google for a research project, as such requests will not be granted.
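
      If you do go the SOAP route, it really is only a few lines. A rough sketch, assuming you've downloaded GoogleSearch.wsdl from Google's developer kit and have a license key (the key and query below are placeholders):

          use strict;
          use warnings;
          use SOAP::Lite;
          use Data::Dumper;

          my $key   = 'your-google-api-key';    # placeholder license key
          my $query = 'perlmonks';              # placeholder query

          # GoogleSearch.wsdl ships with Google's Web APIs developer kit
          my $google = SOAP::Lite->service('file:GoogleSearch.wsdl');

          my $result = $google->doGoogleSearch(
              $key, $query,
              0, 10,               # start index, max results
              'false', '',         # filter, restrict
              'false', '',         # safe search, language restrict
              'latin1', 'latin1',  # input/output encodings
          );

          print Dumper($result);   # inspect the returned structure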


      Cheers,
      Ovid

      Join the Perlmonks Setiathome Group or just click on the link and check out our stats.

      There are lots of things sites can do to decide whether they want to block traffic; IPs are only one example (and blocking by IP isn't common unless the admins believe that IP has deliberately attempted to DOS them or in some other way jeopardized their site - e.g., if you try to crawl http://shopping.yahoo.com/ to get all of their product data to build your own shopping portal, they will probably block your IP, whether you are doing it in a very low-intensity way or not).

      More generally, sites can analyze the "signature" of requests to decide whether to block you. By signature I mean anything that makes your requests stand out from the other 99% of their traffic. They might do it based on your User-Agent, or some other HTTP header that is unique to the API you are using, or they might do it based on some combination of things that helps identify people who are being deceitful (if your User-Agent says you're Netscape 6 but you use "HTTP/1.0", that's a dead giveaway; other, more subtle things might be discrepancies between the HTTP headers you send and the headers that Netscape 6 always sends).
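
      One small, practical corollary: if you're writing an LWP client, identify yourself honestly rather than masquerading as a browser. A minimal sketch (the agent string, contact address, and URL are made up):

          use strict;
          use warnings;
          use LWP::UserAgent;
          use HTTP::Request;

          my $ua = LWP::UserAgent->new;
          $ua->agent('MyResearchBot/0.1');     # hypothetical, descriptive agent string
          $ua->from('someone@example.com');    # contact address so admins can reach you

          my $req      = HTTP::Request->new( GET => 'http://www.example.com/' );  # placeholder URL
          my $response = $ua->request($req);
          print $response->status_line, "\n";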

      Bottom line: play nice. If you get blocked, you probably deserved it.