Thanks. I was actually just using that as an example, but as a matter of interest, how would Google know if someone was automating searches to its engine? Does it check for the IPs that visit Google the most?
Checking IPs is an iffy solution at best. Since HTTP is essentially stateless, they can't prove that any one individual is necessarily guilty of anything. However, they recently requested that a module be removed from the CPAN for violating their terms of service. I think that's fair, and as a matter of being a good netizen, it's appropriate to respect their restrictions.
Here's the relevant section from their terms of service (which is why you should use their SOAP interface instead - it's really easy; there's a sketch after the quote).
No Automated Querying

You may not send automated queries of any sort to Google's system without express permission in advance from Google. Note that "sending automated queries" includes, among other things:

- using any software which sends queries to Google to determine how a website or webpage "ranks" on Google for various queries;
- "meta-searching" Google; and
- performing "offline" searches on Google.

Please do not write to Google to request permission to "meta-search" Google for a research project, as such requests will not be granted.
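Since the SOAP interface came up, here's a minimal sketch of what a sanctioned query looks like with SOAP::Lite. It assumes you've registered with Google for a license key and downloaded their GoogleSearch.wsdl; the key string and file path below are placeholders, not working values.

    #!/usr/bin/perl
    # Minimal sketch of a Google SOAP API search, assuming you have a
    # license key from Google and GoogleSearch.wsdl saved locally.
    use strict;
    use warnings;
    use SOAP::Lite;

    my $key    = 'your-google-license-key';           # placeholder
    my $google = SOAP::Lite->service('file:GoogleSearch.wsdl');

    # doGoogleSearch(key, query, start, maxResults, filter, restrict,
    #                safeSearch, lr, ie, oe)
    my $result = $google->doGoogleSearch(
        $key, 'perlmonks', 0, 10,
        'false', '', 'false', '', 'latin1', 'latin1',
    );

    for my $hit (@{ $result->{resultElements} }) {
        print "$hit->{title}\n$hit->{URL}\n\n";
    }

Every request carries your key, which Google meters on their end, so this keeps you inside their terms.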
Cheers,
Ovid
Join the Perlmonks Setiathome Group or just click on the link and check out our stats.
There are lots of things sites can do to decide whether they want to block traffic; IPs are only one example (and IP blocking isn't common unless the admins believe that IP has deliberately attempted to DOS them, or in some other way jeopardized their site - e.g., if you try to crawl http://shopping.yahoo.com/ to get all of their product data to build your own shopping portal, they will probably block your IP, whether you are doing it in a very low-intensity way or not).
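To illustrate the IP angle, here's a small sketch that tallies requests per client IP from an access log so an admin can eyeball the heaviest hitters. The log format (first whitespace-separated field is the client IP, as in common log format) is an assumption; real sites layer much more analysis on top of this.

    #!/usr/bin/perl
    # Sketch: count requests per client IP from a common-format access log.
    # Assumes the first whitespace-separated field of each line is the IP.
    use strict;
    use warnings;

    my %hits;
    while (my $line = <>) {
        my ($ip) = $line =~ /^(\S+)/ or next;
        $hits{$ip}++;
    }

    # Print the busiest IPs first; what counts as "abusive" is site policy.
    for my $ip (sort { $hits{$b} <=> $hits{$a} } keys %hits) {
        printf "%-15s %6d\n", $ip, $hits{$ip};
    }

Run it as "perl tally.pl access.log" and decide for yourself where the line is.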
More generally, sites can analyze the "signature" of requests to decide whether to block you. By signature I mean anything that makes your requests stand out from those of the other 99% of their traffic. They might do it based on your User-Agent, or some other HTTP header that is unique to the API you are using, or they might do it based on some combination of things that helps identify people who are being deceitful (if your User-Agent says you're Netscape 6, but you use "HTTP/1.0", that's a dead giveaway ... other, more subtle things might be discrepancies between the HTTP headers you send and the headers that Netscape 6 ALWAYS sends).
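To make the fingerprinting idea concrete, here's a sketch of the kind of mismatch check described above. The specific rules (that a real Netscape 6 speaks HTTP/1.1 and always sends an Accept-Language header) are illustrative assumptions, not a real fingerprint database.

    #!/usr/bin/perl
    # Sketch of a "signature" mismatch check, per the discussion above.
    # The rules below are illustrative assumptions, not real fingerprints.
    use strict;
    use warnings;

    sub looks_deceitful {
        my %req = @_;    # protocol plus header-name => value pairs
        my $ua  = $req{'User-Agent'} || '';

        # Netscape 6 is an HTTP/1.1 client, so "Netscape6 over HTTP/1.0"
        # is the dead giveaway mentioned above.
        return 'claims Netscape 6 but speaks HTTP/1.0'
            if $ua =~ /Netscape6/ && $req{protocol} eq 'HTTP/1.0';

        # Subtler: a header the claimed browser always sends is missing
        # (assumed example: Accept-Language).
        return 'claims Netscape 6 but omits Accept-Language'
            if $ua =~ /Netscape6/ && !exists $req{'Accept-Language'};

        return;    # nothing suspicious
    }

    my $why = looks_deceitful(
        protocol     => 'HTTP/1.0',
        'User-Agent' => 'Mozilla/5.0 (Windows; U) Netscape6/6.2',
    );
    print "flag for blocking: $why\n" if $why;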
Bottom line: play nice. If you get blocked, you probably
deserved it.