in reply to How to stop web interface bypassing?

As others have alluded to, there are basically only two things you can do to prevent people from programmatically accessing your site:

1) Ask them not to. Your ToS is how you ask the user not to do this, and robots.txt is how you ask the program itself to pass over (parts of) your site (a sample robots.txt appears after this list). But you have no guarantee that the user will read the ToS or that the program will read robots.txt, and even if they are read, they may be ignored.

2) Threaten to sue anyone who doesn't use the site in the way that you prefer. While this can be very effective at preventing automated use of your site (if you have the money to spend on lawyers), it is generally more effective at preventing manual use, as many of us prefer not to deal with litigation-happy sites.
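
To expand briefly on item 1: robots.txt is just a plain text file served from the root of your site. The paths and delay below are only placeholders, and the non-standard Crawl-delay line is honored by some crawlers and ignored by others:

    User-agent: *
    Disallow: /cgi-bin/
    Crawl-delay: 5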

Automated use is not the actual problem you're facing anyhow, so you would do better to accept it and deal with the real problems: invalid data and high server load.

For the first, you must, must, must validate the data received from the client. Rule one of designing a networked application is to assume that the other end of the connection may be lying to you, and to clean up, sanity-check, and otherwise validate all received data. This isn't even just for websites: there's a long history of networked games, from Doom and Quake to the latest MMOs, that have had massive problems with cheating because they foolishly trusted the user's software, probably in the belief that they had a "secret" protocol which only one other program (the game client) knew how to speak. HTTP is very simple and very well-known; writing a dishonest HTTP client is trivial.
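
For illustration, here's a minimal sketch of that rule in a CGI handler. The parameter names ('item_id', 'quantity') and the limits are invented for the example, not taken from the original question:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use CGI;

    my $q   = CGI->new;
    my $id  = $q->param('item_id');
    my $qty = $q->param('quantity');

    # Never trust what the client sent, even if your own form "couldn't"
    # have produced it -- anyone can hand-craft the request.
    unless (defined $id  && $id  =~ /^\d+$/
         && defined $qty && $qty =~ /^\d{1,3}$/ && $qty > 0) {
        print $q->header(-status => '400 Bad Request'), "Invalid input\n";
        exit;
    }

    print $q->header('text/plain'), "Order accepted: item $id, qty $qty\n";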

There are several options for dealing with load issues: limiting the number of server processes allowed to spawn at any one time, caching results and returning them as static pages instead of reprocessing the data on every request, blocking IP addresses that issue too many requests too quickly, and many other things in between. Or you could just ask the users who are writing robots for your site to please configure them to issue no more than, say, one request every 5 seconds. Which option is best for you is highly situation-dependent.
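
As a rough sketch of the IP-blocking option (not from the original post; the dbm file, the one-minute window, and the 10-request threshold are all made up, and a real setup would also want file locking and periodic cleanup of old buckets):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use CGI;
    use Fcntl;          # for O_RDWR, O_CREAT
    use SDBM_File;

    my $q  = CGI->new;
    my $ip = $ENV{REMOTE_ADDR} || 'unknown';

    # Count hits per IP in one-minute buckets, kept in a small dbm file.
    tie my %hits, 'SDBM_File', '/tmp/hitcount', O_RDWR|O_CREAT, 0644
        or die "cannot tie hit counter: $!";

    my $key = $ip . ':' . int(time() / 60);
    $hits{$key} = ($hits{$key} || 0) + 1;

    if ($hits{$key} > 10) {    # arbitrary threshold for the example
        print $q->header(-status => '503 Service Unavailable');
        exit;
    }

    # ... otherwise handle the request as usual ...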
