As others have alluded to, there are basically only two things you can do to prevent people from programmatically accessing your site:

1) Ask them not to. Your ToS is a means of asking the user not to do this, and robots.txt asks the program itself to pass over (parts of) your site (a minimal robots.txt sketch follows this list). But you have no guarantee that the user will read the ToS or that the program will read robots.txt and, even if they are read, they may be ignored.

2) Threaten to sue anyone who doesn't use the site in the way that you prefer. While this can be very effective at preventing automated use of your site (if you have the money to spend on lawyers), it is generally more effective at preventing manual use, as many of us prefer not to deal with litigation-happy sites.
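
To make the robots.txt option in point 1 concrete, here is a minimal sketch. The paths are hypothetical; a well-behaved robot will honour it, a badly-behaved one simply won't:

    # robots.txt, served from the top level of the site
    User-agent: *
    # hypothetical dynamic area you want robots to skip
    Disallow: /cgi-bin/
    # Crawl-delay is a non-standard extension, but some crawlers honour it
    Crawl-delay: 5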

Automated use is not the actual problem you're facing anyhow, so you would do better to just accept it and deal with the real problems, namely invalid data and high server load.

For the first, you must, must, must validate the data received from the client. Rule one of designing a networked application is to assume that the other end of the connection may be lying to you, and to clean up, sanity-check, and otherwise validate all received data. This isn't even just for websites - there's a long history of networked games, from Doom and Quake to the latest MMOs, that have had massive problems with cheating because they foolishly trusted the user's software, probably in the belief that they had a "secret" protocol which only one other program (the game client) knew how to speak. HTTP is very simple and very well-known; writing a dishonest HTTP client is trivial.
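
As a rough illustration (the parameter names and patterns here are hypothetical; the point is that the checks happen on the server, regardless of what your form or any JavaScript claims to enforce):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use CGI;

    my $q = CGI->new;

    # Never assume these came from your own form; any HTTP client can send anything.
    my $quantity = $q->param('quantity') // '';
    my $email    = $q->param('email')    // '';

    # Whitelist what you will accept rather than blacklisting what you won't.
    unless ($quantity =~ /^\d{1,4}$/ && $quantity > 0) {
        die "Invalid quantity\n";    # in real code, return a proper error page
    }
    unless ($email =~ /^[^\s\@]+\@[^\s\@]+\.[^\s\@]+$/) {
        die "Invalid email address\n";
    }

    # Only now act on the data - and still use placeholders for any SQL.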

There are several options for dealing with load issues, ranging from limiting the number of server processes allowed to spawn at any one time, to caching results and returning them as static pages instead of reprocessing the data on every request, to blocking IP addresses that issue too many requests too quickly, with many other things in between. Or you could just ask the users who are writing robots for your site to please configure the robots to issue no more than, say, one request every 5 seconds. Which option is best for you is highly situation-dependent.
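
As one deliberately crude sketch of the IP-throttling option (it assumes a persistent environment such as mod_perl or a PSGI server, where the hash survives between requests; plain CGI would need shared storage such as a DBM file or a database table instead):

    use strict;
    use warnings;

    # Remember the last request time per client address.
    my %last_hit;
    my $min_interval = 5;    # seconds required between requests from one IP

    sub too_fast {
        my ($ip) = @_;
        my $now = time;
        if ( exists $last_hit{$ip} && $now - $last_hit{$ip} < $min_interval ) {
            return 1;        # too soon - reject or delay this request
        }
        $last_hit{$ip} = $now;
        return 0;
    }

    # In the request handler, something like:
    # if ( too_fast( $ENV{REMOTE_ADDR} ) ) {
    #     print "Status: 503 Service Unavailable\r\n\r\n";
    #     exit;
    # }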

