in reply to Thwarting Screen Scrapers

I'm curious if anyone has any experience in protecting a web-based interface from being "front-ended" by others for their own gain.

I've spent a fair amount of time on the other end of this problem, working out how to co-navigate web pages that are protected in some way. (This was for a customer service application, where customers' support organizations had to work around roadblocks of the type you're looking to set up, put in place internally by the web side of their own organizations.)

If you're willing to put some work in on the back end, one way of throwing a spanner in the works of anyone who is hijacking your form submission process without your noticing is to do the following:

When a form is submitted, it's a simple matter to

This leaves the exploiters in the position of either having to come to your site to get a form, or trying to guess your secret key. If they have to come to your site to get a form, you can track and ban them. If your submission form is framed, you can do an automated check using your weblogs for form submissions that aren't matched to a fetch of the framing page. This isn't 100% accurate, but the repeat abuser is who you're looking for.
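The steps themselves aren't spelled out above, so what follows is only one reading of the scheme: hand out a per-form ID together with a keyed hash of that ID, and refuse any submission whose hash doesn't check out. It's a minimal sketch in Python rather than a finished implementation. `SECRET_KEY`, `issue_form_token`, and `is_valid_submission` are names made up for illustration, and HMAC-SHA256 is my choice of keyed hash, not something the scheme requires.

```python
import hashlib
import hmac
import secrets

# Assumption: a server-side secret that never appears in any page you serve.
SECRET_KEY = b"replace-with-a-long-random-secret"


def _digest(form_id):
    """Keyed hash of a form ID (HMAC-SHA256, chosen here for illustration)."""
    return hmac.new(SECRET_KEY, form_id.encode("utf-8"), hashlib.sha256).hexdigest()


def issue_form_token():
    """Generate a fresh form ID and its keyed hash when serving a form.

    The pair would be embedded in the form's action URL or in hidden fields.
    """
    form_id = secrets.token_hex(16)
    return form_id, _digest(form_id)


def is_valid_submission(form_id, digest):
    """Check that a submitted ID/hash pair was produced with our secret key."""
    expected = _digest(form_id)
    # Compare as bytes, in constant time, so odd input can't raise or leak timing.
    return hmac.compare_digest(expected.encode("ascii"), digest.encode("utf-8"))
```

Someone who never fetched a form from you has no valid pair to submit, and making one up means guessing the key.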

[1] You could put the hash into a hidden field instead. I recall there being some reason why having it be part of the URL was advantageous, but don't remember specifics. It might have had to do with getting it into the weblogs for later processing.

Re: Re: Thwarting Screen Scrapers
by kschwab (Vicar) on Jul 18, 2002 at 17:27 UTC
    Thanks...this is the kind of input I was looking for.

    Obviously any measure has a countermeasure, and if it works in a browser, it would work in LWP (or some other interface).

    The addition of a timestamp into the hash calculation is an interesting one.

    We've already worked out a method of using dynamically generated form field names from a hash of the session key. Adding the timestamp perturbs it a bit, and keeps someone from keeping a session alive over a long period of time.

    dws++...thanks again.

      The addition of a timestamp into the hash calculation is an interesting one.

      Interesting, but not what I intended to suggest. Using a timestamp when generating the hash needlessly complicates verification.

      What I meant to suggest was that you save a timestamp when you record generated IDs. This gives you an easy way to "time out" forms, and flush abandoned forms out of your back-end database. It also sets you up for doing some analysis on things like average submit time (the gap between your generating the form, and a user submitting it). A really low submit time is an indication that there's a bot on the other end of the line.
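
The timestamp bookkeeping described above might look roughly like this. It's a sketch under assumptions: an in-memory dict stands in for the back-end database, and the `FormLog` class, its method names, and the thresholds (30 minutes to expire, 2 seconds as "too fast") are all invented for illustration and would need tuning against real traffic.

```python
import time


class FormLog:
    """Records when each form ID was issued (hypothetical helper)."""

    def __init__(self, max_age=1800.0, min_submit_seconds=2.0):
        self.max_age = max_age                        # seconds before a form times out
        self.min_submit_seconds = min_submit_seconds  # faster than this looks like a bot
        self.issued = {}                              # form_id -> time issued

    def record_issue(self, form_id, now=None):
        """Save a timestamp alongside a freshly generated form ID."""
        self.issued[form_id] = time.time() if now is None else now

    def check_submission(self, form_id, now=None):
        """Classify a submission; each form ID is accepted at most once."""
        now = time.time() if now is None else now
        issued_at = self.issued.pop(form_id, None)
        if issued_at is None:
            return "unknown"      # never issued, already used, or flushed
        elapsed = now - issued_at
        if elapsed > self.max_age:
            return "expired"      # the form timed out
        if elapsed < self.min_submit_seconds:
            return "suspicious"   # a really low submit time suggests a bot
        return "ok"

    def flush_abandoned(self, now=None):
        """Drop forms that were issued but never submitted; return the count."""
        now = time.time() if now is None else now
        stale = [fid for fid, t in self.issued.items() if now - t > self.max_age]
        for fid in stale:
            del self.issued[fid]
        return len(stale)
```

Keeping the elapsed times around, rather than only the verdicts, is what would let you work out the average submit time later.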