Re: Thwarting Screen Scrapers

I'm curious if anyone has an experience in protecting a web-based interface from being "front-ended" by others for their own gain.

I've spent a fair amount of time on the other end of this problem, working out how to co-navigate web pages that are protected in some way. (This was for a customer service application, where customers' support organizations had to work around roadblocks of the type you're looking to set up, put in place internally by the web side of their own organizations.)

If you're willing to put some work in on the back end, one way of throwing a spanner in the works of anyone who is hijacking your form submission process without your noticing is to do the following:

1. When you generate the form, allocate an ID and record it on the backend (e.g., in a database, along with a timestamp that you can use to time the form out).
2. Generate an MD5 hash based on the ID and some secret key known only to your application. Add this hash as an argument to the form action URL¹ (i.e., add "?key=$hash").
3. Put the ID into the form in a hidden field.
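
The three steps above can be sketched roughly as follows. This is a minimal illustration in Python rather than a definitive implementation: `SECRET_KEY`, `issued_forms`, `make_hash`, `render_form`, and the `/submit` action are all made-up names, and the in-memory dict merely stands in for the backend database. It uses plain MD5 over ID plus secret as described; Python's `hmac` module would be a sturdier way to build the same kind of keyed hash.

```python
import hashlib
import secrets
import time
from html import escape

SECRET_KEY = "change-me"  # hypothetical secret, known only to the application
issued_forms = {}         # form ID -> issue timestamp; stands in for the database


def make_hash(form_id, secret=SECRET_KEY):
    """MD5 over the ID plus the secret key, as described in the steps above."""
    return hashlib.md5((form_id + secret).encode("utf-8")).hexdigest()


def render_form(action="/submit"):
    """Allocate an ID, record it with a timestamp, and return the form HTML."""
    form_id = secrets.token_hex(8)       # step 1: allocate an ID
    issued_forms[form_id] = time.time()  # step 1: record it with a timestamp
    key = make_hash(form_id)             # step 2: hash of ID and secret
    return (
        f'<form method="post" action="{escape(action)}?key={key}">\n'
        f'  <input type="hidden" name="id" value="{form_id}">\n'  # step 3
        f'  <input type="submit" value="Send">\n'
        f'</form>'
    )
```

Calling `render_form()` once per page view gives every visitor a fresh ID/key pair, which is what lets the server refuse replays later.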

When a form is submitted, it's a simple matter to:

1. Check to see if the ID has been used already. This prevents them from grabbing one legitimate key/ID pair and reusing it.
2. Check to see if the ID has expired (if you care).
3. Generate a new hash based on the ID and your secret key, and compare it to the one in param('key').
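
A rough sketch of those submission checks, again in Python and under stated assumptions: the record layout (`issued` timestamp plus `used` flag), the 15-minute `FORM_LIFETIME`, and the function names are invented for illustration, and the dict stands in for whatever table the IDs were recorded in.

```python
import hashlib
import hmac
import time

SECRET_KEY = "change-me"  # hypothetical secret, same one used to build the form
FORM_LIFETIME = 15 * 60   # assumed time-out, in seconds
issued_forms = {}         # form ID -> {"issued": timestamp, "used": bool}


def make_hash(form_id, secret=SECRET_KEY):
    """MD5 over the ID plus the secret key."""
    return hashlib.md5((form_id + secret).encode("utf-8")).hexdigest()


def check_submission(form_id, key, now=None):
    """Return "ok" or a short reason for rejecting the submission."""
    now = time.time() if now is None else now
    record = issued_forms.get(form_id)
    if record is None:
        return "unknown id"       # never issued by this site
    if record["used"]:
        return "id already used"  # blocks replay of a legitimate pair
    if now - record["issued"] > FORM_LIFETIME:
        return "expired"
    expected = make_hash(form_id)
    # Constant-time comparison of the recomputed hash with the submitted key.
    if not hmac.compare_digest(expected.encode("utf-8"), key.encode("utf-8")):
        return "bad key"
    record["used"] = True         # consume the ID only on success
    return "ok"
```

In this sketch a failed key check does not consume the ID, so a bad guess by a third party cannot invalidate a form that a real user still has open.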

This leaves the exploiters in the position of either having to come to your site to get a form, or trying to guess your secret key. If they have to come to your site to get a form, you can track and ban them. If your submission form is framed, you can do an automated check using your weblogs for form submissions that aren't matched to a fetch of the framing page. This isn't 100% accurate, but the repeat abuser is who you're looking for.
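
The log check could look something like the sketch below. It assumes the weblog has already been parsed into chronological `(client, method, path)` tuples; that format, the two paths, and the function name are illustrative assumptions, not anything a particular web server produces.

```python
from collections import Counter

FRAME_PATH = "/order.html"  # hypothetical page that frames the form
SUBMIT_PATH = "/submit"     # hypothetical form action


def unmatched_submitters(entries, min_unmatched=1):
    """Count, per client, form submissions with no earlier framing-page fetch.

    `entries` is an iterable of (client, method, path) tuples in time order.
    """
    frame_fetches = Counter()
    unmatched = Counter()
    for client, method, path in entries:
        path = path.split("?", 1)[0]  # ignore the ?key=... argument
        if method == "GET" and path == FRAME_PATH:
            frame_fetches[client] += 1
        elif method == "POST" and path == SUBMIT_PATH:
            if frame_fetches[client] > 0:
                frame_fetches[client] -= 1  # pair the submit with a fetch
            else:
                unmatched[client] += 1      # submit with nothing to pair it to
    return {c: n for c, n in unmatched.items() if n >= min_unmatched}
```

Raising `min_unmatched` is one way to ignore the occasional false positive (proxies, cached pages) and surface only the repeat abusers.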

¹You could put the hash into a hidden field instead. I recall there being some reason why having it be part of the URL was advantageous, but don't remember specifics. It might have had to do with getting it into the weblogs for later processing.

Re: Re: Thwarting Screen Scrapers by kschwab (Vicar) on Jul 18, 2002 at 17:27 UTC
Thanks...this is the kind of input I was looking for. Obviously any type of measure has a countermeasure, and if it works in a browser, it would work in LWP (or some other interface). The addition of a timestamp into the hash calculation is an interesting one. We've already worked out a method of using dynamically generated form field names from a hash of the session key. Adding the timestamp perturbs it a bit, and keeps someone from keeping a session alive over a long period of time. dws++...thanks again.
Re: Re: Re: Thwarting Screen Scrapers by dws (Chancellor) on Jul 18, 2002 at 17:40 UTC
"The addition of a timestamp into the hash calculation is an interesting one."

Interesting, but not what I intended to suggest. Using a timestamp when generating the hash needlessly complicates verification. What I meant to suggest was that you save a timestamp when you record generated IDs. This gives you an easy way to "time out" forms, and flush abandoned forms out of your back-end database. It also sets you up for doing some analysis on things like average submit time (the gap between your generating the form, and a user submitting it). A really low submit time is an indication that there's a bot on the other end of the line.
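
As a rough illustration of what the stored timestamp buys you, here is a small Python sketch. The 15-minute lifetime, the 2-second bot threshold, and all function names are assumptions chosen for the example; sensible values depend on the form.

```python
import statistics

FORM_LIFETIME = 15 * 60  # assumed time-out, in seconds
BOT_THRESHOLD = 2.0      # assumed: faster than this looks automated


def flush_abandoned(issued, now):
    """Delete IDs issued more than FORM_LIFETIME seconds ago; return the count."""
    stale = [fid for fid, ts in issued.items() if now - ts > FORM_LIFETIME]
    for fid in stale:
        del issued[fid]
    return len(stale)


def submit_time(issued, form_id, submitted_at):
    """Gap between generating the form and receiving its submission."""
    return submitted_at - issued[form_id]


def looks_automated(gap):
    """A really low submit time suggests a bot on the other end."""
    return gap < BOT_THRESHOLD


def average_submit_time(gaps):
    """Mean submit time over a list of observed gaps."""
    return statistics.mean(gaps)
```

Here `issued` maps each form ID to the time it was generated, so no timestamp has to go into the hash itself.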