Yes, they can -- but if someone's scraping the site, they'd have been referred by the site in question to get to the image.
Checking HTTP_REFERER is for those cases when someone from another website decides to link directly to an image (and/or page) on your site. Back in the early days of HTTP (i.e., 0.9, before there was such a thing as HTTP_REFERER), it was common for people to link to our imagemap and counter CGIs that ran on the server I maintained -- they didn't care, and there was no real way to stop them.
Likewise, people would find an image they liked (a bullet, some animated gif, whatever) and would link directly to it, sucking down your bandwidth. (The university where I worked only had a T1 in 1994.)
These days, however, when people check HTTP_REFERER, it's not to stop bots -- it's to stop other sites from linking directly to the images, so that visitors to those sites use someone else's bandwidth. Because the sites doing the hotlinking have no control over their visitors' browsers, checking HTTP_REFERER can be a very effective way to cut down on abuse -- however, as not all browsers send HTTP_REFERER, you have to make sure that the null case is to allow the download.
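Here's roughly what that check looks like, as a minimal Python sketch (not the code we actually ran; it assumes a WSGI-style environ dict, and the ALLOWED_HOSTS set is just a placeholder for your own hostnames):

    from urllib.parse import urlparse

    # Hypothetical list of hostnames that are allowed to embed our images.
    ALLOWED_HOSTS = {"example.com", "www.example.com"}

    def referer_allows_download(environ):
        """Return True if the image request should be served."""
        referer = environ.get("HTTP_REFERER", "").strip()
        if not referer:
            # The null case: many browsers and proxies send no referer at all,
            # so an absent or empty header must allow the download.
            return True
        host = (urlparse(referer).hostname or "").lower()
        # Only refuse when the referer clearly names somebody else's site.
        return host in ALLOWED_HOSTS

The important design choice is the default: an unknown or missing referer is allowed, and only a referer that plainly points at another site gets refused.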
...
I'm also surprised that no one's mentioned checking X_FORWARDED_FOR to detect proxies (which should have identified the issue with AOL, as well as Squid and quite a few other proxies) ... there were also some proposals floating about for changing the robot exclusion standards to specify rate limiting and visiting hours, but it's been a decade and I've never seen any widespread support for them.
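For what it's worth, picking the original client address out of that header looks something like this (again just a sketch against a WSGI-style environ; the header can be spoofed, so treat it as a hint for rate limiting, not as authentication):

    def effective_client_ip(environ):
        """Return the address to rate-limit on: the proxied client if a
        proxy such as Squid added X-Forwarded-For, else the socket peer."""
        forwarded = environ.get("HTTP_X_FORWARDED_FOR", "")
        if forwarded:
            # The left-most entry is nominally the originating client;
            # later entries are the proxies the request passed through.
            return forwarded.split(",")[0].strip()
        return environ.get("REMOTE_ADDR", "")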
In reply to Re^3: blocking site scrapers by jhourcle
in thread blocking site scrapers by Anonymous Monk