in reply to Re: User tracking
in thread User tracking


I am trying to distinguish bots from people.
It's quite hard to do because of the flexibility of libraries such as libwww...
And the presence of JS can't be the main sign.

Re^3: User tracking
by CountZero (Bishop) on May 18, 2009 at 08:57 UTC
    Interesting ...

    And how would knowing the time someone stays on a page help you in determining whether it is a bot or a human who accessed the page?

    And even more important: why do you need to know this? Do you want to refuse access to bots? Then include a robots.txt file on your server. "Bad bots" will not be stopped, of course, but as far as I know no other technology will be able to stop them either, provided the bad bots are equipped with a modicum of intelligence.
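
    For reference, a minimal robots.txt (served from the document root) that asks all crawlers to stay away entirely might look like this; well-behaved bots will honor it, bad ones will simply ignore it:

        User-agent: *
        Disallow: /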

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re^3: User tracking
by dsheroh (Monsignor) on May 18, 2009 at 09:56 UTC
    Checking the User-Agent header generally works pretty well, as does robots.txt.
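
    As an illustrative sketch (the signature list here is a made-up sample, not an authoritative one), a CGI script might check the User-Agent header like so:

        #!/usr/bin/perl
        use strict;
        use warnings;

        # A few self-identified crawler signatures; extend to taste.
        my $bot_re = qr/Googlebot|Slurp|msnbot|Baiduspider|libwww-perl|Wget|curl/i;

        my $ua = $ENV{HTTP_USER_AGENT} || '';
        print "Content-type: text/plain\n\n";
        if ($ua =~ $bot_re) {
            print "Self-identified bot: $ua\n";
        }
        else {
            print "Looks like a browser (or a bot pretending to be one).\n";
        }

    Of course this only catches bots that tell the truth about themselves, which is exactly the limitation noted above.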

    If you actually need to identify (rogue) bots which use browser UA strings and ignore robots.txt, your best chance would be to look at the patterns in the timestamps for when pages are requested:

    • Humans will generally either open one page at a time, making single requests at wildly irregular intervals (possibly interspersed with HEAD requests when they use the "back" button), or open everything in tabs, producing flurries of several requests within a few seconds followed by generally longer intervals of few-or-no requests.
    • Bots will tend to request pages at a relatively steady rate - even if they add randomness to their delay, it's rarely more than half the base interval - and often faster than a human would browse (a rough sketch of detecting this follows below).
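
    As a rough sketch of the interval idea - assuming Apache's common log format and a hypothetical cutoff of 0.5 on the coefficient of variation, neither of which comes from the post above - something like this would flag clients with suspiciously regular request gaps:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Time::Local;

        # Month abbreviation -> 0-based month number, for timegm().
        my %mon;
        @mon{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = (0 .. 11);

        my %times;    # client IP => list of request epochs

        while (<>) {
            # Common log format: IP - - [dd/Mon/yyyy:HH:MM:SS zone] "..."
            next unless m{^(\S+) \S+ \S+ \[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+)};
            my ($ip, $d, $m, $y, $H, $M, $S) = ($1, $2, $3, $4, $5, $6, $7);
            push @{ $times{$ip} }, timegm($S, $M, $H, $d, $mon{$m}, $y);
        }

        for my $ip (sort keys %times) {
            my @t = sort { $a <=> $b } @{ $times{$ip} };
            next if @t < 10;    # too few requests to judge
            my @gaps = map { $t[$_] - $t[$_ - 1] } 1 .. $#t;
            my $mean = 0;
            $mean += $_ / @gaps for @gaps;
            my $var = 0;
            $var += ($_ - $mean) ** 2 / @gaps for @gaps;
            my $cv = $mean ? sqrt($var) / $mean : 0;
            # A low coefficient of variation means a steady, bot-like rate.
            printf "%-15s %4d hits, mean gap %6.1fs, cv %.2f%s\n",
                $ip, scalar @t, $mean, $cv, $cv < 0.5 ? "  <-- bot-like?" : "";
        }

    Run it as perl intervals.pl access_log. Both the 0.5 threshold and the minimum request count are guesses to be tuned against your own traffic.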
    Don't rely on JavaScript to make your determination. Some of us use the NoScript plugin, which blocks JavaScript from running unless it comes from a whitelisted site, but we're still not bots.

    Anyhow, what are you actually trying to accomplish by identifying what's a bot and what isn't?

Re^3: User tracking
by ig (Vicar) on May 18, 2009 at 11:02 UTC

    You might find the visualization presented in O'Reilly's A New Visualization for Web Server Logs interesting. In some cases, automated access will stand out quite clearly, and it may help you determine what criteria you want to use if you want to automate detection.
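
    If you want a quick look yourself, a one-liner along these lines (again assuming the common log format; the file names are placeholders) turns a log into plottable points - seconds into the day versus a per-client index - so steady, bot-like traffic shows up as an evenly spaced row of dots:

        perl -lne 'if (m{^(\S+) .*?\[\d+/\w+/\d+:(\d+):(\d+):(\d+)}) {
            $i{$1} ||= ++$n;
            print 3600*$2 + 60*$3 + $4, " $i{$1}";
        }' access_log > points.dat

    Feed points.dat to gnuplot (plot "points.dat") or a spreadsheet and eyeball it.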