gemoroy has asked for the wisdom of the Perl Monks concerning the following question:

Hi, monks! Maybe this is not the place for this question to be posted, but are there ways of tracking the time a user stays on a page, other than sessions and sending a query to the DB from AJAX?

Replies are listed 'Best First'.
Re: User tracking
by CountZero (Bishop) on May 18, 2009 at 07:57 UTC
    The server will normally never be informed when the user "leaves" (for any definition of "leave") the page. Usually I have several tabs open within my browser. Am I "staying" on all these pages all of the time? What if I open another browser? Have I then "left" all the pages in the previous browser?

    The best you can hope to achieve with a little bit of JavaScript is to get notified when the user "closes" your page, but other than telling you that the page is no longer showing on the user's system, such a message has --IMHO-- nothing significant to tell you.

    What are you really trying to achieve?

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James


      I am trying to distinguish bots from people.
      It's quite hard to do because of the flexibility of libraries such as libwww...
      And the presence of JS can't be the main indicator.
        Interesting ...

        And how would knowing the time someone stays on a page help you in determining whether it is a bot or a human who accessed the page?

        And even more important: why do you need to know this? Do you want to refuse access to bots? Then include a robots.txt file on your server. "Bad bots" will not be stopped, of course, but as far as I know no other technology will be able to do so, provided the bad bots are equipped with a modicum of intelligence.
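
        For the compliant crawlers a couple of lines of robots.txt is all it takes; a minimal example (the /private/ path is hypothetical):

            User-agent: *
            Disallow: /private/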

        CountZero

        "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

        Checking user-agent generally does work pretty well, as does robots.txt.
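
        In a CGI setting that check can be as small as the following sketch (the pattern is only a rough example, and of course trivially spoofed by a bot that sends a browser UA string):

            # User agent as passed to a CGI script by the web server
            my $ua = $ENV{HTTP_USER_AGENT} || '';
            my $looks_like_bot = $ua =~ /bot|crawl|spider|slurp/i;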

        If you actually need to identify (rogue) bots which use browser UA strings and ignore robots.txt, your best chance would be to look at the patterns in the timestamps for when pages are requested (see the sketch after this list):

        • Humans will generally either open one page at a time, making single requests at wildly irregular intervals (possibly interspersed with HEAD requests when they use the "back" button), or open everything in tabs, producing flurries of several requests within a few seconds followed by generally longer intervals of few-or-no requests.
        • Bots will tend to request pages at a relatively steady rate - even if they have randomness in their delay, it's rarely more than half the base interval - and often quicker than a human would.
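
        A rough sketch of that interval check, reading an Apache combined-format log; the five-hit minimum and the 30-second/regularity thresholds are made-up examples, not tested cut-offs:

            use strict;
            use warnings;
            use List::Util qw(sum);
            use Time::Local qw(timelocal);

            my %mon = (Jan=>0,Feb=>1,Mar=>2,Apr=>3,May=>4,Jun=>5,
                       Jul=>6,Aug=>7,Sep=>8,Oct=>9,Nov=>10,Dec=>11);

            # Collect request timestamps (epoch seconds) per client IP.
            my %times;
            while (my $line = <>) {
                next unless $line =~ m{^(\S+) .*?\[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+)};
                my ($ip, $d, $mo, $y, $h, $mi, $s) = ($1, $2, $3, $4, $5, $6, $7);
                push @{ $times{$ip} }, timelocal($s, $mi, $h, $d, $mon{$mo}, $y);
            }

            # Flag IPs whose gaps between requests are short and suspiciously even.
            for my $ip (keys %times) {
                my @t = sort { $a <=> $b } @{ $times{$ip} };
                next if @t < 5;                                 # too few hits to judge
                my @gaps = map { $t[$_] - $t[$_ - 1] } 1 .. $#t;
                my $mean = sum(@gaps) / @gaps;
                my $sd   = sqrt( sum(map { ($_ - $mean)**2 } @gaps) / @gaps );
                printf "%s looks automated (mean gap %.1fs)\n", $ip, $mean
                    if $mean < 30 && $sd < $mean / 2;
            }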
        Don't rely on JavaScript to make your determination. Some of us use the NoScript plugin, which blocks JavaScript from running unless it comes from a whitelisted site, but we're still not bots.

        Anyhow, what are you attempting to accomplish by identifying what's a bot and what isn't?

        You might find the visualization presented in O'Reilly's A New Visualization for Web Server Logs interesting. In some cases, automated access will stand out quite clearly, and it may help you determine what criteria you want to use if you want to automate detection.

Re: User tracking
by ELISHEVA (Prior) on May 18, 2009 at 11:49 UTC

    There are two techniques I've been using recently to identify automated visitors and nasties.

    The first is simply to do a reverse DNS lookup. The legitimate bots (MSN, Google, Yahoo) have reverse DNS entries for their crawlers' IP addresses and often include the word "bot" in the name (e.g. googlebot). That is the easy way.
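
    A minimal sketch of that lookup using only core Perl; the IP is just an example, and the domain suffixes are the usual ones for the Google, MSN and Yahoo crawlers:

        use strict;
        use warnings;
        use Socket qw(inet_aton AF_INET);

        my $ip   = '66.249.66.1';    # example address
        my $name = gethostbyaddr(inet_aton($ip), AF_INET);

        if (defined $name && $name =~ /(?:googlebot\.com|search\.msn\.com|crawl\.yahoo\.net)$/) {
            print "$ip reverse-resolves to $name - looks like a known crawler\n";
        }
        else {
            print "$ip: ", defined $name ? $name : 'no reverse DNS at all', "\n";
        }

    A forward lookup on the returned name (gethostbyname) to check that it maps back to the same IP guards against faked PTR records.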

    Of course, the illegitimate spider or reckless wget user is not going to be so obliging. Those annoying visitors will normally have domain names indicating a dynamic IP address, or even no reverse DNS lookup at all! For these, I use a script I wrote that looks for certain behavioral patterns.

    Humans and bots are trying to do different things on a site and so they behave differently. Human users who spend a long time on the site tend to visit selected pages and may visit them repeatedly. It takes a certain amount of time to physically move from page to page, so the number of hits per minute should be much lower than a bot's. Human beings also tend to visit content pages and items linked directly to those pages.

    An IP address that is hitting your site with requests 100x a minute or is visiting every page on your site just once (or doing both at the same time!) is most likely *not* human. So I look first for IP addresses that have contributed heavily to bursts in traffic. I also look for IP addresses that have visited large numbers of pages or are systematically visiting pages that are supposed to be off limits to robots or of little interest to human visitors.
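
    A rough sketch of that kind of scan over a combined-format access log; the cut-offs are arbitrary illustrations, not the values from my actual script:

        use strict;
        use warnings;

        # Count hits per IP per minute and the distinct paths each IP touched.
        my (%per_minute, %pages);
        while (my $line = <>) {
            next unless $line =~ m{^(\S+) .*?\[([^\]:]+:\d+:\d+):\d+ .*?"\w+ (\S+)};
            my ($ip, $minute, $path) = ($1, $2, $3);
            $per_minute{$ip}{$minute}++;
            $pages{$ip}{$path} = 1;
        }

        for my $ip (keys %per_minute) {
            my ($peak)   = sort { $b <=> $a } values %{ $per_minute{$ip} };
            my $distinct = keys %{ $pages{$ip} };
            print "$ip: peak $peak hits/min, $distinct distinct pages\n"
                if $peak > 100 || $distinct > 500;    # arbitrary cut-offs
        }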

    Knowing that a certain IP address is a bot or spider doesn't necessarily buy you much. If you are trying to improve statistics used for marketing, I suppose you can just eliminate the probable bots from your stats. However, if your goal is security, I'm not sure knowing that an IP is a bot is going to help you much.

    Dynamic IP addresses shift around, so blocking Mr. Bad Guy at IP xxx.xxx.xxx.xxx today may block Mr. Good Guy tomorrow. To block such IP addresses you would probably need some type of software that allows you to expire the block based on the length of time since the bad behavior occurred.
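
    One simple way to sketch that expiry, with a plain hash of offending IPs and the time of their last misdeed (the 24-hour window is just an example, and in practice the hash would live in something persistent such as a dbm file or database table):

        use strict;
        use warnings;

        my $block_for = 24 * 60 * 60;   # expire a block after 24 hours (example)
        my %blocked;                    # IP => epoch time of last bad behaviour

        sub note_bad_behaviour {
            my ($ip) = @_;
            $blocked{$ip} = time;
        }

        sub is_blocked {
            my ($ip) = @_;
            return 0 unless exists $blocked{$ip};
            if (time - $blocked{$ip} > $block_for) {
                delete $blocked{$ip};   # the block has aged out
                return 0;
            }
            return 1;
        }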

    Best, beth

Re: User tracking
by moritz (Cardinal) on May 18, 2009 at 08:06 UTC
    Usually when analyzing log files you assume that each unique combination of user agent and IP is one user. You can analyze your log files on a day-by-day basis and simply subtract the timestamp of the first visit from that of the last visit, which gives you some measure of the time somebody stays, provided they load at least two different pages.

    That will give you some kind of flawed measure, but it's enough to give you a rough idea.
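
    A minimal sketch of that first-hit/last-hit calculation, keyed on IP plus user agent and assuming one day's worth of a combined-format log:

        use strict;
        use warnings;
        use Time::Local qw(timelocal);

        my %mon = (Jan=>0,Feb=>1,Mar=>2,Apr=>3,May=>4,Jun=>5,
                   Jul=>6,Aug=>7,Sep=>8,Oct=>9,Nov=>10,Dec=>11);

        my (%first, %last);
        while (my $line = <>) {
            # a combined-format line ends with the quoted user-agent string
            next unless $line =~
                m{^(\S+) .*?\[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+).*"([^"]*)"\s*$};
            my ($ip, $d, $mo, $y, $h, $mi, $s, $ua) = ($1,$2,$3,$4,$5,$6,$7,$8);
            my $t    = timelocal($s, $mi, $h, $d, $mon{$mo}, $y);
            my $user = "$ip $ua";
            $first{$user} = $t if !exists $first{$user} || $t < $first{$user};
            $last{$user}  = $t if !exists $last{$user}  || $t > $last{$user};
        }

        for my $user (keys %first) {
            my $stay = $last{$user} - $first{$user};
            print "$user stayed ~${stay}s\n" if $stay > 0;   # needs at least two hits
        }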

      I always remember the day one of our major customers came to us, furiously demanding to know what we had done to block all their users. Their number of "visits" had dropped by about 80% and they were sure we had done something to prevent their users accessing the site. They also wanted us to track down one rogue user that now accounted for almost all the traffic to their site. That user, it turned out, had the IP address of our new proxy server. Flawed is a key concept in log analysis.

      That will give you some kind of flawed measure,
      The emphasis of course being on flawed. :-)

      CountZero

      "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: User tracking
by Anonymous Monk on May 18, 2009 at 07:39 UTC
    The only way to track users is with chip implants.
      Depending on the situation (e.g. tracking people within a single building) a long piece of string may also suffice.

      --
      use JAPH;
      print JAPH::asString();