gemoroy has asked for the wisdom of the Perl Monks concerning the following question:

Hi, monks! Maybe this is not the place for this question to be posted, but are there ways of tracking the time a user stays on a page, other than sessions and sending a query to the DB from AJAX?

Replies are listed 'Best First'.
Re: User tracking
by CountZero (Bishop) on May 18, 2009 at 07:57 UTC
    The server will normally never be informed when the user "leaves" (for any definition of "leave") the page. Usually I have several tabs open within my browser. Am I "staying" on all these pages all of the time? What if I open another browser? Have I then "left" all the pages in the previous browser?

    The best you can hope to achieve with a little bit of JavaScript is to get notified when the user "closes" your page, but other than telling you that the page is no longer showing on the user's system, such a message has --IMHO-- nothing significant to tell you.

    What are you really trying to achieve?

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James


      I am trying to distinguish bots from people.
      It's quite hard to do because of the flexibility of libraries such as libwww...
      And the presence of JS can't be the main indicator.
        Interesting ...

        And how would knowing the time someone stays on a page help you in determining whether it is a bot or a human who accessed the page?

        And even more important: why do you need to know this? Do you want to refuse access to bots? Then include a robots.txt file on your server. "Bad bots" will not be stopped, of course, but as far as I know no other technology will be able to do so, provided the bad bots are equipped with a modicum of intelligence.
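
        For the compliant crawlers a couple of lines of robots.txt is all it takes; a minimal example (the /private/ path is hypothetical):

            User-agent: *
            Disallow: /private/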

        CountZero

        "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

        Checking user-agent generally does work pretty well, as does robots.txt.
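
        In a CGI setting that check can be as small as the following sketch (the pattern is only a rough example, and of course trivially spoofed by a bot that sends a browser UA string):

            # User agent as passed to a CGI script by the web server
            my $ua = $ENV{HTTP_USER_AGENT} || '';
            my $looks_like_bot = $ua =~ /bot|crawl|spider|slurp/i;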

        If you actually need to identify (rogue) bots which use browser UA strings and ignore robots.txt, your best chance would be to look at the patterns in the timestamps for when pages are requested (see the sketch after this list):

        • Humans will generally either open one page at a time, making single requests at wildly irregular intervals (possibly interspersed with HEAD requests when they use the "back" button), or open everything in tabs, producing flurries of several requests within a few seconds followed by generally longer intervals of few-or-no requests.
        • Bots will tend to request pages at a relatively steady rate - even if they have randomness in their delay, it's rarely more than half the base interval - and often quicker than a human would.
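
        A rough sketch of that interval check, reading an Apache combined-format log; the five-hit minimum and the 30-second/regularity thresholds are made-up examples, not tested cut-offs:

            use strict;
            use warnings;
            use List::Util qw(sum);
            use Time::Local qw(timelocal);

            my %mon = (Jan=>0,Feb=>1,Mar=>2,Apr=>3,May=>4,Jun=>5,
                       Jul=>6,Aug=>7,Sep=>8,Oct=>9,Nov=>10,Dec=>11);

            # Collect request timestamps (epoch seconds) per client IP.
            my %times;
            while (my $line = <>) {
                next unless $line =~ m{^(\S+) .*?\[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+)};
                my ($ip, $d, $mo, $y, $h, $mi, $s) = ($1, $2, $3, $4, $5, $6, $7);
                push @{ $times{$ip} }, timelocal($s, $mi, $h, $d, $mon{$mo}, $y);
            }

            # Flag IPs whose gaps between requests are short and suspiciously even.
            for my $ip (keys %times) {
                my @t = sort { $a <=> $b } @{ $times{$ip} };
                next if @t < 5;                                 # too few hits to judge
                my @gaps = map { $t[$_] - $t[$_ - 1] } 1 .. $#t;
                my $mean = sum(@gaps) / @gaps;
                my $sd   = sqrt( sum(map { ($_ - $mean)**2 } @gaps) / @gaps );
                printf "%s looks automated (mean gap %.1fs)\n", $ip, $mean
                    if $mean < 30 && $sd < $mean / 2;
            }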
        Don't rely on JavaScript to make your determination. Some of us use the NoScript plugin, which blocks JavaScript from running unless it comes from a whitelisted site, but we're still not bots.

        Anyhow, what are you attempting to accomplish by identifying what's a bot and what isn't?

        You might find the visualization presented in O'Reilly's A New Visualization for Web Server Logs interesting. In some cases, automated access will stand out quite clearly, and it may help you determine what criteria you want to use if you want to automate detection.

Re: User tracking
by ELISHEVA (Prior) on May 18, 2009 at 11:49 UTC

    There are two techniques I've been using recently to identify automated visitors and nasties.

    The first is simply to do a reverse DNS lookup. The legitimate bots (MSN, Google, Yahoo) have reverse DNS entries for their crawlers' IP addresses and often include the word "bot" in the name (e.g. googlebot). That is the easy way.
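
    A minimal sketch of that lookup using only core Perl; the IP is just an example, and the domain suffixes are the usual ones for the Google, MSN and Yahoo crawlers:

        use strict;
        use warnings;
        use Socket qw(inet_aton AF_INET);

        my $ip   = '66.249.66.1';    # example address
        my $name = gethostbyaddr(inet_aton($ip), AF_INET);

        if (defined $name && $name =~ /(?:googlebot\.com|search\.msn\.com|crawl\.yahoo\.net)$/) {
            print "$ip reverse-resolves to $name - looks like a known crawler\n";
        }
        else {
            print "$ip: ", defined $name ? $name : 'no reverse DNS at all', "\n";
        }

    A forward lookup on the returned name (gethostbyname) to check that it maps back to the same IP guards against faked PTR records.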

    Of course, the illegitimate spider or reckless wget user is not going to be so obliging. Those annoying visitors will normally have domain names indicating a dynamic IP address, or even no reverse DNS lookup at all! For these, I use a script I wrote that looks for certain behavioral patterns.

    Humans and bots are trying to do different things on a site and so they behave differently. Human users who spend a long time on the site tend to visit selected pages and may visit them repeatedly. It takes a certain amount of time to physically move from page to page, so the number of hits per minute should be much lower than a bot's. Human beings also tend to visit content pages and items linked directly to those pages.

    An IP address that is hitting your site with requests 100x a minute or is visiting every page on your site just once (or doing both at the same time!) is most likely *not* human. So I look first for IP addresses that have contributed heavily to bursts in traffic. I also look for IP addresses that have visited large numbers of pages or are systematically visiting pages that are supposed to be off limits to robots or of little interest to human visitors.
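
    A rough sketch of that kind of scan over a combined-format access log; the cut-offs are arbitrary illustrations, not the values from my actual script:

        use strict;
        use warnings;

        # Count hits per IP per minute and the distinct paths each IP touched.
        my (%per_minute, %pages);
        while (my $line = <>) {
            next unless $line =~ m{^(\S+) .*?\[([^\]:]+:\d+:\d+):\d+ .*?"\w+ (\S+)};
            my ($ip, $minute, $path) = ($1, $2, $3);
            $per_minute{$ip}{$minute}++;
            $pages{$ip}{$path} = 1;
        }

        for my $ip (keys %per_minute) {
            my ($peak)   = sort { $b <=> $a } values %{ $per_minute{$ip} };
            my $distinct = keys %{ $pages{$ip} };
            print "$ip: peak $peak hits/min, $distinct distinct pages\n"
                if $peak > 100 || $distinct > 500;    # arbitrary cut-offs
        }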

    Knowing that a certain IP address is a bot or spider doesn't necessarily buy you much. If you are trying to improve statistics used for marketing, I suppose you can just eliminate the probable bots from your stats. However, if your goal is security, I'm not sure knowing that an IP is a bot is going to help you much.

    Dynamic IP addresses shift around, so blocking Mr. Bad Guy at IP xxx.xxx.xxx.xxx today may block Mr. Good Guy tomorrow. To block such IP addresses you would probably need some type of software that allows you to expire the block based on the length of time since the bad behavior occurred.
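
    One simple way to sketch that expiry, with a plain hash of offending IPs and the time of their last misdeed (the 24-hour window is just an example, and in practice the hash would live in something persistent such as a dbm file or database table):

        use strict;
        use warnings;

        my $block_for = 24 * 60 * 60;   # expire a block after 24 hours (example)
        my %blocked;                    # IP => epoch time of last bad behaviour

        sub note_bad_behaviour {
            my ($ip) = @_;
            $blocked{$ip} = time;
        }

        sub is_blocked {
            my ($ip) = @_;
            return 0 unless exists $blocked{$ip};
            if (time - $blocked{$ip} > $block_for) {
                delete $blocked{$ip};   # the block has aged out
                return 0;
            }
            return 1;
        }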

    Best, beth

Re: User tracking
by moritz (Cardinal) on May 18, 2009 at 08:06 UTC
    Usually when analyzing log files you assume that each unique combination of user agent and IP is one user. You can analyze your log files on a day-by-day basis and simply subtract the timestamp of the first visit from that of the last visit, which gives you some measure of the time somebody stays, provided they load at least two different pages.

    That will give you some kind of flawed measure, but it's enough to give you a rough idea.
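
    A minimal sketch of that first-hit/last-hit calculation, keyed on IP plus user agent and assuming one day's worth of a combined-format log:

        use strict;
        use warnings;
        use Time::Local qw(timelocal);

        my %mon = (Jan=>0,Feb=>1,Mar=>2,Apr=>3,May=>4,Jun=>5,
                   Jul=>6,Aug=>7,Sep=>8,Oct=>9,Nov=>10,Dec=>11);

        my (%first, %last);
        while (my $line = <>) {
            # a combined-format line ends with the quoted user-agent string
            next unless $line =~
                m{^(\S+) .*?\[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+).*"([^"]*)"\s*$};
            my ($ip, $d, $mo, $y, $h, $mi, $s, $ua) = ($1,$2,$3,$4,$5,$6,$7,$8);
            my $t    = timelocal($s, $mi, $h, $d, $mon{$mo}, $y);
            my $user = "$ip $ua";
            $first{$user} = $t if !exists $first{$user} || $t < $first{$user};
            $last{$user}  = $t if !exists $last{$user}  || $t > $last{$user};
        }

        for my $user (keys %first) {
            my $stay = $last{$user} - $first{$user};
            print "$user stayed ~${stay}s\n" if $stay > 0;   # needs at least two hits
        }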

      I always remember the day one of our major customers came to us, furiously demanding to know what we had done to block all their users. Their number of "visits" had dropped by about 80% and they were sure we had done something to prevent their users accessing the site. They also wanted us to track down one rogue user that now accounted for almost all the traffic to their site. That user, it turned out, had the IP address of our new proxy server. Flawed is a key concept in log analysis.

      That will give you some kind of flawed measure,
      The emphasis of course being on flawed. :-)

      CountZero

      "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: User tracking
by Anonymous Monk on May 18, 2009 at 07:39 UTC
    The only way to track users is with chip implants.
      Depending on the situation (e.g. tracking people within a single building) a long piece of string may also suffice.

      --
      use JAPH;
      print JAPH::asString();