dws has asked for the wisdom of the Perl Monks concerning the following question:

Folks, I need some enlightenment.

I have a web log analysis script that threads together "sessions" and reports on them in blocks, so that I can get some idea of what paths individuals take when they visit one of my sites.

For my purposes, a session is a set of visits from a unique hostname/IP address, ending after a certain time has passed since the last visit. This works pretty well, except for people coming in through AOL proxies, where an actual session can be spread across different proxy servers. This blows the idea of using the hostname/IP address as a hash key for tracking ongoing sessions. I've tried a couple of techniques for threading these sessions together, including using the user-agent string to disambiguate simultaneous AOL sessions, but I'm not happy with the results.
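
To make the approach concrete, here is a minimal sketch of that kind of threading, written in Python for brevity (the hash-keyed logic maps directly onto Perl hashes). It is not the actual script. The 30-minute timeout, the `.proxy.aol.com` suffix test, and the names `session_key` and `sessionize` are all assumptions made for illustration.

```python
TIMEOUT = 30 * 60  # seconds of inactivity that end a session (assumed value)

def session_key(host, user_agent):
    """Build the hash key used to track an ongoing session."""
    # Assumption: AOL proxy hosts resolve to names ending in ".proxy.aol.com".
    # All of them are collapsed into one bucket, split only by user agent.
    if host.endswith(".proxy.aol.com"):
        return ("aol-proxy", user_agent)
    return (host, user_agent)

def sessionize(hits, timeout=TIMEOUT):
    """hits: iterable of (epoch_seconds, host, user_agent, path), sorted by time.
    Returns a list of sessions, each a list of paths in visit order."""
    open_sessions = {}  # key -> (last_seen, paths)
    finished = []
    for ts, host, ua, path in hits:
        key = session_key(host, ua)
        current = open_sessions.get(key)
        # Close the session if the gap since the last hit exceeds the timeout.
        if current is not None and ts - current[0] > timeout:
            finished.append(current[1])
            current = None
        paths = current[1] if current else []
        paths.append(path)
        open_sessions[key] = (ts, paths)
    # Whatever is still open at the end of the log counts as a session too.
    finished.extend(paths for _, paths in open_sessions.values())
    return finished
```

The weakness is visible in `session_key`: two simultaneous AOL visitors with identical user-agent strings are merged into one session.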

Have you run into this problem and come up with a satisfactory approach? Or have you run across an article that deals with this problem? (I've checked merlyn's columns.) Thanks in advance for any insights/pointers you can provide.


Replies are listed 'Best First'.
Re: Threading together "sessions" from browser logs
by Zaxo (Archbishop) on Sep 20, 2002 at 18:18 UTC

    Apache's mod_usertrack gives you the ability to log session cookies.

    Update: Do you have mod_log_config available? It has a deprecated interface to cookies.

    After Compline,
    Zaxo

      Yes, that's typically the way to do it. Or you can just roll your own. All this module does is send a cookie with a unique ID to anyone who comes in without one. Then you use a custom log format to log the cookie as part of each access, giving you a unique ID to trace. People can still browse the site if they don't accept the cookie, so it's not a terrible thing to do.
      Apache's mod_usertrack gives you the ability to log session cookies.

      If I move the site to a colo box, that's what I'll do. For the moment, though, it's hosted at an ISP that doesn't support mod_usertrack, so I'm looking for downstream (logfile-based) options.

•Re: Threading together "sessions" from browser logs
by merlyn (Sage) on Sep 20, 2002 at 18:46 UTC
    Lincoln Stein wrote some code once that detected "robots" by looking at weblogs: looking for long sessions from identical IP addresses crossed with UserAgent strings. I don't recall where I got the code now, but I know it was interesting in terms of analysis.

    It might have been in the mod_perl book, or perhaps his Net::* book.
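
For illustration only, and not Lincoln Stein's code, the heuristic described above could be sketched like this in Python. The thresholds (`min_requests`, `min_duration`) and the function name `likely_robots` are assumptions; sensible values would depend on the site's traffic.

```python
from collections import defaultdict

def likely_robots(hits, min_requests=200, min_duration=3600):
    """hits: iterable of (epoch_seconds, ip, user_agent).
    Returns the sorted (ip, user_agent) pairs whose activity is both
    long-running and high-volume, which is the robot-like pattern."""
    stats = defaultdict(lambda: [None, None, 0])  # first_ts, last_ts, count
    for ts, ip, ua in hits:
        entry = stats[(ip, ua)]
        entry[0] = ts if entry[0] is None else min(entry[0], ts)
        entry[1] = ts if entry[1] is None else max(entry[1], ts)
        entry[2] += 1
    return sorted(
        key for key, (first, last, count) in stats.items()
        if count >= min_requests and last - first >= min_duration
    )
```

Keying on the IP address crossed with the user agent keeps a busy shared proxy from being flagged merely because many different browsers sit behind it.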

    -- Randal L. Schwartz, Perl hacker

Re: Threading together "sessions" from browser logs
by sauoq (Abbot) on Sep 21, 2002 at 01:14 UTC

    I agree that mod_usertrack is the Right Way™ to do it; it's too bad you don't have that luxury.

    Are you sure you need that sort of granularity though?

    If the site gets a reasonable amount of traffic and you log referrers you might consider just analyzing the aggregated data. Unless you are looking for something very specific or need it for debugging, it's likely that you'll find yourself munging your click trail data down to something that looks like referrer logs in order to gather statistics on it anyway.
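
As a rough sketch of that aggregated view (Python, with hypothetical names), the following counts internal page-to-page transitions from referrer data without reconstructing any sessions. The input shape and the `site_host` parameter are assumptions for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit

def transition_counts(hits, site_host):
    """hits: iterable of (referrer_url, requested_path).
    Counts (from_path, to_path) transitions within the site, skipping
    hits whose referrer is missing or points at another host."""
    counts = Counter()
    for referrer, path in hits:
        if not referrer or referrer == "-":
            continue  # no referrer was logged for this hit
        parts = urlsplit(referrer)
        if parts.hostname != site_host:
            continue  # arrived from an external site
        counts[(parts.path or "/", path)] += 1
    return counts
```

Calling `transition_counts(hits, "example.com").most_common(10)` would list the ten most frequent internal links followed, which is often enough to answer usability questions without per-visitor click trails.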

    -sauoq
    "My two cents aren't worth a dime.";
    
Re: Threading together "sessions" from browser logs
by particle (Vicar) on Sep 20, 2002 at 18:21 UTC
    i'm by no means an expert on this subject, but i wonder why you're not logging a session id (that is, if such a thing exists for your app.)

    perhaps you can clarify your definition of "session" a bit.

    ~Particle *accelerates*

      A "session" is the sequence of pages that a given user visits during a browser session. I'm interested in looking at the paths people take through the site, so that I can answer some basic usability questions based on what percentage of users do certain things. Since a "browser session" is hard to quantify, I've arbitrarily decided that a new session starts when N minutes have elapsed since the last page fetch.
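
Once sessions exist as ordered lists of pages, the "what percentage of users do certain things" question reduces to a small calculation. A minimal Python sketch follows; the function name `share_of_sessions` and the idea of matching steps in order (but not necessarily back to back) are assumptions about what counts as "doing" something.

```python
def share_of_sessions(sessions, steps):
    """Return the percentage of sessions that visit every page in `steps`
    in the given order, allowing other pages in between."""
    def follows(paths):
        it = iter(paths)
        # `step in it` advances the iterator, so each step must be found
        # after the previous one.
        return all(step in it for step in steps)
    if not sessions:
        return 0.0
    return 100.0 * sum(follows(s) for s in sessions) / len(sessions)
```

For example, `share_of_sessions(sessions, ["/products", "/order"])` gives the share of sessions that reached the order page at some point after viewing the product page.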

      Much of the site is static, leaving no easy way to track a session id. And the site is hosted, cutting off the opportunity of doing nifty things with Apache. I have to rely on server logs.

Re: Threading together "sessions" from browser logs
by blm (Hermit) on Sep 21, 2002 at 17:33 UTC

    You say you can only go on what is in the logs. Can you tell us what your web server is logging? It might help your cause if the referrer is being logged; that is how Apache comes configured on my Debian machine.

      Can you tell us what your web server is logging? It might be helpful to your cause if the referrer is being logged.

      Standard, out-of-the-box Apache logs. I do notice that I'm not getting referrer info in all cases, and suspect that there may be firewalls/proxies that are stripping it off in some cases, so I can't reliably use the referrer to chain together requests.
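
One way to check how patchy the referrer data really is: a hedged Python sketch that parses Apache common or combined log lines and reports the fraction of requests with no usable referrer. The regular expression is a simplification (it assumes no escaped quotes inside the quoted fields), and the function names are made up for illustration.

```python
import re

# Common Log Format, with the two combined-format fields optional.
COMBINED = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it doesn't match."""
    m = COMBINED.match(line)
    return m.groupdict() if m else None

def missing_referrer_rate(lines):
    """Fraction of parseable requests whose referrer is absent, empty, or '-'."""
    parsed = [p for p in map(parse_line, lines) if p]
    if not parsed:
        return 0.0
    missing = sum(1 for p in parsed if p["referrer"] in (None, "", "-"))
    return missing / len(parsed)
```

A "-" referrer is also normal for typed-in URLs and bookmarks, so a high rate does not by itself prove that proxies are stripping the header.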