Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

One script I have always wanted to write is one that will search through a log file created by a proxy server or firewall and give a basic report of how long each user spent browsing the web. After messing with Perl for a year or so, I could now probably figure out how to write the code, but I can't quite figure out the logic. My problem comes from the information most logs contain. They simply have an entry with the URL visited by a user as well as other data, including the date and time it was visited, but not the length of time spent at that URL. Then, when the user changes addresses, that new URL is logged, and so on. The problem I am struggling with is how to come up with a decent way of taking that information and adding it together to arrive at a user's total time spent on the web. Keep in mind that while a user is looking at one site, the log may get several entries from other users going to other URLs. Also, the server doesn't log anything when the user simply closes his browser. In other words, it's not just as simple as adding those times together. What if the user quit browsing at 9:00 a.m. and started again at 2:00 p.m.?

Maybe someone has seen such a script that might lead me in the right direction?

Re: Perl Programming Logic
by VSarkiss (Monsignor) on Jul 01, 2002 at 18:49 UTC

    This isn't really a Perl problem, it's a data problem.

    You can't really tell how long someone spent reading a web page from looking at logs, in general. The problem is that the log only has the times the browser sent out a GET or POST or other HTTP request, and when the web site responded (generally). You can't tell from that how long someone "interacted" with a site. I can download a Java or Flash game from a site with one GET, then spend several hours playing with nothing getting logged. Similarly, I can retrieve a single page, close the browser, and retrieve another page several hours later. The log can't tell you that I didn't even have the browser open in between those two times.

Re: Perl Programming Logic
by caedes (Pilgrim) on Jul 01, 2002 at 18:50 UTC
    It seems that your understanding of client-server communication in HTTP is a bit inaccurate. All that the server sees when a user goes to the site is "give me xxxxxx", followed possibly by "give me yyyyyyy". You might interpret that to mean the user spent the time between those two requests looking at xxxxxx, but that isn't necessarily the case. Another point is that it is impossible to tell when a user "closed the browser"; however, you can assume that they have left your site after they don't request a document for a given length of time. The whole point here is that you have to make reasonable assumptions in order to hopefully get an idea of what might have happened.

    As for solving your problem, I would probably split the log files up by IP address (which may or may not stay the same for a given user, but that is another discussion). Then set a length of time that you consider too long to spend viewing one page, say one hour. Whenever an hour goes by for a single user between page requests, you interpret that to mean the person quit surfing.
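    A minimal sketch of that approach, assuming hypothetical log lines of the form "<epoch_seconds> <client_ip> <url> ..." (adjust the parsing to whatever your proxy actually writes):

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $timeout = 3600;        # gaps longer than this mean the user quit surfing
        my (%last_seen, %total);   # per-IP state

        while (<>) {
            my ($time, $ip) = split ' ', $_;
            next unless defined $ip;

            if (exists $last_seen{$ip}) {
                my $gap = $time - $last_seen{$ip};
                $total{$ip} += $gap if $gap >= 0 && $gap < $timeout;
            }
            $last_seen{$ip} = $time;
        }

        printf "%-15s %6.1f minutes\n", $_, $total{$_} / 60 for sort keys %total;

    The totals are only as good as the one-hour cutoff behind them, of course.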

    I hope this helps you out some. ;-)

    -caedes

      By the way, I know this must be possible as I used to work for a company that used a product (relatively expensive, I might add) called "WebTrends for firewalls & VPN's" that would do exactly what I would like to do.
        It is trivial to produce a number and claim it means something.

        It is much harder to produce a number that really means what you have claimed.

        The fact that a proprietary product claims to accomplish a goal is not always very good evidence that that goal is, in fact, technically accomplishable.

Re: Perl Programming Logic
by newrisedesigns (Curate) on Jul 01, 2002 at 19:57 UTC

    You'll need to either get a better tracking system (other than a log), or you'll have to get creative.

    For tracking, might I suggest first pulling out all instances of each user and grouping them together. Put all your 192.168.1.214's into one group, and all your other IPs into their own groups. Then check all the URLs of each person. To make it easier, you could strip out requests for known places like ads.x10.com, since such a request is most likely a pop-up and not an intended click.

    Creativity steps in here. You will have to assume that a user who requests /index.html, /left.html, and /right.html within 5 seconds just loaded a frameset; consider that one request. Now your user gets /index.html, then /blue.html, then /red.html, then /green.html over a 3-minute period. That's four requests that did not occur within a few-second window. The user should be considered "surfing" for those 3 minutes, because those requests were more than likely made by a human.
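    A rough sketch combining both ideas, again assuming made-up "<epoch_seconds> <client_ip> <url>" log lines: drop requests to known ad hosts, ignore gaps small enough to be a single frameset load, and count everything else (up to a cutoff) as surfing time.

        #!/usr/bin/perl
        use strict;
        use warnings;

        my %ad_host = map { $_ => 1 } qw(ads.x10.com);   # extend with other known ad servers
        my $burst   = 5;      # seconds; requests closer together than this are one page load
        my $timeout = 3600;   # seconds; gaps longer than this mean the user stopped surfing

        my (%last, %surfed);

        while (<>) {
            my ($time, $ip, $url) = split ' ', $_;
            next unless defined $url;

            my ($host) = $url =~ m{^(?:\w+://)?([^/:]+)};
            next if defined $host && $ad_host{$host};    # probably a pop-up, not a click

            if (exists $last{$ip}) {
                my $gap = $time - $last{$ip};
                $surfed{$ip} += $gap if $gap >= $burst && $gap < $timeout;
            }
            $last{$ip} = $time;
        }

        printf "%-15s %6.1f minutes\n", $_, $surfed{$_} / 60 for sort keys %surfed;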

    Bringing relative time into the situation can cause some headaches, too. When I surf, I usually have 3-10 windows open. You will need to find some way to distinguish a clicked link from a cold request, and a method to determine what kind of information is held on the requested page. You could do that with LWP: read through the file to see if there are tags for Shockwave games, streaming video, or large amounts of text (online books), and draw conclusions from those results and your request log.
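    For the content check, a hedged sketch with LWP::Simple: it just counts a few tell-tale tags and the rough amount of visible text, and the 2000-word threshold is made up for illustration.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use LWP::Simple qw(get);

        my $url  = shift or die "usage: $0 <url>\n";
        my $html = get($url);
        die "couldn't fetch $url\n" unless defined $html;

        # Very crude heuristics; a real classifier would use HTML::Parser or the like.
        my $embeds = () = $html =~ /<(?:embed|object)\b/gi;   # Shockwave/Flash, streaming plugins
        (my $text  = $html) =~ s/<[^>]*>//g;                  # strip tags to estimate visible text
        my $words  = () = $text =~ /\S+/g;

        print "$url: $embeds embedded object(s), roughly $words words of text\n";
        print "Long read; a human probably stays on this page a while.\n" if $words > 2000;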

    It can be done, but it will be difficult without some other form of recording information. If you have the resources, you could set up a web proxy that monitors clicked links and observe information that way. Or you could avoid Perl altogether: install BackOrifice on the machine you want to monitor and watch what your users are viewing.

    Anonymonk, might I suggest that you create a user name and stay awhile. :)

    John J Reiser
    newrisedesigns.com

Re: Perl Programming Logic
by perigeeV (Hermit) on Jul 01, 2002 at 19:08 UTC

    To accurately track a person's click path you need more than logs. Log information cannot differentiate between users sharing a proxy, or between different users that have been sequentially assigned the same communal IP address, as with dialup ISP users.

    You can assign a session ID to a user and track that ID number. Super Search for "maintaining state" or some such.

    To really know how long someone is viewing a page, you would have to use a client refresh at timed intervals. For instance, you could have some JavaScript that updates a dummy one-pixel image at regular intervals. The refresh would include that user's unique ID, so you just sum the times between each refresh.
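    On the server side, the beacon can be a tiny CGI script that records the hit and hands back an existing one-pixel image; the log path and the "id" parameter below are hypothetical.

        #!/usr/bin/perl
        # Hypothetical beacon handler: record the hit, then point the browser at a static 1x1 image.
        use strict;
        use warnings;
        use CGI qw(:standard);

        my $id = param('id') || 'unknown';
        $id =~ s/[^\w-]//g;                     # never trust what the client sends

        open my $log, '>>', '/var/log/beacon.log' or die "beacon log: $!";
        print {$log} time(), " $id\n";
        close $log;

        print redirect('/images/blank.gif');    # any 1x1 image already on the server will do

    Viewing time per ID is then just the sum of the gaps between consecutive beacon hits, keyed on the session ID instead of the client IP.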


Re: Perl Programming Logic
by Abigail-II (Bishop) on Jul 02, 2002 at 10:12 UTC
    Your biggest problem is not the logic of the problem, but the logic of what you want to do.

    You are data mining over HTTP log files. HTTP is essentially a stateless, sessionless protocol. Yet you want to measure the length of "sessions" somehow.

    You're headed for failure. You want to measure something that isn't really there. And it isn't just "quitting browsing" that will spoil your day. When I hit "preview" in a minute, it will take a while before the request reaches perlmonks, perlmonks does what it wants to do, the response comes back, the ad is fetched, and the page is displayed. I'll switch to IRC, p5p, or some actual work before I return my attention to the preview page. There might be 20 minutes between hitting 'preview' and 'submit' on the next page. Did I "browse" for 20 minutes? No, I probably didn't even spend 20 seconds.

    Oh, did I mention that the user name for the proxy I'm using is shared with a whole bunch of people, and that we're rotating between several proxies? That would really screw up your analysis, wouldn't it? ;-)

    My suggestion: give up on the idea. It's utterly useless, the data you have can't measure what you want to measure, and what you want to measure doesn't have much connection to what you want to know anyway.

    Abigail

Re: Perl Programming Logic
by grantm (Parson) on Jul 02, 2002 at 11:48 UTC

    The author of Analog has written an article on how the web works and what can and can't be determined by analysing logs.

    I have spent quite a lot of time working with both Analog and WebTrends. I recommend the former (with Report Magic) for people who want to understand site usage patterns and the latter for people who have lots of money and a willingness to base business decisions on nicely presented but completely meaningless numbers. Even when I phrase it exactly like that, it's amazing how many clients go for the latter.

Re: Perl Programming Logic
by arc_of_descent (Hermit) on Jul 02, 2002 at 13:59 UTC

    Hi,
    If the time spent retrieving a particular page matches what you mean by time spent viewing the page, then that value could be of use to you. For example, the squid proxy server writes the time spent (in milliseconds) retrieving each web object into its logs.
    --
    arc_of_descent