Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

One script I have always wanted to write is one that will search through a log file created by a proxy server or firewall and give a basic report of how long each user spent browsing the web. After messing with Perl for a year or so, I could now probably figure out how to write the code, but I can't quite figure out the logic. My problem comes from the information most logs contain. They simply have an entry with the URL visited by a user as well as other data, including the date and time it was visited, but not the length of time spent at that URL. Then, when the user changes addresses, that new URL is logged, and so on. The problem I am struggling with is how to come up with a decent way of taking that information and adding it together to arrive at a user's total time spent on the web. Keep in mind that while a user is looking at one site, the log may get several entries from other users going to other URLs. Also, the server doesn't log anything when the user simply closes his browser. In other words, it's not just as simple as adding those times together. What if the user quit browsing at 9:00 a.m. and started again at 2:00 p.m.?

Maybe someone has seen such a script that might lead me in the right direction?

Re: Perl Programming Logic
by VSarkiss (Monsignor) on Jul 01, 2002 at 18:49 UTC

    This isn't really a Perl problem, it's a data problem.

    You can't really tell how long someone spent reading a web page from looking at logs, in general. The problem is that the log only has the times the browser sent out a GET or POST or other HTTP request, and when the web site responded (generally). You can't tell from that how long someone "interacted" with a site. I can download a Java or Flash game from a site with one GET, then spend several hours playing with nothing getting logged. Similarly, I can retrieve a single page, close the browser, and retrieve another page several hours later. The log can't tell you that I didn't even have the browser open in between those two times.

Re: Perl Programming Logic
by caedes (Pilgrim) on Jul 01, 2002 at 18:50 UTC
    It seems that your understanding of client-server communication in HTTP is a bit inaccurate. All that the server sees when a user goes to the site is "give me xxxxxx", followed possibly by "give me yyyyyyy". You might interpret that to mean the user spent the time between those two requests looking at xxxxxx, but that isn't necessarily the case. Another point is that it is impossible to tell when a user "closed the browser"; however, you can assume that they have left your site after they don't request a document for a given length of time. The whole point here is that you have to make reasonable assumptions in order to hopefully get an idea of what might have happened.

    As for solving your problem, I would probably split the log files up by IP address (which may or may not stay the same for a given user, but that is another discussion). Then set a length of time that you consider too long to spend viewing one page, say one hour. Whenever an hour goes by for a single user between page requests, you interpret that to mean the person quit surfing.
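    A minimal sketch of that approach, assuming hypothetical log lines of the form "<epoch_seconds> <client_ip> <url> ..." (adjust the parsing to whatever your proxy actually writes):

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $timeout = 3600;        # gaps longer than this mean the user quit surfing
        my (%last_seen, %total);   # per-IP state

        while (<>) {
            my ($time, $ip) = split ' ', $_;
            next unless defined $ip;

            if (exists $last_seen{$ip}) {
                my $gap = $time - $last_seen{$ip};
                $total{$ip} += $gap if $gap >= 0 && $gap < $timeout;
            }
            $last_seen{$ip} = $time;
        }

        printf "%-15s %6.1f minutes\n", $_, $total{$_} / 60 for sort keys %total;

    The totals are only as good as the one-hour cutoff behind them, of course.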

    I hope this helps you out some. ;-)

    -caedes

      By the way, I know this must be possible as I used to work for a company that used a product (relatively expensive, I might add) called "WebTrends for firewalls & VPN's" that would do exactly what I would like to do.
        It is trivial to produce a number and claim it means something.

        It is much harder to produce a number that really means what you have claimed.

        The fact that a proprietary product claims to accomplish a goal is not always very good evidence that that goal is, in fact, technically accomplishable.

Re: Perl Programming Logic
by newrisedesigns (Curate) on Jul 01, 2002 at 19:57 UTC

    You'll need to either get a better tracking system (other than a log), or you'll have to get creative.

    For tracking, might I suggest first pulling out all instances of each user and grouping them together. Put all your 192.168.1.214's into one group, and all your other IPs into their own groups. Then check all the URLs of each person. To make it easier, you could strip out requests for known places like ads.x10.com, since such a request is most likely a pop-up and not an intended click.

    Creativity steps in here. You will have to assume that a user who requests /index.html, /left.html, and /right.html within 5 seconds just loaded a frameset; consider that one request. Now your user gets /index.html, then /blue.html, then /red.html, then /green.html over a 3-minute period. That's four requests that did not occur within a few-second window. The user should be considered "surfing" for those 3 minutes, because those requests were more than likely made by a human.
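    A rough sketch combining both ideas, again assuming made-up "<epoch_seconds> <client_ip> <url>" log lines: drop requests to known ad hosts, ignore gaps small enough to be a single frameset load, and count everything else (up to a cutoff) as surfing time.

        #!/usr/bin/perl
        use strict;
        use warnings;

        my %ad_host = map { $_ => 1 } qw(ads.x10.com);   # extend with other known ad servers
        my $burst   = 5;      # seconds; requests closer together than this are one page load
        my $timeout = 3600;   # seconds; gaps longer than this mean the user stopped surfing

        my (%last, %surfed);

        while (<>) {
            my ($time, $ip, $url) = split ' ', $_;
            next unless defined $url;

            my ($host) = $url =~ m{^(?:\w+://)?([^/:]+)};
            next if defined $host && $ad_host{$host};    # probably a pop-up, not a click

            if (exists $last{$ip}) {
                my $gap = $time - $last{$ip};
                $surfed{$ip} += $gap if $gap >= $burst && $gap < $timeout;
            }
            $last{$ip} = $time;
        }

        printf "%-15s %6.1f minutes\n", $_, $surfed{$_} / 60 for sort keys %surfed;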

    Bringing relative time into the situation can cause some headaches, too. When I surf, I usually have 3-10 windows open. You will need to find some way to distinguish a clicked link from a cold request, and a method to determine what kind of information is held on the requested page. You could do that with LWP: read through the file to see if there are tags for Shockwave games, streaming video, or large amounts of text (online books), and draw conclusions from those results and your request log.
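    For the content check, a hedged sketch with LWP::Simple: it just counts a few tell-tale tags and the rough amount of visible text, and the 2000-word threshold is made up for illustration.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use LWP::Simple qw(get);

        my $url  = shift or die "usage: $0 <url>\n";
        my $html = get($url);
        die "couldn't fetch $url\n" unless defined $html;

        # Very crude heuristics; a real classifier would use HTML::Parser or the like.
        my $embeds = () = $html =~ /<(?:embed|object)\b/gi;   # Shockwave/Flash, streaming plugins
        (my $text  = $html) =~ s/<[^>]*>//g;                  # strip tags to estimate visible text
        my $words  = () = $text =~ /\S+/g;

        print "$url: $embeds embedded object(s), roughly $words words of text\n";
        print "Long read; a human probably stays on this page a while.\n" if $words > 2000;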

    It can be done, but it will be difficult without some other form of recording information. If you have the resources, you could set up a web proxy that monitors clicked links and observe information that way. Or you could avoid Perl altogether: install BackOrifice on the machine you want to monitor and watch what your users are viewing.

    Anonymonk, might I suggest that you create a user name and stay awhile. :)

    John J Reiser
    newrisedesigns.com

Re: Perl Programming Logic
by perigeeV (Hermit) on Jul 01, 2002 at 19:08 UTC

    To accurately track a person's click path you need more than logs. Log information cannot differentiate between users sharing a proxy, or between different users that have been sequentially assigned the same communal IP address, as with dialup ISP users.

    You can assign a session ID to a user and track that ID number. Super Search for "maintaining state" or some such.

    To really know how long someone is viewing a page, you would have to use a client refresh at timed intervals. For instance, you could have some JavaScript that updates a dummy one-pixel image at regular intervals. The refresh would include that user's unique ID, so you just sum the times between each refresh.
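    On the server side, the beacon can be a tiny CGI script that records the hit and hands back an existing one-pixel image; the log path and the "id" parameter below are hypothetical.

        #!/usr/bin/perl
        # Hypothetical beacon handler: record the hit, then point the browser at a static 1x1 image.
        use strict;
        use warnings;
        use CGI qw(:standard);

        my $id = param('id') || 'unknown';
        $id =~ s/[^\w-]//g;                     # never trust what the client sends

        open my $log, '>>', '/var/log/beacon.log' or die "beacon log: $!";
        print {$log} time(), " $id\n";
        close $log;

        print redirect('/images/blank.gif');    # any 1x1 image already on the server will do

    Viewing time per ID is then just the sum of the gaps between consecutive beacon hits, keyed on the session ID instead of the client IP.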


Re: Perl Programming Logic
by Abigail-II (Bishop) on Jul 02, 2002 at 10:12 UTC
    Your biggest problem is not the logic of the problem, but the logic of what you want to do.

    You are data mining over HTTP log files. HTTP is essentially a stateless, sessionless protocol. Yet you want to measure the length of "sessions" somehow.

    You're headed for failure. You want to measure something that isn't really there. And it isn't just "quitting browsing" that will spoil your day. When I hit "preview" in a minute, it will take a while before the request reaches perlmonks, perlmonks does what it wants to do, the response comes back, the ad is fetched, and the page is displayed. I'll switch to IRC, p5p, or some actual work before I return my attention to the preview page. There might be 20 minutes between hitting 'preview' and 'submit' on the next page. Did I "browse" for 20 minutes? No, I probably didn't even spend 20 seconds.

    Oh, did I mention that the user name for the proxy I'm using is shared with a whole bunch of people, and that we're rotating between several proxies? That would really screw up your analysis, wouldn't it? ;-)

    My suggestion: give up on the idea. It's utterly useless, the data you have can't measure what you want to measure, and what you want to measure doesn't have much connection to what you want to know anyway.

    Abigail

Re: Perl Programming Logic
by grantm (Parson) on Jul 02, 2002 at 11:48 UTC

    The author of Analog has written an article on how the web works and what can and can't be determined by analysing logs.

    I have spent quite a lot of time working with both Analog and WebTrends. I recommend the former (with Report Magic) for people who want to understand site usage patterns and the latter for people who have lots of money and a willingness to base business decisions on nicely presented but completely meaningless numbers. Even when I phrase it exactly like that, it's amazing how many clients go for the latter.

Re: Perl Programming Logic
by arc_of_descent (Hermit) on Jul 02, 2002 at 13:59 UTC

    Hi,
    If the time spent retrieving a particular page matches what you mean by time spent viewing the page, then that value could be of use to you. For example, the squid proxy server writes the time spent (in milliseconds) retrieving each web object into its logs.
    --
    arc_of_descent