
Sorry, I didn't include log entries with 206s in them, but that isn't important here. Notice that I get the local file size by running this Perl script in the directory where the downloaded files are located; that's how I calculate how many total bytes have been downloaded through the webserver. The steps are (a sketch follows the list):
  1. Stat the local file
  2. Get the local filename from the file on disk
  3. Parse the Squid logs for entries matching that filename
  4. Multiply the number of matching log entries by the file's on-disk size in bytes
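
Roughly, in sketch form (the log path and field positions are assumptions; Squid's native format puts the reply size in field 5 and the URL in field 7):

    use strict;
    use warnings;
    use File::Basename;

    # Pass 1: count log entries per filename.
    my %hits;
    open my $log, '<', 'access.log' or die "access.log: $!";
    while (my $line = <$log>) {
        my $url = (split ' ', $line)[6];    # URL field, Squid native format
        next unless defined $url;
        $hits{ basename($url) }++;
    }
    close $log;

    # Pass 2: multiply each file's hit count by its on-disk size.
    my $total = 0;
    for my $file (glob '*') {
        next unless -f $file;
        my $size = -s $file;                # stat: size in bytes
        $total += ($hits{$file} || 0) * $size;
    }
    print "Estimated total bytes downloaded: $total\n";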

What I need to do, I think, is compare the size of the local file with the file size value in the log entry, and if they match, count it as a "completed" download. If not, ignore it. I'm not sure this is accurate either, though, because some downloaders can "resume" partial downloads.
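
Continuing the sketch above, the comparison would look something like this (again assuming the logged size is the payload; some log formats include response headers in that field, so an exact match may need a little slack):

    my %completed;
    open my $log, '<', 'access.log' or die "access.log: $!";
    while (my $line = <$log>) {
        my ($bytes, $url) = (split ' ', $line)[4, 6];
        next unless defined $url;
        my $name = basename($url);
        # Count the entry only if the logged size equals the size on disk.
        $completed{$name}++ if -f $name && $bytes == -s $name;
    }
    close $log;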

Another thought springs to mind, though: what if I just sum the byte-size values in the log itself, on a per-file basis, so I'm only parsing bytes out of the logs, not sizes from local files? That would at least let me see how many bytes the server sent to clients, but now I have to correlate that on a per-file basis, which could require multiple passes through the logs. Not fun.
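
(Though on reflection, a hash keyed by filename would keep it to one pass; continuing the same sketch:)

    my %bytes_by_file;
    open my $log, '<', 'access.log' or die "access.log: $!";
    while (my $line = <$log>) {
        my ($bytes, $url) = (split ' ', $line)[4, 6];
        next unless defined $url;
        $bytes_by_file{ basename($url) } += $bytes;   # per-file total, one pass
    }
    close $log;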

I'm open to other ideas, if anyone has them.


Re: Re: Re: File download statistics parsing
by esh (Pilgrim) on Aug 07, 2003 at 20:57 UTC
    The log entries with a 206 return code do matter, especially when you have large files being downloaded. Since you are seeing 206s in your logs, you will need to take these into account to get results that are anywhere near accurate.

    The best algorithm for this is to go through the log files keeping track of how many bytes each user downloaded of each file. If those bytes add up to at least the size of the file, then the user probably completed a download.
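
    In sketch form (field positions are assumptions based on Squid's native log format, and "user" here is just the client address):

        use strict;
        use warnings;
        use File::Basename;

        # Sum bytes per client per file; a client that has received at
        # least the file's size probably finished the download.
        my %got;
        open my $log, '<', 'access.log' or die "access.log: $!";
        while (my $line = <$log>) {
            my ($client, $bytes, $url) = (split ' ', $line)[2, 4, 6];
            next unless defined $url;
            $got{$client}{ basename($url) } += $bytes;
        }
        close $log;

        for my $client (sort keys %got) {
            for my $name (sort keys %{ $got{$client} }) {
                next unless -f $name && $got{$client}{$name} >= -s $name;
                print "$client probably completed $name\n";
            }
        }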

    Unfortunately, it is not going to be possible for you to get a truly accurate count of how many downloads completed successfully and how many were only partial.

    I see two problems you will face, given the structure of your logs:

    1. Your logs do not show the starting position for a 206 partial download (most log formats don't). Without this, you won't know if a user completed the whole download or just started it twice, downloading the first half each time.

    2. There does not seem to be any good way of uniquely identifying a user in your logs. Without this, it will be difficult to match up multiple 206 returns to add up the sizes to see if an individual user probably did or did not complete the full download.

    You may be able to get a better estimate than your current algorithm by assuming there is one user per IP address and adding up the bytes downloaded from each IP address. This can be improved by looking at the time between requests: if half an hour (you decide how long) passes with no request from an IP address, then further 206 responses are probably a new download attempt.
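
    Continuing the sketch above, one way to fold in the time gap (the epoch timestamp in field 1 is an assumption from Squid's native format):

        # Reset a client's running byte count after a long silence and
        # treat whatever follows as a fresh download attempt.
        my $gap = 30 * 60;                    # half an hour, in seconds
        my (%bytes, %last_seen);
        open my $log, '<', 'access.log' or die "access.log: $!";
        while (my $line = <$log>) {
            my ($ts, $client, $size, $url) = (split ' ', $line)[0, 2, 4, 6];
            next unless defined $url;
            my $key = "$client " . basename($url);
            $bytes{$key} = 0
                if exists $last_seen{$key} && $ts - $last_seen{$key} > $gap;
            $last_seen{$key} = $ts;
            $bytes{$key} += $size;
        }
        close $log;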

    One more hint: your 206 sizes may add up to a bit more than the original file size even for a simple, successful download. This happens with browsers that don't start the next segment right where the previous one left off, but instead re-request the tail end of the previous segment (presumably to make sure it matches what they got back from the previous request).

    If you have control over more than just the log file parser, you might insert a random parameter into each download URL so that you can track users better than by IP address. For example, instead of

    href="/MyFoo-file.zip"
    you could set it to
    href="/MyFoo-file.zip?p=RANDOMNUMBER&ext=.zip"
    where "RANDOMNUMBER" is something likely to be unique, generated at page load time by your preferred page-generation technique.

    Note that the parameters on this URL will be completely ignored by the server, but they will still get logged to the web server access log you are parsing.

    The "&ext=.zip" is a trick to get some broken browser versions to download and save the file with the right extension. Just make sure the complete URL ends with the extension of the original file.