
The log entries with a 206 return code do matter, especially when you have large files being downloaded. Since you are seeing 206 in your logs you will need to take these into account to get your results anywhere near accurate.

The best algorithm for this is to go through the log files keeping track of how many bytes each user downloaded. If they add up to at least the size of the file, then the user probably completed a download.

Unfortunately, you will not be able to get a truly accurate count of how many downloads completed successfully and how many were only partial.

I see two problems you will be faced with given the structure of your logs:

1. Your logs do not show the starting position for a 206 partial download (most log formats don't). Without this, you won't know if a user completed the whole download or just started it twice, downloading the first half each time.

2. There does not seem to be any good way of uniquely identifying a user in your logs. Without this, it will be difficult to match up multiple 206 returns to add up the sizes to see if an individual user probably did or did not complete the full download.

You may be able to get a better estimate than your current algorithm by assuming there is one user per IP address and adding up the bytes downloaded from each IP address. This can be improved by looking at the time between requests. If there is a half hour (you decide how long) with no request from an IP address, then further 206 responses are probably a new download attempt.
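As a rough illustration of the per-IP approach, here is a minimal sketch. It assumes you have already parsed the log into (ip, unix_time, status, bytes_sent) tuples for a single file; the function name, the tuple layout, and the 1800-second default gap are my own choices for the example, not anything taken from your logs.

```python
from collections import defaultdict

def count_probable_downloads(entries, file_size, gap_seconds=1800):
    """Estimate completed downloads of ONE file.

    entries: iterable of (ip, unix_time, status, bytes_sent) tuples
             (hypothetical layout, already filtered to a single file).
    """
    # Collect the 200 and 206 responses for each IP address.
    by_ip = defaultdict(list)
    for ip, when, status, sent in entries:
        if status in (200, 206):
            by_ip[ip].append((when, sent))

    completed = 0
    for hits in by_ip.values():
        hits.sort()          # process each IP's requests in time order
        total = 0            # bytes seen in the current download attempt
        last_seen = None
        for when, sent in hits:
            # A long silence from this IP starts a new attempt.
            if last_seen is not None and when - last_seen > gap_seconds:
                if total >= file_size:
                    completed += 1
                total = 0
            total += sent
            last_seen = when
        # Close out the final attempt for this IP.
        if total >= file_size:
            completed += 1
    return completed
```

This counts at most one completed download per attempt, and it is still only an estimate: it cannot tell a finished download from the same half being fetched twice.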

One more hint: Your 206 sizes may add up to a bit more than the original file size for a simple, successful download. This will happen for browsers that don't start the next segment right where the previous left off, but rather ask for the tail end of the previous segment (presumably to make sure that it matches what they got back from the previous request).

If you have control over more than just the log file parser, you might insert a random parameter into each download URL so that you can track users better than by IP address alone. For example, instead of

    href="/MyFoo-file.zip"

you could set it to

    href="/MyFoo-file.zip?p=RANDOMNUMBER&ext=.zip"

where "RANDOMNUMBER" is something likely to be unique, generated at page load time by your preferred page generation technique.

Note that the parameters on this URL will be completely ignored when the web server serves the file, but they will get logged to the web server access log which you are parsing.

The "&ext=.zip" is a trick to get some broken browser versions to download and save the file with the right extension. Just make sure the complete URL ends with the extension of the original file.
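To make the idea concrete, here is a small sketch of both halves: building the tagged link at page generation time and pulling the value back out of the logged request. The helper names are hypothetical, and using a random hex token in place of RANDOMNUMBER is my own assumption; any value that is likely to be unique would do.

```python
import secrets
from urllib.parse import urlsplit, parse_qs

def tagged_href(path, ext):
    """Build a download URL carrying a random 'p' token (hypothetical helper)."""
    token = secrets.token_hex(8)   # 16 hex characters, very likely unique
    # Keep 'ext' last so the complete URL ends with the file's extension.
    return f"{path}?p={token}&ext={ext}"

def token_from_request(request_target):
    """Return the 'p' value from a logged request path, or None if absent."""
    query = urlsplit(request_target).query
    values = parse_qs(query).get("p")
    return values[0] if values else None
```

In the log parser you could then group 206 responses by this token instead of (or in addition to) the IP address.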

In reply to Re: Re: Re: File download statistics parsing by esh
in thread File download statistics parsing by Anonymous Monk
