Hi. I'm parsing download statistics from my Squid web logs with Perl and summarizing the file name, size, and number of downloads on a page for developers. After looking at the huge number of downloads we've apparently been getting (several gigabytes a night), I realized that something must be amiss. I think my script isn't taking into account dialup users and others who get a 206 (Partial Content) status and then continue their download later. Here's the code I have so far:

use strict;
use warnings;

use File::Basename;
use File::stat;
use File::Find;
use Cwd;

my ($root) = getcwd =~ /(.*)/;

my $total;

find( {
    untaint_pattern=>'.*',
    no_chdir => 1,
    wanted => sub {
        return unless /MyFoo.*\z/;

        my $v_snap_file   = $File::Find::name;
        my $basefile      = basename($v_snap_file);

        # I know this is evil, it's a hack. 
        my $count = `/bin/grep $basefile /var/log/squid/access.log | /usr/bin/wc -l`;
        $count =~ s/^\s+//g;

        my $v_sb          = stat("$v_snap_file");
        my $v_filesize    = $v_sb->size;
        my $v_bprecise    = sprintf "%.0f", ($v_filesize);
        my $v_bsize       = insert_commas($v_bprecise);
        my $v_kprecise    = sprintf "%.0f",
                            ($v_filesize/1024);
        my $v_ksize       = insert_commas($v_kprecise);
        my $v_filedate    = scalar localtime $v_sb->mtime;
        my $basename_v    = basename($v_snap_file);

        print "File Name..: $basename_v\n";
        print "File Size..: $v_bsize bytes ($v_ksize kb)\n";
        print "Downloads..: ", insert_commas($count);

        my $tbytes        = $v_filesize * $count;
        print "Total bytes: ", 
                            insert_commas($tbytes), "\n\n";
        $total += $tbytes;
    }
}, $root);

print "\n", "-"x40, "\n";
print "Final total bytes: ", insert_commas($total), "\n\n";

sub insert_commas {
        my $text = reverse $_[0];
        $text =~ s/(\d{3})(?=\d)(?!\d*\.)/$1,/g;
        return scalar reverse $text;
}

The Squid log entries look like this (Yes, these are real entries):

wdcsun28.usdoj.gov - - [07/Aug/2003:04:58:15 -0700] "GET http://dl.domain.org/MyFoo-file.zip HTTP/1.0" 200 1607158 TCP_MISS:DIRECT

wdcsun28.usdoj.gov - - [07/Aug/2003:05:03:33 -0700] "GET http://dl.domain.org/MyFoo-file.zip HTTP/1.0" 200 8224380 TCP_MISS:DIRECT

The numeric value right before "TCP_MISS:DIRECT" is the number of bytes transferred for that request. Notice that this generated two hits for what is basically one download. The real final size of 'MyFoo-file.zip' is 8224380 bytes, just a little over 8 megs.

When I count these hits in the logs and generate the stats for the number of bytes downloaded, I'd like to ignore the ones that are not "full" file downloads, by comparing that byte count against the file's actual size.

Any ideas how I can do this? The code above works; it just counts ALL hits in the logs, not only the "completed" ones. Did that make sense?
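To make the goal concrete, here is a minimal sketch of the kind of filter I have in mind, not tested against a real log. It assumes the log format shown above (status code and byte count right after the quoted request), and the name count_full_downloads is just something made up for illustration. It counts a hit only when the status is 200 and the byte count equals the file's size on disk:

```perl
use strict;
use warnings;

# Hypothetical helper: count the log lines that represent a complete
# download of $basefile, i.e. status 200 and bytes sent == $filesize.
sub count_full_downloads {
    my ($log_lines, $basefile, $filesize) = @_;
    my $count = 0;
    for my $line (@$log_lines) {
        # Match: "GET <url ending in /basefile> HTTP/x.y" <status> <bytes>
        next unless $line =~ m{
            "GET \s+ \S*/ \Q$basefile\E \s+ HTTP/[\d.]+ "
            \s+ (\d+)    # HTTP status code
            \s+ (\d+)    # bytes transferred
        }x;
        my ($status, $bytes) = ($1, $2);
        $count++ if $status == 200 && $bytes == $filesize;
    }
    return $count;
}

# The two sample entries from the log above.
my @sample = (
    'wdcsun28.usdoj.gov - - [07/Aug/2003:04:58:15 -0700] "GET http://dl.domain.org/MyFoo-file.zip HTTP/1.0" 200 1607158 TCP_MISS:DIRECT',
    'wdcsun28.usdoj.gov - - [07/Aug/2003:05:03:33 -0700] "GET http://dl.domain.org/MyFoo-file.zip HTTP/1.0" 200 8224380 TCP_MISS:DIRECT',
);

my $full = count_full_downloads(\@sample, 'MyFoo-file.zip', 8224380);
print "Complete downloads: $full\n";    # prints "Complete downloads: 1"
```

In the real script the lines would come from reading /var/log/squid/access.log instead of the grep pipeline, and $filesize would be the $v_filesize from stat. Note that this sketch skips 206 responses entirely, so a download that was resumed and finished in pieces would not be counted at all; I'm not sure whether that is the right trade-off.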

In reply to File download statistics parsing by Anonymous Monk
