Hi. I'm parsing download statistics from my Squid web logs with Perl and summarizing the file name, size, and number of downloads on a page for developers. After looking at the huge number of downloads we've apparently been getting (several gigabytes a night), I realized that something must be amiss. I think my script isn't taking into account dialup users and others who get a 206 (Partial Content) status and then continue their download later. Here's the code I have so far:
use strict;
use warnings;
use File::Basename;
use File::stat;
use File::Find;
use Cwd;
my ($root) = getcwd =~ /(.*)/;
my $total;
find( {
untaint_pattern=>'.*',
no_chdir => 1,
wanted => sub {
return unless /MyFoo.*\z/;
my $v_snap_file = $File::Find::name;
my $basefile = basename($v_snap_file);
# I know this is evil, it's a hack.
my $count = `/bin/grep $basefile /var/log/squid/access.log | /usr/bin/wc -l`;
$count =~ s/^\s+//g;
my $v_sb = stat("$v_snap_file");
my $v_filesize = $v_sb->size;
my $v_bprecise = sprintf "%.0f", ($v_filesize);
my $v_bsize = insert_commas($v_bprecise);
my $v_kprecise = sprintf "%.0f",
($v_filesize/1024);
my $v_ksize = insert_commas($v_kprecise);
my $v_filedate = scalar localtime $v_sb->mtime;
my $basename_v = basename($v_snap_file);
print "File Name..: $basename_v\n";
print "File Size..: $v_bsize bytes ($v_ksize kb)\n";
print "Downloads..: ", insert_commas($count);
my $tbytes = $v_filesize * $count;
print "Total bytes: ",
insert_commas($tbytes), "\n\n";
$total += $tbytes;
}
}, $root);
print "\n", "-"x40, "\n";
print "Final total bytes: ", insert_commas($total), "\n\n";
sub insert_commas {
my $text = reverse $_[0];
$text =~ s/(\d{3})(?=\d)(?!\d*\.)/$1,/g;
return scalar reverse $text;
}
The Squid log entries look like this (Yes, these are real entries):
wdcsun28.usdoj.gov - - [07/Aug/2003:04:58:15 -0700] "GET http://dl.domain.org/MyFoo-file.zip HTTP/1.0" 200 1607158 TCP_MISS:DIRECT
wdcsun28.usdoj.gov - - [07/Aug/2003:05:03:33 -0700] "GET http://dl.domain.org/MyFoo-file.zip HTTP/1.0" 200 8224380 TCP_MISS:DIRECT
The numeric value right before "TCP_MISS:DIRECT" is the number of bytes sent for that request. Notice that this generated two hits for what is basically one download. The real final size of 'MyFoo-file.zip' is 8224380 bytes, just a little over 8 megs.
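To make the field layout concrete, here is a rough sketch of pulling the URL, status, and byte count out of one entry. It assumes every line follows the same layout as the two samples above; parse_entry is just a name I made up for illustration, and I haven't run it against the full log.

```perl
use strict;
use warnings;

# Hypothetical helper: split one log line into the fields that matter here.
# Returns (url, status, bytes), or an empty list if the line doesn't match.
sub parse_entry {
    my ($line) = @_;
    return unless $line =~ m{
        "GET \s+ (\S+) \s+ HTTP/[\d.]+"   # request: method, URL, protocol
        \s+ (\d{3})                        # HTTP status code
        \s+ (\d+)                          # bytes sent for this request
    }x;
    return ($1, $2, $3);
}

# First sample entry from the log above.
my $line = 'wdcsun28.usdoj.gov - - [07/Aug/2003:04:58:15 -0700] '
         . '"GET http://dl.domain.org/MyFoo-file.zip HTTP/1.0" 200 1607158 TCP_MISS:DIRECT';

my ($url, $status, $bytes) = parse_entry($line);
print "$url $status $bytes\n";
```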
When I count these hits in the logs and generate the stats for the number of bytes downloaded, I'd like to ignore the ones that are not "full" file downloads, by comparing that byte count against the actual file size.
Any ideas how I can do this? The code above works, but it counts ALL hits in the logs, not just the "completed" ones. Did that make sense?
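For what it's worth, here is a minimal sketch of the kind of filter I'm picturing in place of the grep/wc hack. It assumes a complete download always shows up as a single entry with status 200 and a byte count exactly equal to the file's size on disk, as in the second sample entry. count_full_downloads is a made-up name, and the demo reads the two sample entries from an in-memory filehandle rather than the real access.log.

```perl
use strict;
use warnings;

# Hypothetical helper: count log lines that look like complete downloads
# of $basefile, i.e. status 200 and a byte count equal to $filesize.
sub count_full_downloads {
    my ($log_fh, $basefile, $filesize) = @_;
    my $count = 0;
    while (my $line = <$log_fh>) {
        # \Q...\E keeps dots and other metacharacters in the name literal.
        next unless $line =~ m{/\Q$basefile\E\s+HTTP/\S+"\s+(\d{3})\s+(\d+)};
        my ($status, $bytes) = ($1, $2);
        $count++ if $status == 200 && $bytes == $filesize;
    }
    return $count;
}

# Demo on the two sample entries shown above.
my $sample = <<'LOG';
wdcsun28.usdoj.gov - - [07/Aug/2003:04:58:15 -0700] "GET http://dl.domain.org/MyFoo-file.zip HTTP/1.0" 200 1607158 TCP_MISS:DIRECT
wdcsun28.usdoj.gov - - [07/Aug/2003:05:03:33 -0700] "GET http://dl.domain.org/MyFoo-file.zip HTTP/1.0" 200 8224380 TCP_MISS:DIRECT
LOG

open my $fh, '<', \$sample or die "Cannot open in-memory log: $!";
print count_full_downloads($fh, 'MyFoo-file.zip', 8224380), "\n";
close $fh;
```

In the real script, the filehandle would come from opening /var/log/squid/access.log, and $filesize from the stat call I already have. The obvious gap is that a download finished in several 206 pieces would never be counted this way.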