PerlMonks  

File download statistics parsing

by Anonymous Monk
on Aug 07, 2003 at 12:15 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi. I'm parsing some download statistics from my Squid web logs with Perl, and summarizing the file name, size, and number of downloads in a page for developers. After I started looking at the huge number of downloads we've been getting (several gigabytes a night), I realized that something must be amiss. I think my script isn't taking into account dialup users and others who get a 206 status (Partial Content) and then continue their download later. Here's the code I have so far:
use strict;
use warnings;
use File::Basename;
use File::stat;
use File::Find;
use Cwd;

my ($root) = getcwd =~ /(.*)/;    # untaint the current directory
my $total  = 0;

find( {
    untaint_pattern => '.*',
    no_chdir        => 1,
    wanted          => sub {
        return unless /MyFoo.*\z/;

        my $v_snap_file = $File::Find::name;
        my $basefile    = basename($v_snap_file);

        # I know this is evil, it's a hack.
        my $count = `/bin/grep $basefile /var/log/squid/access.log | /usr/bin/wc -l`;
        $count =~ s/^\s+|\s+$//g;

        my $v_sb       = stat($v_snap_file);
        my $v_filesize = $v_sb->size;
        my $v_bsize    = insert_commas( sprintf "%.0f", $v_filesize );
        my $v_ksize    = insert_commas( sprintf "%.0f", $v_filesize / 1024 );
        my $v_filedate = scalar localtime $v_sb->mtime;    # not printed yet

        print "File Name..: $basefile\n";
        print "File Size..: $v_bsize bytes ($v_ksize kb)\n";
        print "Downloads..: ", insert_commas($count), "\n";

        my $tbytes = $v_filesize * $count;
        print "Total bytes: ", insert_commas($tbytes), "\n\n";
        $total += $tbytes;
    },
}, $root );

print "\n", "-" x 40, "\n";
print "Final total bytes: ", insert_commas($total), "\n\n";

sub insert_commas {
    my $text = reverse $_[0];
    $text =~ s/(\d{3})(?=\d)(?!\d*\.)/$1,/g;
    return scalar reverse $text;
}
The Squid log entries look like this (Yes, these are real entries):
wdcsun28.usdoj.gov - - [07/Aug/2003:04:58:15 -0700] "GET http://dl.domain.org/MyFoo-file.zip HTTP/1.0" 200 1607158 TCP_MISS:DIRECT
wdcsun28.usdoj.gov - - [07/Aug/2003:05:03:33 -0700] "GET http://dl.domain.org/MyFoo-file.zip HTTP/1.0" 200 8224380 TCP_MISS:DIRECT
The numeric value right before the "TCP_MISS:DIRECT" is the file size. Notice that this generated two hits for what is basically one download. The real final size of 'MyFoo-file.zip' is 8224380 bytes, just a little over 8 megs.
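For reference, the fields can be pulled out of an entry in this format with a single pattern; a minimal sketch, reading the log on standard input (variable names are mine):

use strict;
use warnings;

while (my $line = <STDIN>) {
    my ($host, $url, $status, $bytes) =
        $line =~ m{^(\S+) \S+ \S+ \[[^\]]+\] "GET (\S+) HTTP/[\d.]+" (\d+) (\d+)}
        or next;
    print "$host got $url: status $status, $bytes bytes\n";
}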

When I count these hits in the logs, and generate the stats for the number of bytes downloaded, I'd like to ignore the ones that are not "full" file downloads, by looking at that file size.

Any ideas how I can do this? The code above works, it just counts ALL hits in the logs, not "completed" hits in the logs. Did that make sense?

Replies are listed 'Best First'.
Re: File download statistics parsing
by BrowserUk (Patriarch) on Aug 07, 2003 at 12:59 UTC

    Method 1

    Pass 1: Build a hash using the path + size as the key and a count of the number of times you saw that combination as the value.

    Pass 2: Iterate over the keys, extracting the path, and use stat or -s to get the real size. If the size doesn't match, delete that entry from the hash.

    Pass 3: Sum the sizes extracted from the remaining keys, multiplied by the associated counts. You have the total downloaded. A sketch of all three passes follows.
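    A minimal sketch of those three passes, assuming (as in the original script) that it runs from the directory holding the files, and that the log fields match the samples above:

    use strict;
    use warnings;

    my %seen;    # "basename|logged size" => hit count

    # Pass 1: count each path+size combination in the log.
    open my $log, '<', '/var/log/squid/access.log' or die "access.log: $!";
    while (<$log>) {
        my ($file, $bytes) = m{"GET \S*/(\S+) HTTP/[\d.]+" \d+ (\d+)} or next;
        $seen{"$file|$bytes"}++;
    }
    close $log;

    # Pass 2: delete combinations whose logged size isn't the real file size.
    for my $key (keys %seen) {
        my ($file, $bytes) = split /\|/, $key;
        delete $seen{$key} unless -f $file and -s $file == $bytes;
    }

    # Pass 3: sum size * count over the surviving keys.
    my $total = 0;
    while (my ($key, $count) = each %seen) {
        my (undef, $bytes) = split /\|/, $key;
        $total += $bytes * $count;
    }
    print "Total downloaded: $total bytes\n";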

    Method 2

    Build a hash containing path as the key and size as the value of all the files available for download. Save this in a file using Storable or similar.

    Reload this hash at the start of the program and then, as you parse the log, compare the size from the log with that in the hash and skip those where the sizes differ.

    The latter method is good if the list of files available is fixed or changes only infrequently. Just re-run the indexer script each time a new or modified file is made available. If the list is huge and the indexer takes a long time to run, have a third script that allows you to manually update the stored hash with a new or modified entry.
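    A sketch of the indexer for this second method (the directory and index-file paths are placeholders):

    use strict;
    use warnings;
    use File::Find;
    use Storable qw(nstore retrieve);

    my $index = '/var/tmp/download-sizes.sto';

    # Indexer: basename => size for every file available for download.
    my %size_of;
    find( sub { $size_of{$_} = -s _ if -f }, '/var/www/downloads' );
    nstore \%size_of, $index;

    # In the log parser, reload it and skip mismatched hits:
    my $known = retrieve $index;
    # ... inside the log loop, with $file and $bytes parsed from the entry:
    # next unless exists $known->{$file} and $known->{$file} == $bytes;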


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
    If I understand your problem, I can solve it! Of course, the same can be said for you.

Re: File download statistics parsing
by dda (Friar) on Aug 07, 2003 at 12:47 UTC
    Where in the log file is status 206? I can't see one. I assume that the only way out is to create a hash of your file names with sizes, and to count only full downloads based on that hash. You can fill that hash automatically if you know where your files are placed.

    --dda

      Sorry, I didn't include log entries with 206's in them, but that isn't important here. Notice that I get the local file size by running this perl script in the directory where the downloaded files are located. That's how I calculate how many total bytes have been downloaded through the webserver.
      1. Stat local file
      2. Get local filename from file on disk
      3. Parse Squid logs for entries which match that filename
      4. Multiply number of entries in the logs for that filename by number of bytes that the file-on-disk occupies

      What I need to do, I think, is compare the file size of the local file, with the file size value in the log entry, and if it matches, count it as a "completed" download. If not, ignore it. I'm not sure this is accurate either though, because some downloaders can "resume" partial downloads.

      Another thought springs to mind, though: what if I just sum the byte-size values in the log itself, on a per-file basis, so I'm only counting bytes out of the logs, not bytes from local files? That would at least let me see how many bytes the server sent to clients, but now I have to somehow correlate that on a per-file basis, which could require multiple passes through the logs. Not fun.
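      Actually, a hash keyed by file name gets that in a single pass; a rough sketch (log path and field layout from the samples above assumed):

      use strict;
      use warnings;

      my %bytes_for;
      open my $log, '<', '/var/log/squid/access.log' or die "access.log: $!";
      while (<$log>) {
          my ($file, $bytes) = m{"GET \S*/(\S+) HTTP/[\d.]+" \d+ (\d+)} or next;
          $bytes_for{$file} += $bytes;    # 200s and 206s alike: bytes actually sent
      }
      close $log;

      printf "%-40s %15s\n", $_, $bytes_for{$_} for sort keys %bytes_for;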

      I'm open to other ideas, if anyone has them.

        The log entries with a 206 return code do matter, especially when you have large files being downloaded. Since you are seeing 206 in your logs you will need to take these into account to get your results anywhere near accurate.

        The best algorithm for this is to go through the log files keeping track of how many bytes each user downloaded. If they add up to at least the size of the file, then the user probably completed a download.

        Unfortunately, it is not going to be possible for you to get a truly accurate count of how many downloads completed successfully and how many were just partial.

        I see two problems you will be faced with given the structure of your logs:

        1. Your logs do not show the starting position for a 206 partial download (most log formats don't). Without this, you won't know if a user completed the whole download or just started it twice, downloading the first half each time.

        2. There does not seem to be any good way of uniquely identifying a user in your logs. Without this, it will be difficult to match up multiple 206 returns to add up the sizes to see if an individual user probably did or did not complete the full download.

        You may be able to get a better estimate than your current algorithm by assuming there is one user per IP address and adding up the bytes downloaded from each IP address. This can be improved by looking at the time between requests. If there is a half hour (you decide how long) with no request from an IP address, then further 206 responses are probably a new download attempt.
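        A sketch of that per-IP estimate, for a single file (the half-hour gap, the log path, the field layout, and the hard-coded size are all assumptions to tune):

        use strict;
        use warnings;
        use Time::Local qw(timegm);

        my %mon = ( Jan=>0, Feb=>1, Mar=>2, Apr=>3, May=>4,  Jun=>5,
                    Jul=>6, Aug=>7, Sep=>8, Oct=>9, Nov=>10, Dec=>11 );
        my $gap       = 30 * 60;       # idle this long => new download attempt
        my $file_size = 8_224_380;     # real size of MyFoo-file.zip, from stat
        my (%running, %last_seen);
        my $completed = 0;

        open my $log, '<', '/var/log/squid/access.log' or die "access.log: $!";
        while (<$log>) {
            my ($host, $d, $m, $y, $H, $M, $S, $bytes) = m{
                ^(\S+) \s \S+ \s \S+ \s
                \[ (\d+) / (\w+) / (\d+) : (\d+) : (\d+) : (\d+) \s [^\]]+ \]
                \s " GET \s \S*/MyFoo-file\.zip \s [^"]* " \s \d+ \s (\d+)
            }x or next;

            # The zone offset is ignored: only relative times matter for the gap.
            my $t = timegm($S, $M, $H, $d, $mon{$m}, $y);
            $running{$host} = 0
                if exists $last_seen{$host} and $t - $last_seen{$host} > $gap;
            $last_seen{$host} = $t;

            $running{$host} += $bytes;
            if ($running{$host} >= $file_size) {   # enough bytes for one full copy
                $completed++;
                $running{$host} -= $file_size;
            }
        }
        close $log;
        print "Estimated completed downloads: $completed\n";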

        One more hint: Your 206 sizes may add up to a bit larger than the original file size for a simple, successful download. This will happen for browsers that don't start the next segment right where the previous left off, but rather ask for the tail end of the previous segment (presumably to make sure that it matches what they got back from the previous request).

        If you have control over more than just the log file parser, you might insert a random parameter into each download URL so that you can track users better than IP address. For example, instead of

        href="/MyFoo-file.zip"
        you could set it to
        href="/MyFoo-file.zip?p=RANDOMNUMBER&ext=.zip"
        where "RANDOMUNMBER" is something likely to be unique generated at page load time by your preferred page generation technique.

        Note that the parameters on this URL will be completely ignored, but they will get logged to the web server access log which you are parsing.

        The "&ext=.zip" is a trick to get some broken browser versions to download and save the file with the right extension. Just make sure the complete URL ends with the extension of the original file.

Re: File download statistics parsing
by l2kashe (Deacon) on Aug 07, 2003 at 14:48 UTC
    # snip ...
    # I know this is evil, it's a hack.
    my $count = `/bin/grep $basefile /var/log/squid/access.log | /usr/bin/wc -l`;
    $count =~ s/^\s+//g;

    # so let's not hack it..
    open(IN, "/var/log/squid/access.log")
        or die "opening access.log: $!\n";
    # grep in scalar context counts the matching lines;
    # \Q...\E quotes any regex metacharacters in the name.
    my $count = grep /\Q$basefile\E/, <IN>;
    close(IN);

    Happy hacking :)

    use perl;

Re: File download statistics parsing
by bean (Monk) on Aug 07, 2003 at 22:38 UTC
    I'm not as familiar with Squid logs as I used to be with Apache logs - is the numeric value before the TCP_MISS:DIRECT the amount downloaded, or the final byte offset of the range of bytes downloaded? If the latter is the case, your plan will work, and the logs you show here could mean that someone started a download and the download was interrupted and then resumed. If the former is the case, it means that the user started a download, the download failed, and the user started the download again from the beginning, succeeding the second time - it can't ever represent a resumed download, and you'll never count them. You'll also never count files downloaded using download accelerators, which request multiple byte ranges of a file in parallel.
      Never mind my previous post - I see that you know about the issue raised in it already. Guess I should have read all the replies first...

      However, although it won't help you analyse the current logs, if you could customize the logs to show the final byte offset of the range, you could count a complete file download when the final byte offset matches the size of the actual file. There would probably be exceptions (I'm sure there are some perverted user agents that ask for the end of the file first) but it would be as accurate as any other method and a lot easier (I'm all about easier). Unfortunately, my lack of familiarity with Squid keeps me from knowing whether it's possible to customize the logs this way.
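      If the logs could be customized that way, the check itself would be trivial; a sketch against a purely hypothetical format whose last field is the final byte offset served (fed on standard input):

      use strict;
      use warnings;

      # Hypothetical: basename => real size on disk (from stat, as elsewhere).
      my %size_of = ( 'MyFoo-file.zip' => 8224380 );

      my $completed = 0;
      while (my $line = <STDIN>) {
          # Hypothetical layout: ... "GET url HTTP/x.x" status bytes final_offset
          my ($file, $final_offset) =
              $line =~ m{"GET \S*/(\S+) HTTP/[\d.]+" \d+ \d+ (\d+)} or next;
          # If the offset is zero-based, the test would be size - 1 instead.
          $completed++ if exists $size_of{$file}
                      and $final_offset == $size_of{$file};
      }
      print "Completed downloads: $completed\n";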
