
topwebdiff - analyse the output of topweb

by grinder (Bishop)
on Sep 14, 2001 at 12:26 UTC ( [id://112378]=sourcecode )
Category: web stuff
Author/Contact Info grinder on perlmonks
Description: To make the best use of topweb snapshots, the idea is to generate the files day by day, and then run topwebdiff to pinpoint the ranking changes.

See also topweb - Squid access.log analyser.
#! /usr/bin/perl -w
# david landgren  14-may-2001

use strict;

my $first  = shift or die "No first (current) file specified on command line.\n";
my $second = shift or die "No second (previous) file specified on command line.\n";

# Index the previous snapshot by site name (the last field on each line).
my %site;

open IN, $second or die "Cannot open $second for input: $!\n";
while( <IN> ) {
        my @fields = split;
        $site{ $fields[-1] } = \@fields;
}
close IN;

# Walk the current snapshot and report how each site's rank has changed.
open IN, $first or die "Cannot open $first for input: $!\n";
while( <IN> ) {
        my ($rank, @fields) = split;
        local $" = "\t";
        if( defined $site{$fields[-1]} ) {
                my $prev = $site{ $fields[-1] }->[0];
                my $diff = $prev - $rank;
                my $desc = 0 == $diff ? '=' : $diff < 0 ? $diff : "+$diff";
                print "$rank\t$prev\t$desc\t@fields\n";
        }
        else {
                print "$rank\t-\tnew\t@fields\n";
        }
}
close IN;

=head1 NAME

topwebdiff -- analyse the output of successive runs of topweb

=head1 SYNOPSIS

B<topwebdiff> filespec.recent filespec.older

=head1 DESCRIPTION

Take the output of two runs of topweb, and create a report that shows how
sites have evolved between the two snapshots. This helps pinpoint sites
that suddenly suck up a dramatic amount of bandwidth.

=head1 EXAMPLES

C<topwebdiff tw.yyyymmd1 tw.yyyymmd2>

The output is equivalent to the output of C<topweb tw.yyyymmd1> with the
addition of two columns in the second and third place:

=over 4

=item *

rank 2 -- the rank of the same FQDN from the file tw.yyyymmd2, or '-' if
the FQDN does not appear in the second file.

=item *

delta -- the change in rank from the second file (the older snapshot) in
comparison with the first file (the newer snapshot): '=' if the rank is
unchanged, a signed number if it has moved, or 'new' if the FQDN does not
appear in the second file.

=back

An excerpt of the output from a sample data set is as follows. In this
example we see a site has jumped from 55th most visited site (in terms of
bytes transferred) to 27th.

 20 21 +1  5671  29919621  0.483%  25.064%
 21 20 -1  3532  27930698  0.451%  25.514%
 22 24 +2  11842 27849740  0.449%  25.964%
 23 22 -1  1807  25851714  0.417%  26.381%
 24 23 -1  4560  24280781  0.392%  26.773%
 25 26 +1  5326  24055482  0.388%  27.161%
 26 27 +1  3075  23879164  0.385%  27.546%
 27 55 +28 3943  199970
 28 30 +2  2313  19803044  0.320%  28.188% webm
 29 25 -4  1446  19699499  0.318%  28.506%
 30 28 -2  998   19288520  0.311%  28.817%

Just how important this jump is has to be weighed against the number of
log files used in generating the snapshot. In this instance, Squid is
configured to roll its logs over every 24 hours, and 10 logs are kept.
This means that the output from topweb (if run on all log files) will be
a rolling 10-day average.


=head1 COPYRIGHT

Copyright (c) 2001 David Landgren.

This script is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

=head1 AUTHOR

     David "grinder" Landgren
     grinder on perlmonks (
     eval {join chr(64) => qw[landgren]}
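
Since the point of the report is to spot sites that suddenly climb the ranking, a small filter over the topwebdiff output can help. What follows is only a sketch, not part of the original script: the sub name big_movers and the threshold of 10 places are assumptions made for illustration, and it assumes the tab-separated layout printed above (rank, previous rank, delta, then the topweb fields). The sample rows are taken from the excerpt in the documentation, with the incomplete row for rank 27 cut short after its fourth column.

```perl
#! /usr/bin/perl -w
use strict;

# Hypothetical helper: return the topwebdiff output lines whose delta
# column shows a climb of at least $threshold places, plus 'new' entries.
sub big_movers {
    my ($threshold, @lines) = @_;
    my @hits;
    for my $line (@lines) {
        # Columns: rank, previous rank, delta, then the topweb fields.
        my (undef, undef, $delta) = split /\t/, $line;
        next unless defined $delta;
        if( $delta eq 'new' ) {
            push @hits, $line;
        }
        elsif( $delta =~ /^\+(\d+)$/ and $1 >= $threshold ) {
            push @hits, $line;
        }
    }
    return @hits;
}

# Three rows from the sample excerpt, rebuilt with tab separators.
my @sample = (
    join( "\t", qw(25 26 +1 5326 24055482 0.388% 27.161%) ),
    join( "\t", qw(27 55 +28 3943) ),
    join( "\t", qw(29 25 -4 1446 19699499 0.318% 28.506%) ),
);

# Assumed cut-off of 10 places; only the row for rank 27 qualifies.
print "$_\n" for big_movers( 10, @sample );
```

To run it over a real report, replace @sample with lines read from the topwebdiff output (chomped first). The right threshold depends on how many log files went into each snapshot, for the reason given in the documentation above.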

