I knocked up a quick program after reading Andy's comments on use.perl, and before finding this thread. It uses three ideas to eliminate the noise: it ignores all clients that download more than a hundred modules in a day; it ignores certain user agents that look like spiders; and it looks at all versions of a module downloaded each day, and ignores all but the most popular version. This last check is meant to filter out clients that download every version of a module within a few minutes.
<p>
What the program doesn't do yet is mark standard (core) modules; a possible approach is sketched below.
<p>
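If and when I add that, it'll probably be via Module::CoreList. Something like the untested sketch below ought to do it, though the hyphen-to-double-colon mapping from distribution names to module names is only a rough heuristic and misses oddly named distributions such as libwww-perl and CGI.pm.
<p>
<code>
# Untested sketch: flag distributions whose lead module ships with the Perl
# core, using Module::CoreList.  The s/-/::/g mapping is only a heuristic.
use Module::CoreList;

sub is_core_dist($) {
    my $dist = $_[0];
    (my $module = $dist) =~ s/-/::/g;    # e.g. Time-HiRes -> Time::HiRes
    return defined Module::CoreList::first_release($module);
}

# print_results could then tag core modules, something like:
#   printf "%-8d%s%s\n", $rcounts->{$_}, $_,
#          is_core_dist($_) ? '   [core]' : '';
</code>
<p>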
Here are the code and results.
<p>
<code>
[~/perl/P100]$ cat rank-modules
#!/usr/bin/perl

use warnings;
use strict;

# Read CPAN module-download logs; find the most popular modules.

###

# Number of modules to list:
sub NrToPrint() {100}

# Any address that pulls more than MaxDownloadsPerDay modules in any one day
# has all its traffic ignored:
sub MaxDownloadsPerDay() {100}

# Exclude downloads from agents matching this regex, because they seem to be
# related to mirroring or crawling rather than genuine downloads:
my $rx_agent_ignore = qr/
    \. google \.      |
    \. yahoo \.       |
#   \b LWP::Simple \b |
    \b MS\ Search \b  |
    \b Webmin \b      |
    \b Wget \b        |
    \b teoma \b
/x;

# First pass: build a hash of all client addresses that have downloaded more
# than MaxDownloadsPerDay modules in any one day:
my %bigusers;

sub find_big_users($) {
    my $fh = $_[0];
    seek $fh, 0, 0 or die "Can't rewind the input file:\n$!\n";
    print STDERR "Finding heavy users...\n";
    my %hpd;    # hits per day: $hpd{client}{date} = number of hits
    while (<$fh>) {
        my ($client, $date) = m/
            ^ ( \d+ )       # Get client ID
            \s+
            ( [^:]+ )       # Get date
        /x or next;
        # $hpd{$client}{$date} ||= 0;
        ++ $hpd{$client}{$date};
    }
    CLIENT:
    while (my ($client, $rdatehash) = each %hpd) {
        while (my ($date, $count) = each %$rdatehash) {
            undef $bigusers{$client}, next CLIENT if $count > MaxDownloadsPerDay;
        }
    }
}

# Second pass: ignoring traffic from heavy clients and robotic user agents,
# build a hash indexed by date, module and version and yielding a count of
# downloads:
my $rx_parse = qr!
    ^
    ( \d+ )                         # Get client ID
    \s
    ( [^:]+ )                       # Get date
    \S+ \s                          # Skip time
    / \S+ /                         # Skip directory
    ( \w \S*? )                     # Get module name
    -                               # Skip delimiter
    ( (?: (?> \d [^.]* ) \.? )+ )   # Get version number
    \. \S+ \s                       # Skip file-type suffix
    " ( .* ) "                      # Get user agent
!x;

my $rawdownloads = 0;
my $igbig = 0;
my $igagent = 0;
my $nrlines;

sub count_downloads($) {
    my $fh = $_[0];
    seek $fh, 0, 0 or die "Can't rewind the input file:\n$!\n";
    print STDERR "Counting downloads...\n";
    my %details;
    while (<$fh>) {
        my ($client, $date, $module, $version, $agent) = /$rx_parse/o
            or next;
        # print;
        # print "Mod $module, ver $version\n";
        ++$rawdownloads;
        ++$igbig,   next if exists $bigusers{$client};
        ++$igagent, next if $agent =~ $rx_agent_ignore;
        ++ $details{$date}{$module}{$version};
    }
    $nrlines = $.;
    \%details;
}

# Third pass: if multiple versions of the same module have been requested on
# the same day, ignore all but the most popular version for that day.  This
# avoids giving extra weight to modules with many historical versions if a
# client downloads all of them.  Produce a hash mapping each module to its
# total download count:
my $filtereddownloads = 0;

sub condense_multiple_versions($) {
    my $rdetails = $_[0];
    print STDERR "Analysing...\n";
    my %grosscounts;
    while (my ($date, $rmodhash) = each %$rdetails) {
        while (my ($module, $rverhash) = each %$rmodhash) {
            my @vercounts = sort {$a <=> $b} values %$rverhash;
            $grosscounts{$module}  += $vercounts[-1];
            $filtereddownloads     += $vercounts[-1];
        }
    }
    \%grosscounts;
}

# Print the module counts and names in descending order of popularity:
sub print_results($) {
    print STDERR "Using $filtereddownloads out of $rawdownloads downloads on $nrlines lines.\n",
                 "Skipped $igbig downloads from heavy users and a further $igagent apparently from robots.\n\n";
    my $rcounts = $_[0];
    my @sorted = sort {$rcounts->{$b} <=> $rcounts->{$a}} keys %$rcounts;
    print map {sprintf "%-8d%s\n", $rcounts->{$_}, $_}
          @sorted[0 .. NrToPrint - 1];
}

sub main() {
    die "$0 <filename>\n" unless @ARGV == 1;
    my $infile = shift @ARGV;
    open my $fh, "<$infile" or die "Can't open $infile:\n$!\n";
    find_big_users $fh;
    print_results
        condense_multiple_versions
            count_downloads $fh;
}

main;
[~/perl/P100]$ ./rank-modules cpan-gets
Finding heavy users...
Counting downloads...
Analysing...
Using 104411 out of 1067155 downloads on 2328070 lines.
Skipped 767228 downloads from heavy users and a further 177523 apparently from robots.
2745 DBI
2312 File-Scan
1703 DBD-mysql
1219 XML-Parser
1202 HTML-Parser
1034 libwww-perl
984 GD
944 Gtk-Perl
880 Net_SSLeay.pm
859 Tk
827 DBD-Oracle
793 MIME-Base64
756 URI
751 Apache-ASP
746 Compress-Zlib
654 dmake
643 HTML-Template
640 Digest-MD5
602 Time-HiRes
592 Digest-SHA1
587 Archive-Tar
584 Net-Telnet
577 Template-Toolkit
548 Parallel-Pvm
540 XML-Writer
477 Archive-Zip
467 HTML-Tagset
464 libnet
437 Digest
406 AppConfig
401 MIME-tools
385 MailTools
359 Storable
356 Date-Calc
346 Msql-Mysql-modules
339 Test-Simple
338 CGI.pm
324 Module-Build
320 Spreadsheet-WriteExcel
318 SiePerl
317 perl-ldap
316 Net-DNS
314 DB_File
312 PAR
310 CPAN
310 TermReadKey
297 XML-Simple
297 IO-String
292 TimeDate
291 GDGraph
289 MIME-Lite
287 IO-stringy
287 Crypt-SSLeay
284 Curses
282 DBD-DB2
278 calendar
278 DateManip
277 Net-SNMP
274 Zanas
271 IMAP-Admin
270 MD5
268 ssltunnel
258 sms
257 Digest-HMAC
255 GDTextUtil
252 DBD-ODBC
252 DBD-Pg
245 gmailarchiver
245 IO-Socket-SSL
240 Data-Dumper
239 Mail-Sendmail
232 IOC
225 OLE-Storage_Lite
223 keywordsearch
217 ExtUtils-MakeMaker
206 XML-SAX
205 reboot
200 chres
199 Convert-ASN1
196 App-Info
196 Event
194 CGIscriptor
189 linkcheck
187 Test-Harness
184 glynx
184 Verilog-Perl
181 XLinks
180 Bit-Vector
179 mod_perl
178 SOAP-Lite
176 Expect
174 XML-DOM
174 MARC-Detrans
174 DBD-Sybase
173 Mail-SpamAssassin
172 Excel-Template
172 check_ftp
172 Compress-Zlib-Perl
171 Parse-RecDescent
171 Carp-Clan
[~/perl/P100]$
</code>
<p>
<b>Update 23 Dec 2004:</b>
<p>
I have:
<p>
<ul>
<li>removed LWP::Simple from the list of ignorable user agents at stvn's suggestion,
<li>updated the results listing, and
<li>removed a fantastically noisy debugging statement that I inadvertently left in. (Apologies to anyone who ran the script and got barraged with raw data.)
</ul>
<p>
Markus