in reply to Help update the Phalanx 100
Procrastination is a joy in the week before Christmas...
I've taken a quick and dirty cut at it (please, no deductions for bad style or inefficiency). Like stvn said, the regex parts could use some work, but, frankly, there are so many special cases of how people name and number stuff that it's hard to cover everything. (Can anyone post what PAUSE and CPAN use internally?) The basic logic I use is this:
- After looking at the dates and user agents and running some basic counts on the number of hits by IP, I decided to simply drop as outliers any IPs in the top 1% of IPs by hit count. The cutoff winds up being about 64 hits over the period studied.
- Only process things ending in .tar.gz or .tgz, to avoid lots of non-distribution stuff (scripts, READMEs, .sig files, etc.)
- Strip off the .tar.gz or .tgz; strip off the distribution version. (An ugly hack of a regex, given all the variations people use, but s/(-|_)[0-9][0-9a-z._]+$// seems to work decently.)
- Strip a trailing .pm, more for style than anything else
- Find the 1% cutoff of IPs
- Count the number of distribution downloads (but only if the downloading IP isn't past the cutoff)
- Print the top 100
There's probably some additional, case-by-case cleanup that could be done (e.g., "perl" itself is in the top 100), but I think this is a decent start. Code and results follow.
#!/usr/bin/perl
use warnings;
use strict;

use Data::Hash::Totals;    # provides as_table()

# Parse the log: client IP, date, and the requested file name.
my @log;
while (my $line = <>) {
    next unless $line =~ m!(\S+)\s+(\S+)\s+\S*/(\S+)\s!;
    my ($ip, $date, $dist) = ($1, $2, $3);
    next unless $dist =~ s/\.(tar\.gz|tgz)$//;   # distributions only
    $dist =~ s/(-|_)[0-9][0-9a-z._]+$//;         # strip the version number
    $dist =~ s/\.pm$//;                          # strip a trailing .pm
    push @log, { ip => $ip, date => $date, dist => $dist };
}

# Count hits per IP and find the hit count at the 99th percentile of
# downloaders; IPs at or above it are dropped as outliers.
my %ip;
$ip{ $_->{ip} }++ for @log;
my $cut    = int( 0.01 * keys %ip );
my $cutoff = $ip{ [ sort { $ip{$b} <=> $ip{$a} } keys %ip ]->[$cut] };

# Tally distributions for everyone else
my %dist;
for my $line (@log) {
    $dist{ $line->{dist} }++ if $ip{ $line->{ip} } < $cutoff;
}

# Keep the 100 most-downloaded distributions and print them
my %top100;
$top100{$_} = $dist{$_}
    for grep { defined } ( sort { $dist{$b} <=> $dist{$a} } keys %dist )[ 0 .. 99 ];
print as_table(\%top100);
17596 Net_SSLeay
13732 DBD-mysql
11138 DBI
8226 perl-ldap
7542 Mail-SpamAssassin
5528 GD
5440 libwww-perl
4557 HTML-Parser
3865 Digest-SHA1
3449 Digest
3397 CGI
3260 MIME-Base64
2868 XML-Parser
2786 Digest-MD5
2635 DBD-Pg
2630 MIME-tools
2625 File-Scan
2530 Compress-Zlib
2236 URI
2173 Net-DNS
2136 Time-HiRes
2130 Archive-Tar
2001 Test-Simple
1904 Tk
1767 DateManip
1743 Digest-HMAC
1650 HTML-Tagset
1629 MailTools
1617 libnet
1540 Gtk-Perl
1476 DB_File
1470 Archive-Zip
1418 DBD-Oracle
1400 Msql-Mysql-modules
1286 Apache-ASP
1286 HTML-Template
1138 Template-Toolkit
1134 IO-stringy
1124 Apache-MP3
1109 mod_perl
1087 MD5
1008 Storable
998 Module-Build
995 Crypt-CBC
972 Net-Telnet
952 CPAN
918 XML-Writer
916 Date-Calc
908 IMAP-Admin
900 TimeDate
836 Convert-ASN1
829 AppConfig
817 IO-String
800 GDGraph
787 Net-SNMP
783 MIME-Lite
783 XML-Generator
782 BerkeleyDB
773 Curses
763 AcePerl
760 PathTools
757 TermReadKey
747 Crypt-SSLeay
726 Convert-TNEF
714 Zanas
703 ExtUtils-MakeMaker
691 IO-Socket-SSL
662 HTML-Mason
655 Test-Harness
653 XML-Simple
624 bioperl
616 DBIx-SQLEngine
608 IO-Zlib
603 PodParser
601 GDTextUtil
599 PerlMagick
597 Parallel-Pvm
596 SOAP-Lite
571 Authen-SASL
557 AxKit-App-TABOO
557 Spreadsheet-WriteExcel
553 Bit-Vector
553 Data-Dumper
544 Parse-RecDescent
542 App-Info
533 perl
529 DBD-ODBC
528 Net-Server
525 Authen-PAM
520 Crypt-DES
519 Config-Maker
514 Bio-Das
512 File-Tail
505 Excel-Template
502 Boulder
502 XML-LibXML
500 Mail-ClamAV
498 IOC
496 Event
485 Apache-Session
-xdg
Code posted by xdg on PerlMonks is public domain. It has no warranties, express or implied. Posted code may not have been tested. Use at your own risk.
Re^2: Help update the Phalanx 100
by stvn (Monsignor) on Dec 21, 2004 at 16:58 UTC
So I was giving this some thought on my commute this morning as I was stuck in traffic, and the list produced by your code actually has helped convince me even more. I think that there is a problem with ignoring the revision number.
As I looked over your list, I noticed at the bottom a module of mine, IOC. Now, rather than be flattered by this, I know that its height on the list is quite artificial. I decided to adopt the XP idea of "release early and release often" with this module. The first released version (0.01) was on Oct. 15th of this year, and there have been 18 subsequent versions released, the last one on the 15th of Dec., and at least 10 versions have been released within the range of this log file (Nov. 1 - Dec. 15th). When I ran my script (see above) like this:
grep 'IOC' ~/Desktop/cpan-gets | perl test.pl
I got the following output:
+---------------------------------------
| Total Downloads by Module
+------+--------------------------------
| 609 | IOC
+---------------------------------------
| Total Downloads by Distro
+------+--------------------------------
| 10 | IOC-0.06.tar.gz
| 10 | IOC-0.01.tar.gz
| 10 | IOC-0.17.tar.gz
| 10 | IOC-0.03.tar.gz
| 10 | IOC-0.04.tar.gz
| 10 | IOC-0.05.tar.gz
| 11 | IOC-0.02.tar.gz
| 18 | IOC-0.07.tar.gz
| 44 | IOC-0.09.tar.gz
| 46 | IOC-0.13.tar.gz
| 50 | IOC-0.12.tar.gz
| 52 | IOC-0.14.tar.gz
| 54 | IOC-0.10.tar.gz
| 59 | IOC-0.08.tar.gz
| 64 | IOC-0.15.tar.gz
| 66 | IOC-0.11.tar.gz
| 85 | IOC-0.16.tar.gz
+------+--------------------------------
Clearly this module is not one of the top 100 on CPAN.
I think we need to give some thought as to how to include revisions in the analysis. My first thought is to take the number of revisions found in the log and use it to somehow weight the results: the more revisions, the less weight, basically. Another thought is to somehow account for the number of downloads per revision. As I mentioned above, the fact that each revision is being downloaded shows that someone is following the development of the module, and that should be taken into account.
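A rough sketch of the first idea (with made-up inputs; in practice %count and %revs would be tallied from the log while parsing):

use strict;
use warnings;

# Hypothetical inputs: raw downloads and number of versions seen in the
# log per distribution. IOC's figures come from the table above; DBI's
# revision count is invented purely for illustration.
my %count = ( IOC => 609, DBI => 11138 );
my %revs  = ( IOC => 17,  DBI => 3 );

# The "more revisions, less weight" idea: divide each raw count by the
# number of releases, giving average downloads per release.
my %weighted;
for my $dist (keys %count) {
    my $nrevs = $revs{$dist} || 1;   # guard against a missing count
    $weighted{$dist} = $count{$dist} / $nrevs;
}

printf "%-6s %.0f\n", $_, $weighted{$_}
    for sort { $weighted{$b} <=> $weighted{$a} } keys %weighted;

A gentler penalty, such as dividing by sqrt($nrevs), would punish frequent releasers less severely.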
In the end I agree, this is going to be a mixture of art and science to come up with these top 100.
That's a very good point and a great example. I tried two more cuts. One is top 100 by average number of downloads per revision. The second is based on the vector sum (sqrt(x**2 + y**2)) of total downloads and average per revision. (Technically, I took the log of the total to flatten the skew, and normalized both metrics to a maximum of 100 before taking the vector sum). That latter one is probably pretty good -- it accounts for both criteria. Depending on one's bias, one could weight the two factors differently in the sum.
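Here's a minimal sketch of how that combined score could be computed (illustrative inputs only; the real code is in the repository mentioned below):

use strict;
use warnings;
use List::Util qw(max);

# Hypothetical inputs: total downloads and average downloads per
# revision for each distribution, as would be tallied from the log.
my %total = ( DBI => 11138, IOC => 609 );
my %avg   = ( DBI => 11138 / 3, IOC => 609 / 17 );

# Take the log of the raw totals to flatten the skew, normalize both
# metrics to a maximum of 100, then combine them as a vector sum.
my %logtotal = map { $_ => log $total{$_} } keys %total;
my $max_log  = max values %logtotal;
my $max_avg  = max values %avg;

my %score;
for my $dist (keys %total) {
    my $x = 100 * $logtotal{$dist} / $max_log;   # normalized log(total)
    my $y = 100 * $avg{$dist} / $max_avg;        # normalized avg/revision
    $score{$dist} = sqrt( $x**2 + $y**2 );       # the vector sum
}

printf "%-6s %.1f\n", $_, $score{$_}
    for sort { $score{$b} <=> $score{$a} } keys %score;

Weighting $x and $y differently before taking the square root is where one's bias would come in.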
Results follow. Code for each of my three variations is available from my subversion repository.
-xdg
For the simple reason that I KNOW Excel::Template can't be in any top-100 list, your second algorithm has to be wrong. :-)
Being right does not endow the right to be rude; politeness costs nothing. Being unknowing is not the same as being stupid. Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence. Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.
[~/perl/P100]$ cat rank-modules
#!/usr/bin/perl
use warnings;
use strict;
# Read CPAN module-download logs; find the most popular modules.

###

# Number of modules to list:
sub NrToPrint() {100}

# Any address that pulls more than MaxDownloadsPerDay modules in any one day
# has all its traffic ignored:
sub MaxDownloadsPerDay() {100}

# Exclude downloads from agents matching this regex, because they seem to be
# related to mirroring or crawling rather than genuine downloads:
my $rx_agent_ignore = qr/
    \. google \. |
    \. yahoo \. |
#   \b LWP::Simple \b |
    \b MS\ Search \b |
    \b Webmin \b |
    \b Wget \b |
    \b teoma \b
/x;

# First pass: build a hash of all client addresses that have downloaded more
# than MaxDownloadsPerDay modules in any one day:
my %bigusers;
sub find_big_users($) {
    my $fh = $_[0];
    seek $fh, 0, 0 or die "Can't rewind the input file:\n$!\n";
    print STDERR "Finding heavy users...\n";
    my %hpd; # hits per day: $hpd{client}{date} = number of hits
    while (<$fh>) {
        my ($client, $date) = m/
            ^ ( \d+ )
            \s+
            ( [^:]+ )
        /x or next;
        # $hpd{$client}{$date} ||= 0;
        ++ $hpd{$client}{$date};
    }
    CLIENT:
    while (my ($client, $rdatehash) = each %hpd) {
        while (my ($date, $count) = each %$rdatehash) {
            undef $bigusers{$client}, next CLIENT
                if $count > MaxDownloadsPerDay;
        }
    }
}

# Second pass: ignoring traffic from heavy clients and robotic user agents,
# build a hash indexed by date, module and version and yielding a count of
# downloads:
my $rx_parse = qr!
    ^
    ( \d+ )                        # Get client ID
    \s
    ( [^:]+ )                      # Get date
    \S+ \s                         # Skip time
    / \S+ /                        # Skip directory
    ( \w \S*? )                    # Get module name
    -                              # Skip delimiter
    ( (?: (?> \d [^.]* ) \.? )+ )  # Get version number
    \. \S+ \s                      # Skip file-type suffix
    " ( .* ) "                     # Get user agent
!x;

my $rawdownloads = 0;
my $igbig = 0;
my $igagent = 0;
my $nrlines;

sub count_downloads($) {
    my $fh = $_[0];
    seek $fh, 0, 0 or die "Can't rewind the input file:\n$!\n";
    print STDERR "Counting downloads...\n";
    my %details;
    while (<$fh>) {
        my ($client, $date, $module, $version, $agent) = /$rx_parse/o
            or next;
        # print;
        # print "Mod $module, ver $version\n";
        ++$rawdownloads;
        ++$igbig, next if exists $bigusers{$client};
        ++$igagent, next if $agent =~ $rx_agent_ignore;
        ++ $details{$date}{$module}{$version};
    }
    $nrlines = $.;
    \%details;
}

# Third pass: if multiple versions of the same module have been requested on
# the same day, ignore all but the most popular version for that day.  This
# avoids giving extra weight to modules with many historical versions if a
# client downloads all of them.  Produce a hash of gross counts per module:
my $filtereddownloads = 0;
sub condense_multiple_versions($) {
    my $rdetails = $_[0];
    print STDERR "Analysing...\n";
    my %grosscounts;
    while (my ($date, $rmodhash) = each %$rdetails) {
        while (my ($module, $rverhash) = each %$rmodhash) {
            my @vercounts = sort {$a <=> $b} values %$rverhash;
            $grosscounts{$module} += $vercounts[-1];
            $filtereddownloads += $vercounts[-1];
        }
    }
    \%grosscounts;
}

# Print the module counts and names in descending order of popularity:
sub print_results($) {
    print STDERR
        "Using $filtereddownloads out of $rawdownloads downloads on $nrlines lines.\n",
        "Skipped $igbig from heavy users and a further $igagent apparently from robots.\n\n";
    my $rcounts = $_[0];
    my @sorted = sort {$rcounts->{$b} <=> $rcounts->{$a}} keys %$rcounts;
    print map {sprintf "%-8d%s\n", $rcounts->{$_}, $_}
        @sorted[0 .. NrToPrint - 1];
}

sub main() {
    die "$0 <filename>\n" unless @ARGV == 1;
    my $infile = shift @ARGV;
    open my $fh, "<$infile" or die "Can't open $infile:\n$!\n";
    find_big_users $fh;
    print_results
        condense_multiple_versions
        count_downloads $fh;
}

main;
[~/perl/P100]$ ./rank-modules cpan-gets
Finding heavy users...
Counting downloads...
Analysing...
Using 104411 out of 1067155 downloads on 2328070 lines.
Skipped 767228 downloads from heavy users and a further 177523 apparently from robots.
2745 DBI
2312 File-Scan
1703 DBD-mysql
1219 XML-Parser
1202 HTML-Parser
1034 libwww-perl
984 GD
944 Gtk-Perl
880 Net_SSLeay.pm
859 Tk
827 DBD-Oracle
793 MIME-Base64
756 URI
751 Apache-ASP
746 Compress-Zlib
654 dmake
643 HTML-Template
640 Digest-MD5
602 Time-HiRes
592 Digest-SHA1
587 Archive-Tar
584 Net-Telnet
577 Template-Toolkit
548 Parallel-Pvm
540 XML-Writer
477 Archive-Zip
467 HTML-Tagset
464 libnet
437 Digest
406 AppConfig
401 MIME-tools
385 MailTools
359 Storable
356 Date-Calc
346 Msql-Mysql-modules
339 Test-Simple
338 CGI.pm
324 Module-Build
320 Spreadsheet-WriteExcel
318 SiePerl
317 perl-ldap
316 Net-DNS
314 DB_File
312 PAR
310 CPAN
310 TermReadKey
297 XML-Simple
297 IO-String
292 TimeDate
291 GDGraph
289 MIME-Lite
287 IO-stringy
287 Crypt-SSLeay
284 Curses
282 DBD-DB2
278 calendar
278 DateManip
277 Net-SNMP
274 Zanas
271 IMAP-Admin
270 MD5
268 ssltunnel
258 sms
257 Digest-HMAC
255 GDTextUtil
252 DBD-ODBC
252 DBD-Pg
245 gmailarchiver
245 IO-Socket-SSL
240 Data-Dumper
239 Mail-Sendmail
232 IOC
225 OLE-Storage_Lite
223 keywordsearch
217 ExtUtils-MakeMaker
206 XML-SAX
205 reboot
200 chres
199 Convert-ASN1
196 App-Info
196 Event
194 CGIscriptor
189 linkcheck
187 Test-Harness
184 glynx
184 Verilog-Perl
181 XLinks
180 Bit-Vector
179 mod_perl
178 SOAP-Lite
176 Expect
174 XML-DOM
174 MARC-Detrans
174 DBD-Sybase
173 Mail-SpamAssassin
172 Excel-Template
172 check_ftp
172 Compress-Zlib-Perl
171 Parse-RecDescent
171 Carp-Clan
[~/perl/P100]$
Update 23 Dec 2004:
I have:
- removed LWP::Simple from the list of ignorable user agents at stvn's suggestion,
- updated the results listing, and
- removed a fantastically noisy debugging statement that I inadvertently left in. (Apologies to anyone who ran the script and got barraged with raw data.)
Markus
# Exclude downloads from agents matching this regex, because they seem to be
# related to mirroring or crawling rather than genuine downloads:
my $rx_agent_ignore = qr/
\. google \. |
\. yahoo \. |
\b LWP::Simple \b |
\b MS\ Search \b |
\b Webmin \b |
\b Wget \b |
\b teoma \b
/x;
Markus, I may be wrong, but I think that CPAN.pm sometimes uses LWP::Simple to download modules, so excluding it would not be a good idea, even though there is a good chance it could also be a spider.
Re^2: Help update the Phalanx 100
by petdance (Parson) on Dec 21, 2004 at 17:10 UTC
Wow, excluding the top 1% of downloaders. Brilliant.
Something else to think about: I'm not concerned about absolute rankings so much as developing strata, as in http://qa.perl.org/phalanx/distros.html. I see this sort of like those "Greatest Albums Of All Time" lists. Maybe you can argue about whether Let It Bleed should come before or after Abbey Road, but both belong in the top 10, well before Pleased To Meet Me or Sign O' The Times.
In my occasional noodlings on this topic, I've wondered what the dependency graph looks like. Which modules are most frequently used in other modules? (Including the recursion -- if A uses B and B uses C, D, and E, then the existence of A should increment the dependency count of C, D, and E, too.) Presumably, core modules would have the most links, but there is likely a second stratum of heavily used utility modules, and so on out to narrow, single-purpose modules for particular applications. (Though those could also be very popular and worthy of inclusion in a top 100 list.)
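A toy sketch of that transitive counting (the dependency map here is made up; real data might come from distribution metadata):

use strict;
use warnings;

# Toy dependency map: each distribution lists its direct dependencies.
my %deps = (
    A => [qw(B)],
    B => [qw(C D E)],
);

# Walk each distribution's dependencies transitively, so A's existence
# also increments the counts of C, D, and E via B.
my %depended_on;
for my $dist (keys %deps) {
    my %seen;
    my @queue = @{ $deps{$dist} };
    while ( my $d = shift @queue ) {
        next if $seen{$d}++;                 # count each dep once per dist
        $depended_on{$d}++;                  # $dist (transitively) uses $d
        push @queue, @{ $deps{$d} || [] };   # leaves have no entry
    }
}

printf "%s %d\n", $_, $depended_on{$_} for sort keys %depended_on;
# Prints: B 1, C 2, D 2, E 2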
-xdg