in reply to Help update the Phalanx 100
Procrastination is a joy in the week before Christmas...
I've taken a quick and dirty cut at it (please, no deductions for bad style or inefficiency). Like stvn said, the regex parts could use some work, but, frankly, there are so many special cases of how people name and number stuff that it's hard to cover everything. (Can anyone post what PAUSE and CPAN use internally?) The basic logic I use is this:
- After looking at the dates and user agents and running some basic counts on the number of hits by IP, I decided to simply drop as outliers any IPs in the top 1% of IPs by hit count. The cutoff winds up being about 64 hits over the period studied.
- Only process things ending in .tar.gz or .tgz, to avoid lots of non-distribution stuff (scripts, READMEs, .sig files, etc.)
- Strip off the .tar.gz or .tgz; strip off the distribution version. (An ugly hack of a regex, given all the variations people use, but s/(-|_)[0-9][0-9a-z._]+$// seems to work decently.)
- Strip a trailing .pm, more for style than anything else
- Find the 1% cutoff of IPs
- Count the number of distribution downloads (but only if the downloading IP isn't past the cutoff)
- Print the top 100
There's probably some additional, case-by-case cleanup that could be done (e.g., "perl" itself is in the top 100), but I think this is a decent start. Code and results follow.
#!/usr/bin/perl
use warnings;
use strict;

use Data::Hash::Totals;    # provides as_table()

# Parse the log: client IP, date, and the requested file name.
my @log;
while (my $line = <>) {
    next unless $line =~ m!(\S+)\s+(\S+)\s+\S*/(\S+)\s!;
    my ($ip, $date, $dist) = ($1, $2, $3);
    next unless $dist =~ s/\.(tar\.gz|tgz)$//;   # distributions only
    $dist =~ s/(-|_)[0-9][0-9a-z._]+$//;         # strip the version number
    $dist =~ s/\.pm$//;                          # strip a trailing .pm
    push @log, { ip => $ip, date => $date, dist => $dist };
}

# Count hits per IP and find the hit count at the 99th percentile of
# downloaders; IPs at or above it are dropped as outliers.
my %ip;
$ip{ $_->{ip} }++ for @log;
my $cut    = int( 0.01 * keys %ip );
my $cutoff = $ip{ [ sort { $ip{$b} <=> $ip{$a} } keys %ip ]->[$cut] };

# Tally distributions for everyone else
my %dist;
for my $line (@log) {
    $dist{ $line->{dist} }++ if $ip{ $line->{ip} } < $cutoff;
}

# Keep the 100 most-downloaded distributions and print them
my %top100;
$top100{$_} = $dist{$_}
    for grep { defined } ( sort { $dist{$b} <=> $dist{$a} } keys %dist )[ 0 .. 99 ];
print as_table(\%top100);
17596 Net_SSLeay
13732 DBD-mysql
11138 DBI
8226 perl-ldap
7542 Mail-SpamAssassin
5528 GD
5440 libwww-perl
4557 HTML-Parser
3865 Digest-SHA1
3449 Digest
3397 CGI
3260 MIME-Base64
2868 XML-Parser
2786 Digest-MD5
2635 DBD-Pg
2630 MIME-tools
2625 File-Scan
2530 Compress-Zlib
2236 URI
2173 Net-DNS
2136 Time-HiRes
2130 Archive-Tar
2001 Test-Simple
1904 Tk
1767 DateManip
1743 Digest-HMAC
1650 HTML-Tagset
1629 MailTools
1617 libnet
1540 Gtk-Perl
1476 DB_File
1470 Archive-Zip
1418 DBD-Oracle
1400 Msql-Mysql-modules
1286 Apache-ASP
1286 HTML-Template
1138 Template-Toolkit
1134 IO-stringy
1124 Apache-MP3
1109 mod_perl
1087 MD5
1008 Storable
998 Module-Build
995 Crypt-CBC
972 Net-Telnet
952 CPAN
918 XML-Writer
916 Date-Calc
908 IMAP-Admin
900 TimeDate
836 Convert-ASN1
829 AppConfig
817 IO-String
800 GDGraph
787 Net-SNMP
783 MIME-Lite
783 XML-Generator
782 BerkeleyDB
773 Curses
763 AcePerl
760 PathTools
757 TermReadKey
747 Crypt-SSLeay
726 Convert-TNEF
714 Zanas
703 ExtUtils-MakeMaker
691 IO-Socket-SSL
662 HTML-Mason
655 Test-Harness
653 XML-Simple
624 bioperl
616 DBIx-SQLEngine
608 IO-Zlib
603 PodParser
601 GDTextUtil
599 PerlMagick
597 Parallel-Pvm
596 SOAP-Lite
571 Authen-SASL
557 AxKit-App-TABOO
557 Spreadsheet-WriteExcel
553 Bit-Vector
553 Data-Dumper
544 Parse-RecDescent
542 App-Info
533 perl
529 DBD-ODBC
528 Net-Server
525 Authen-PAM
520 Crypt-DES
519 Config-Maker
514 Bio-Das
512 File-Tail
505 Excel-Template
502 Boulder
502 XML-LibXML
500 Mail-ClamAV
498 IOC
496 Event
485 Apache-Session
-xdg
Code posted by xdg on PerlMonks is public domain. It has no warranties, express or implied. Posted code may not have been tested. Use at your own risk.
Re^2: Help update the Phalanx 100
by stvn (Monsignor) on Dec 21, 2004 at 16:58 UTC
So I was giving this some thought on my commute this morning as I was stuck in traffic, and the list produced by your code actually has helped convince me even more. I think that there is a problem with ignoring the revision number.
As I looked over your list, I noticed at the bottom a module of mine, IOC. Now, rather than be flattered by this, I know that its height on the list is quite artificial. I decided to adopt the XP idea of "release early and release often" with this module. The first released version (0.01) was on Oct. 15th of this year, and there have been 18 subsequent versions released, the last one on the 15th of Dec., and at least 10 versions have been released within the range of this log file (Nov. 1 - Dec. 15th). When I ran my script (see above) like this:
grep 'IOC' ~/Desktop/cpan-gets | perl test.pl
I got the following output:
+---------------------------------------
| Total Downloads by Module
+------+--------------------------------
| 609 | IOC
+---------------------------------------
| Total Downloads by Distro
+------+--------------------------------
| 10 | IOC-0.06.tar.gz
| 10 | IOC-0.01.tar.gz
| 10 | IOC-0.17.tar.gz
| 10 | IOC-0.03.tar.gz
| 10 | IOC-0.04.tar.gz
| 10 | IOC-0.05.tar.gz
| 11 | IOC-0.02.tar.gz
| 18 | IOC-0.07.tar.gz
| 44 | IOC-0.09.tar.gz
| 46 | IOC-0.13.tar.gz
| 50 | IOC-0.12.tar.gz
| 52 | IOC-0.14.tar.gz
| 54 | IOC-0.10.tar.gz
| 59 | IOC-0.08.tar.gz
| 64 | IOC-0.15.tar.gz
| 66 | IOC-0.11.tar.gz
| 85 | IOC-0.16.tar.gz
+------+--------------------------------
Clearly this module is not one of the top 100 on CPAN.
I think we need to give some thought as to how to include revisions in the analysis. My first thought is to take the number of revisions found in the log and use it to somehow weight the results: the more revisions, the less weight, basically. Another thought is to somehow account for the number of downloads per revision. As I mentioned above, the fact that each revision is being downloaded shows that someone is following the development of the module, and that should be taken into account.
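A rough sketch of the first idea (with made-up inputs; in practice %count and %revs would be tallied from the log while parsing):

use strict;
use warnings;

# Hypothetical inputs: raw downloads and number of versions seen in the
# log per distribution. IOC's figures come from the table above; DBI's
# revision count is invented purely for illustration.
my %count = ( IOC => 609, DBI => 11138 );
my %revs  = ( IOC => 17,  DBI => 3 );

# The "more revisions, less weight" idea: divide each raw count by the
# number of releases, giving average downloads per release.
my %weighted;
for my $dist (keys %count) {
    my $nrevs = $revs{$dist} || 1;   # guard against a missing count
    $weighted{$dist} = $count{$dist} / $nrevs;
}

printf "%-6s %.0f\n", $_, $weighted{$_}
    for sort { $weighted{$b} <=> $weighted{$a} } keys %weighted;

A gentler penalty, such as dividing by sqrt($nrevs), would punish frequent releasers less severely.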
In the end I agree, this is going to be a mixture of art and science to come up with these top 100.
That's a very good point and a great example. I tried two more cuts. One is top 100 by average number of downloads per revision. The second is based on the vector sum (sqrt(x**2 + y**2)) of total downloads and average per revision. (Technically, I took the log of the total to flatten the skew, and normalized both metrics to a maximum of 100 before taking the vector sum). That latter one is probably pretty good -- it accounts for both criteria. Depending on one's bias, one could weight the two factors differently in the sum.
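Here's a minimal sketch of how that combined score could be computed (illustrative inputs only; the real code is in the repository mentioned below):

use strict;
use warnings;
use List::Util qw(max);

# Hypothetical inputs: total downloads and average downloads per
# revision for each distribution, as would be tallied from the log.
my %total = ( DBI => 11138, IOC => 609 );
my %avg   = ( DBI => 11138 / 3, IOC => 609 / 17 );

# Take the log of the raw totals to flatten the skew, normalize both
# metrics to a maximum of 100, then combine them as a vector sum.
my %logtotal = map { $_ => log $total{$_} } keys %total;
my $max_log  = max values %logtotal;
my $max_avg  = max values %avg;

my %score;
for my $dist (keys %total) {
    my $x = 100 * $logtotal{$dist} / $max_log;   # normalized log(total)
    my $y = 100 * $avg{$dist} / $max_avg;        # normalized avg/revision
    $score{$dist} = sqrt( $x**2 + $y**2 );       # the vector sum
}

printf "%-6s %.1f\n", $_, $score{$_}
    for sort { $score{$b} <=> $score{$a} } keys %score;

Weighting $x and $y differently before taking the square root is where one's bias would come in.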
Results follow. Code for each of my three variations is available from my subversion repository.
-xdg
For the simple reason that I KNOW Excel::Template can't be in any top-100 list, your second algorithm has to be wrong. :-)
Being right does not endow the right to be rude; politeness costs nothing. Being unknowing is not the same as being stupid. Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence. Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.
[~/perl/P100]$ cat rank-modules
#!/usr/bin/perl
use warnings;
use strict;
# Read CPAN module-download logs; find the most popular modules.

###

# Number of modules to list:
sub NrToPrint() {100}

# Any address that pulls more than MaxDownloadsPerDay modules in any one day
# has all its traffic ignored:
sub MaxDownloadsPerDay() {100}

# Exclude downloads from agents matching this regex, because they seem to be
# related to mirroring or crawling rather than genuine downloads:
my $rx_agent_ignore = qr/
    \. google \. |
    \. yahoo \. |
#   \b LWP::Simple \b |
    \b MS\ Search \b |
    \b Webmin \b |
    \b Wget \b |
    \b teoma \b
/x;

# First pass: build a hash of all client addresses that have downloaded more
# than MaxDownloadsPerDay modules in any one day:
my %bigusers;
sub find_big_users($) {
    my $fh = $_[0];
    seek $fh, 0, 0 or die "Can't rewind the input file:\n$!\n";
    print STDERR "Finding heavy users...\n";
    my %hpd; # hits per day: $hpd{client}{date} = number of hits
    while (<$fh>) {
        my ($client, $date) = m/
            ^ ( \d+ )
            \s+
            ( [^:]+ )
        /x or next;
        # $hpd{$client}{$date} ||= 0;
        ++ $hpd{$client}{$date};
    }
    CLIENT:
    while (my ($client, $rdatehash) = each %hpd) {
        while (my ($date, $count) = each %$rdatehash) {
            undef $bigusers{$client}, next CLIENT
                if $count > MaxDownloadsPerDay;
        }
    }
}

# Second pass: ignoring traffic from heavy clients and robotic user agents,
# build a hash indexed by date, module and version and yielding a count of
# downloads:
my $rx_parse = qr!
    ^
    ( \d+ )                        # Get client ID
    \s
    ( [^:]+ )                      # Get date
    \S+ \s                         # Skip time
    / \S+ /                        # Skip directory
    ( \w \S*? )                    # Get module name
    -                              # Skip delimiter
    ( (?: (?> \d [^.]* ) \.? )+ )  # Get version number
    \. \S+ \s                      # Skip file-type suffix
    " ( .* ) "                     # Get user agent
!x;

my $rawdownloads = 0;
my $igbig = 0;
my $igagent = 0;
my $nrlines;

sub count_downloads($) {
    my $fh = $_[0];
    seek $fh, 0, 0 or die "Can't rewind the input file:\n$!\n";
    print STDERR "Counting downloads...\n";
    my %details;
    while (<$fh>) {
        my ($client, $date, $module, $version, $agent) = /$rx_parse/o
            or next;
        # print;
        # print "Mod $module, ver $version\n";
        ++$rawdownloads;
        ++$igbig, next if exists $bigusers{$client};
        ++$igagent, next if $agent =~ $rx_agent_ignore;
        ++ $details{$date}{$module}{$version};
    }
    $nrlines = $.;
    \%details;
}

# Third pass: if multiple versions of the same module have been requested on
# the same day, ignore all but the most popular version for that day.  This
# avoids giving extra weight to modules with many historical versions if a
# client downloads all of them.  Produce a hash of gross counts per module:
my $filtereddownloads = 0;
sub condense_multiple_versions($) {
    my $rdetails = $_[0];
    print STDERR "Analysing...\n";
    my %grosscounts;
    while (my ($date, $rmodhash) = each %$rdetails) {
        while (my ($module, $rverhash) = each %$rmodhash) {
            my @vercounts = sort {$a <=> $b} values %$rverhash;
            $grosscounts{$module} += $vercounts[-1];
            $filtereddownloads += $vercounts[-1];
        }
    }
    \%grosscounts;
}

# Print the module counts and names in descending order of popularity:
sub print_results($) {
    print STDERR
        "Using $filtereddownloads out of $rawdownloads downloads on $nrlines lines.\n",
        "Skipped $igbig from heavy users and a further $igagent apparently from robots.\n\n";
    my $rcounts = $_[0];
    my @sorted = sort {$rcounts->{$b} <=> $rcounts->{$a}} keys %$rcounts;
    print map {sprintf "%-8d%s\n", $rcounts->{$_}, $_}
        @sorted[0 .. NrToPrint - 1];
}

sub main() {
    die "$0 <filename>\n" unless @ARGV == 1;
    my $infile = shift @ARGV;
    open my $fh, "<$infile" or die "Can't open $infile:\n$!\n";
    find_big_users $fh;
    print_results
        condense_multiple_versions
        count_downloads $fh;
}

main;
[~/perl/P100]$ ./rank-modules cpan-gets
Finding heavy users...
Counting downloads...
Analysing...
Using 104411 out of 1067155 downloads on 2328070 lines.
Skipped 767228 downloads from heavy users and a further 177523 apparently from robots.
2745 DBI
2312 File-Scan
1703 DBD-mysql
1219 XML-Parser
1202 HTML-Parser
1034 libwww-perl
984 GD
944 Gtk-Perl
880 Net_SSLeay.pm
859 Tk
827 DBD-Oracle
793 MIME-Base64
756 URI
751 Apache-ASP
746 Compress-Zlib
654 dmake
643 HTML-Template
640 Digest-MD5
602 Time-HiRes
592 Digest-SHA1
587 Archive-Tar
584 Net-Telnet
577 Template-Toolkit
548 Parallel-Pvm
540 XML-Writer
477 Archive-Zip
467 HTML-Tagset
464 libnet
437 Digest
406 AppConfig
401 MIME-tools
385 MailTools
359 Storable
356 Date-Calc
346 Msql-Mysql-modules
339 Test-Simple
338 CGI.pm
324 Module-Build
320 Spreadsheet-WriteExcel
318 SiePerl
317 perl-ldap
316 Net-DNS
314 DB_File
312 PAR
310 CPAN
310 TermReadKey
297 XML-Simple
297 IO-String
292 TimeDate
291 GDGraph
289 MIME-Lite
287 IO-stringy
287 Crypt-SSLeay
284 Curses
282 DBD-DB2
278 calendar
278 DateManip
277 Net-SNMP
274 Zanas
271 IMAP-Admin
270 MD5
268 ssltunnel
258 sms
257 Digest-HMAC
255 GDTextUtil
252 DBD-ODBC
252 DBD-Pg
245 gmailarchiver
245 IO-Socket-SSL
240 Data-Dumper
239 Mail-Sendmail
232 IOC
225 OLE-Storage_Lite
223 keywordsearch
217 ExtUtils-MakeMaker
206 XML-SAX
205 reboot
200 chres
199 Convert-ASN1
196 App-Info
196 Event
194 CGIscriptor
189 linkcheck
187 Test-Harness
184 glynx
184 Verilog-Perl
181 XLinks
180 Bit-Vector
179 mod_perl
178 SOAP-Lite
176 Expect
174 XML-DOM
174 MARC-Detrans
174 DBD-Sybase
173 Mail-SpamAssassin
172 Excel-Template
172 check_ftp
172 Compress-Zlib-Perl
171 Parse-RecDescent
171 Carp-Clan
[~/perl/P100]$
Update 23 Dec 2004:
I have:
- removed LWP::Simple from the list of ignorable user agents at stvn's suggestion,
- updated the results listing, and
- removed a fantastically noisy debugging statement that I inadvertently left in. (Apologies to anyone who ran the script and got barraged with raw data.)
Markus
# Exclude downloads from agents matching this regex, because they seem to be
# related to mirroring or crawling rather than genuine downloads:
my $rx_agent_ignore = qr/
\. google \. |
\. yahoo \. |
\b LWP::Simple \b |
\b MS\ Search \b |
\b Webmin \b |
\b Wget \b |
\b teoma \b
/x;
Markus, I may be wrong, but I think that CPAN.pm sometimes uses LWP::Simple to download modules, so excluding it would not be a good idea, even though there is a good chance it could also be a spider.
Re^2: Help update the Phalanx 100
by petdance (Parson) on Dec 21, 2004 at 17:10 UTC
Wow, excluding the top 1% of downloaders. Brilliant.
Something else to think about: I'm not concerned about absolute rankings so much as developing strata, as in http://qa.perl.org/phalanx/distros.html. I see this sort of like those "Greatest Albums Of All Time" lists. Maybe you can argue about whether Let It Bleed should come before or after Abbey Road, but both belong in the top 10, well before Pleased To Meet Me or Sign O' The Times.
In my occasional noodlings on this topic, I've wondered what the dependency graph looks like. Which modules are most frequently used in other modules? (Including the recursion -- if A uses B and B uses C, D, and E, then the existence of A should increment the dependency count of C, D, and E, too.) Presumably, core modules would have the most links, but there is likely a second stratum of heavily used utility modules, and so on out to narrow, single-purpose modules for particular applications. (Though those could also be very popular and worthy of inclusion in a top 100 list.)
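A toy sketch of that transitive counting (the dependency map here is made up; real data might come from distribution metadata):

use strict;
use warnings;

# Toy dependency map: each distribution lists its direct dependencies.
my %deps = (
    A => [qw(B)],
    B => [qw(C D E)],
);

# Walk each distribution's dependencies transitively, so A's existence
# also increments the counts of C, D, and E via B.
my %depended_on;
for my $dist (keys %deps) {
    my %seen;
    my @queue = @{ $deps{$dist} };
    while ( my $d = shift @queue ) {
        next if $seen{$d}++;                 # count each dep once per dist
        $depended_on{$d}++;                  # $dist (transitively) uses $d
        push @queue, @{ $deps{$d} || [] };   # leaves have no entry
    }
}

printf "%s %d\n", $_, $depended_on{$_} for sort keys %depended_on;
# Prints: B 1, C 2, D 2, E 2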
-xdg