The Phalanx 100 is a list of the "top 100" modules on CPAN, and by
extension, those that should have the most attention paid to them by
the Phalanx project.
The first time I generated the P100 was over a year ago, and the list
has gone stale. Distributions have changed names (CGI::Kwiki is
now Kwiki, for example). Some distros have come and some have gone.
It's time for an update.
This time, YOU can help determine the P100. The source data, generated
from logs from the main CPAN mirror at pair.com, is available for download
at http://petdance.com/random/cpan-gets.gz. Write code that analyzes
the data, and generates the top 100 modules.
What should your code do? It's up to you! Publish the code somewhere
(use.perl.org, perlmonks, whatever) and let me see it. I'm not sure
whether I'll use someone's results directly, borrow ideas, or do something
else entirely, but the more working code I have to pick from, the better.
Also, the last time I created a P100, I omitted any modules that
were in the core distribution. This time, I do want to include core
modules, although I want them noted somehow. Richard Clamp's
Module::CoreList will be a great help with this.
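For instance, something like this could do the flagging (just a sketch; the
module names below are only examples):

#!/usr/bin/perl
use strict;
use warnings;
use Module::CoreList;

# Flag modules that have ever shipped with the perl core.
for my $mod (qw(CGI DBI Test::Harness)) {
    my $first = Module::CoreList::first_release($mod);
    print $mod, (defined $first ? " [core since perl $first]" : ""), "\n";
}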
Whatever you do, however you do it, I need to know about your code no
later than January 10th, 2005. Email me at andy at petdance.com.
There's going to be an article about the Phalanx project going up on
perl.com soon after that, and I need to have an updated version of the
P100 up (replacing http://qa.perl.org/phalanx/distros.html) by then.
About the data
I used the following code to analyze data from the Apache logs for the
main CPAN mirror at Pair.com from November 1 to December 15th, 2004.
#!/usr/bin/perl

use strict;
use warnings;

my %id;
my $next_id = 10000;

while (<>) {
    next unless m!^\S+ (\S+) .+ "GET ([^"]+) HTTP/\d\.\d" 200!;
    my ($ip,$path) = ($1,$2);

    study $path;

    # Skip directories
    next if $path =~ /\/$/;     # Directory
    next if $path =~ /\/\?/;    # Directory with sort parms

    # Skip certain directories
    next if $path =~ /^\/(icons|misc|ports|src)\//;

    # Skip certain file extensions
    next if $path =~ /\.(rss|html|meta|readme)$/;

    # Skip CPAN & distro maintenance stuff
    next if $path =~ /CHECKSUMS$/;
    next if $path =~ /MIRRORING/;

    # Module list stuff
    next if $path =~ /\Q00whois./;
    next if $path =~ /\Q01mailrc./;
    next if $path =~ /\Q02packages.details/;
    next if $path =~ /\Q03modlist./;

    my $id = ($id{$ip} ||= ++$next_id);

    print "$id $path\n";
}
This gives lines like this:
16395 /authors/id/K/KE/KESTER/WWW-Yahoo-DrivingDirections-0.07.tar.gz
10001 /authors/id/K/KW/KWOOLERY/Buzznet-API-0.01.tar.gz
85576 /authors/id/J/JR/JROGERS/Net-Telnet-3.01.tar.gz
85576 /authors/id/J/JR/JROGERS/Net-Telnet-3.02.tar.gz
85576 /authors/id/J/JR/JROGERS/Net-Telnet-3.03.tar.gz
The 5-digit number is an ID number for a given IP address. I found
that some IPs were routinely slurping down the entire histories of modules,
which will probably skew the statistics toward modules with a lot of revisions.
How should these be accounted for in the analysis? I don't know.
That's one of the reasons that I put this out for all to work on.
I welcome your comments, suggestions and help on this.
Re: Help update the Phalanx 100
by xdg (Monsignor) on Dec 21, 2004 at 03:48 UTC
Would it be possible to regenerate this data including a datestamp? That would facilitate filtering out mass revision downloads, mirroring, etc. As a rough algorithm, I'd guess that counting each distribution (regardless of revision) once per IP per day would be a reasonable first-approximation.
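A rough sketch of what I mean, assuming the regenerated format would be
something like "id date:time path" (the crude version-stripping here is just
a placeholder):

use strict;
use warnings;

# Count each distribution at most once per IP per day.
my (%seen, %count);
while (<>) {
    my ($ip, $date, $path) = m{^(\d+) ([^:]+)\S* (\S+)} or next;
    (my $dist = $path) =~ s{.*/}{};         # keep the filename only
    $dist =~ s/\.(tar\.gz|tgz)$// or next;  # tarballs only
    $dist =~ s/[-_][0-9][0-9a-z._]*$//;     # crude version strip
    $count{$dist}++ unless $seen{$ip}{$date}{$dist}++;
}
printf "%-6d %s\n", $count{$_}, $_
    for grep defined, (sort { $count{$b} <=> $count{$a} } keys %count)[0 .. 99];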
-xdg
As a rough algorithm, I'd guess that counting each distribution (regardless of revision) once per IP per day would be a reasonable first-approximation.
Part of me agrees that revisions should probably be ignored, but then the counts will likely skew upwards if there are many revisions. However, that skew could be viewed as an indicator that the module's development is being followed regularly, and not just downloaded as a curiosity. It's a difficult one.
As for the "once per IP" idea, I have found with log analysis that IPs can be deceiving because of proxy servers. So I think that we should be wary of weighting the IPs too much.
OK, you've now got the timestamp and the user agent. Lotsa Googlebot action there.
Re: Help update the Phalanx 100
by stvn (Monsignor) on Dec 21, 2004 at 04:47 UTC
So, I am giving some thought to this, but in the meantime I thought it would be interesting to see some very basic raw download counts, in particular the raw download counts for my own modules. (After all, CPAN doesn't automatically give this type of info, and it's not every day you get access to nicely cleaned-up logs like this.) So I whipped this up: just grep the raw log for a PAUSE id and pipe the output into this script (passing in 'module' or 'distro', or nothing for both).
BTW, the regex for extracting the module name is horrible and could use a lot of help (*hint* *hint*).
Re: Help update the Phalanx 100
by xdg (Monsignor) on Dec 21, 2004 at 13:05 UTC
This process strikes me as one that is a mix of art and science. An additional thought is that you might wish to cross-check the "top 100" from CPAN logs with the modules that are packaged for inclusion in various *nix distributions, which indicates some degree of demand. Likewise, if you can get access to ActiveState PPM logs, that would be another valuable source of data.
-xdg
Re: Help update the Phalanx 100
by xdg (Monsignor) on Dec 21, 2004 at 16:11 UTC
Procrastination is a joy in the week before Christmas...
I've taken a quick and dirty cut at it (please, no deductions for bad style or inefficiency). Like stvn said, the regex parts could use some work, but, frankly, there are so many special cases of how people name and number stuff that it's hard to cover everything. (Can anyone post what PAUSE and CPAN use internally?) The basic logic I use is this:
- After seeing the dates and UAs and running some basic counts on the number of hits by IP, I decided to simply drop as outliers any IPs in the top 1% of IPs by hitcount. The cutoff winds up being about 64 hits over the period studied.
- Only process things ending in .tar.gz or .tgz, to avoid lots of non-distribution stuff (scripts, readmes, .sig files, etc.)
- Strip off the .tar.gz or .tgz; strip off the distribution version. (An ugly hack of a regex, given all the variations people use, but s/(-|_)[0-9][0-9a-z._]+$// seems to work decently.)
- Strip a trailing .pm, more for style than anything else
- Find the 1% cutoff of IPs
- Count the number of distribution downloads (but only if the downloading IP isn't past the cutoff)
- Print the top 100
There's probably some additional case-by-case cleanup that could be done (e.g., "perl" itself shows up in the top 100), but I think this is a decent start. Code and results follow.
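The heart of the normalization looks roughly like this (a simplified extract
of the idea, not the posted code itself):

# Reduce a download path to a distribution name, per the steps above.
sub dist_name {
    my ($path) = @_;
    my ($file) = $path =~ m{([^/]+)$} or return;
    $file =~ s/\.(tar\.gz|tgz)$// or return;  # tarballs only
    $file =~ s/(-|_)[0-9][0-9a-z._]+$//;      # strip the version (ugly hack)
    $file =~ s/\.pm$//;                       # strip a trailing .pm
    return $file;
}

print dist_name("/authors/id/J/JR/JROGERS/Net-Telnet-3.02.tar.gz"), "\n";
# prints "Net-Telnet"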
-xdg
So I was giving this some thought on my commute this morning while stuck in traffic, and the list produced by your code has actually helped convince me even more that there is a problem with ignoring the revision number.
As I looked over your list, I noticed at the bottom a module of mine, IOC. Rather than being flattered by this, I know that its height on the list is quite artificial. I decided to adopt the XP idea of "release early and release often" with this module. The first released version (0.01) was on Oct. 15th of this year, and there have been 18 subsequent versions released, the last one on the 15th of Dec.; at least 10 versions have been released within the range of this log file (Nov. 1 - Dec. 15th). When I ran my script (see above) like this:
grep 'IOC' ~/Desktop/cpan-gets | perl test.pl
I got the following output:
+---------------------------------------
| Total Downloads by Module
+------+--------------------------------
| 609 | IOC
+---------------------------------------
| Total Downloads by Distro
+------+--------------------------------
| 10 | IOC-0.06.tar.gz
| 10 | IOC-0.01.tar.gz
| 10 | IOC-0.17.tar.gz
| 10 | IOC-0.03.tar.gz
| 10 | IOC-0.04.tar.gz
| 10 | IOC-0.05.tar.gz
| 11 | IOC-0.02.tar.gz
| 18 | IOC-0.07.tar.gz
| 44 | IOC-0.09.tar.gz
| 46 | IOC-0.13.tar.gz
| 50 | IOC-0.12.tar.gz
| 52 | IOC-0.14.tar.gz
| 54 | IOC-0.10.tar.gz
| 59 | IOC-0.08.tar.gz
| 64 | IOC-0.15.tar.gz
| 66 | IOC-0.11.tar.gz
| 85 | IOC-0.16.tar.gz
+------+--------------------------------
Clearly this module is not one of the top 100 on CPAN.
I think we need to give some thought as to how to include revisions in the analysis. My first thought is to take the number of revisions found in the log and use that to somehow weight the results: the more revisions, the less weight, basically. Another thought is to somehow account for the number of downloads per revision. As I mentioned above, the fact that each revision is being downloaded shows that someone is following the development of the module, and that should be taken into account.
In the end I agree, this is going to be a mixture of art and science to come up with these top 100.
That's a very good point and a great example. I tried two more cuts. One is top 100 by average number of downloads per revision. The second is based on the vector sum (sqrt(x**2 + y**2)) of total downloads and average per revision. (Technically, I took the log of the total to flatten the skew, and normalized both metrics to a maximum of 100 before taking the vector sum). That latter one is probably pretty good -- it accounts for both criteria. Depending on one's bias, one could weight the two factors differently in the sum.
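In rough terms, the combined metric works like this (a sketch of the idea, not
the code from my repository; %$total and %$avg are assumed to be computed
already, keyed by distribution):

use strict;
use warnings;
use List::Util qw(max);

# Log-flatten the totals, normalize both metrics to a max of 100, and
# combine them as a vector sum.
sub combined_scores {
    my ($total, $avg) = @_;   # hashrefs: dist => total, dist => avg/revision
    my %log_total = map { $_ => log $total->{$_} } keys %$total;
    my $max_lt  = max values %log_total;
    my $max_avg = max values %$avg;
    my %score;
    for my $dist (keys %$total) {
        my $x = 100 * $log_total{$dist} / $max_lt;
        my $y = 100 * $avg->{$dist} / $max_avg;
        $score{$dist} = sqrt($x**2 + $y**2);
    }
    return \%score;
}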
Results follow. Code for each of my three variations is available from my subversion repository.
-xdg
[~/perl/P100]$ cat rank-modules
#!/usr/bin/perl
use warnings;
use strict;

# Read CPAN module-download logs; find the most popular modules.

###

# Number of modules to list:
sub NrToPrint() {100}

# Any address that pulls more than MaxDownloadsPerDay modules in any one day
# has all its traffic ignored:
sub MaxDownloadsPerDay() {100}

# Exclude downloads from agents matching this regex, because they seem to be
# related to mirroring or crawling rather than genuine downloads:
my $rx_agent_ignore = qr/
    \. google \. |
    \. yahoo \. |
#   \b LWP::Simple \b |
    \b MS\ Search \b |
    \b Webmin \b |
    \b Wget \b |
    \b teoma \b
/x;

# First pass: build a hash of all client addresses that have downloaded more
# than MaxDownloadsPerDay modules in any one day:
my %bigusers;
sub find_big_users($) {
    my $fh = $_[0];
    seek $fh, 0, 0 or die "Can't rewind the input file:\n$!\n";
    print STDERR "Finding heavy users...\n";

    my %hpd;    # hits per day: $hpd{client}{date} = number of hits
    while (<$fh>) {
        my ($client, $date) = m/
            ^ ( \d+ )
            \s+
            ( [^:]+ )
        /x or next;
        # $hpd{$client}{$date} ||= 0;
        ++ $hpd{$client}{$date};
    }

    CLIENT:
    while (my ($client, $rdatehash) = each %hpd) {
        while (my ($date, $count) = each %$rdatehash) {
            undef $bigusers{$client}, next CLIENT
                if $count > MaxDownloadsPerDay;
        }
    }
}

# Second pass: ignoring traffic from heavy clients and robotic user agents,
# build a hash indexed by date, module and version and yielding a count of
# downloads:
my $rx_parse = qr!
    ^
    ( \d+ )                         # Get client ID
    \s
    ( [^:]+ )                       # Get date
    \S+ \s                          # Skip time
    / \S+ /                         # Skip directory
    ( \w \S*? )                     # Get module name
    -                               # Skip delimiter
    ( (?: (?> \d [^.]* ) \.? )+ )   # Get version number
    \. \S+ \s                       # Skip file-type suffix
    " ( .* ) "                      # Get user agent
!x;

my $rawdownloads = 0;
my $igbig = 0;
my $igagent = 0;
my $nrlines;
sub count_downloads($) {
    my $fh = $_[0];
    seek $fh, 0, 0 or die "Can't rewind the input file:\n$!\n";
    print STDERR "Counting downloads...\n";

    my %details;
    while (<$fh>) {
        my ($client, $date, $module, $version, $agent) = /$rx_parse/o
            or next;
        # print;
        # print "Mod $module, ver $version\n";
        ++$rawdownloads;
        ++$igbig,   next if exists $bigusers{$client};
        ++$igagent, next if $agent =~ $rx_agent_ignore;
        ++ $details{$date}{$module}{$version};
    }
    $nrlines = $.;
    \%details;
}

# Third pass: if multiple versions of the same module have been requested on
# the same day, ignore all but the most popular version for that day.  This
# avoids giving extra weight to modules with many historical versions if a
# client downloads all of them.  Produce a hash of gross counts per module:
my $filtereddownloads = 0;
sub condense_multiple_versions($) {
    my $rdetails = $_[0];
    print STDERR "Analysing...\n";

    my %grosscounts;
    while (my ($date, $rmodhash) = each %$rdetails) {
        while (my ($module, $rverhash) = each %$rmodhash) {
            my @vercounts = sort {$a <=> $b} values %$rverhash;
            $grosscounts{$module} += $vercounts[-1];
            $filtereddownloads   += $vercounts[-1];
        }
    }
    \%grosscounts;
}

# Print the module counts and names in descending order of popularity:
sub print_results($) {
    print STDERR
        "Using $filtereddownloads out of $rawdownloads downloads on $nrlines lines.\n",
        "Skipped $igbig from heavy users and a further $igagent apparently from robots.\n\n";

    my $rcounts = $_[0];
    my @sorted = sort {$rcounts->{$b} <=> $rcounts->{$a}} keys %$rcounts;
    print map {sprintf "%-8d%s\n", $rcounts->{$_}, $_}
          @sorted[0 .. NrToPrint - 1];
}

sub main() {
    die "$0 <filename>\n" unless @ARGV == 1;
    my $infile = shift @ARGV;
    open my $fh, "<$infile" or die "Can't open $infile:\n$!\n";

    find_big_users $fh;
    print_results
        condense_multiple_versions
            count_downloads $fh;
}

main;
[~/perl/P100]$ ./rank-modules cpan-gets
Finding heavy users...
Counting downloads...
Analysing...
Using 104411 out of 1067155 downloads on 2328070 lines.
Skipped 767228 downloads from heavy users and a further 177523 apparently from robots.
2745 DBI
2312 File-Scan
1703 DBD-mysql
1219 XML-Parser
1202 HTML-Parser
1034 libwww-perl
984 GD
944 Gtk-Perl
880 Net_SSLeay.pm
859 Tk
827 DBD-Oracle
793 MIME-Base64
756 URI
751 Apache-ASP
746 Compress-Zlib
654 dmake
643 HTML-Template
640 Digest-MD5
602 Time-HiRes
592 Digest-SHA1
587 Archive-Tar
584 Net-Telnet
577 Template-Toolkit
548 Parallel-Pvm
540 XML-Writer
477 Archive-Zip
467 HTML-Tagset
464 libnet
437 Digest
406 AppConfig
401 MIME-tools
385 MailTools
359 Storable
356 Date-Calc
346 Msql-Mysql-modules
339 Test-Simple
338 CGI.pm
324 Module-Build
320 Spreadsheet-WriteExcel
318 SiePerl
317 perl-ldap
316 Net-DNS
314 DB_File
312 PAR
310 CPAN
310 TermReadKey
297 XML-Simple
297 IO-String
292 TimeDate
291 GDGraph
289 MIME-Lite
287 IO-stringy
287 Crypt-SSLeay
284 Curses
282 DBD-DB2
278 calendar
278 DateManip
277 Net-SNMP
274 Zanas
271 IMAP-Admin
270 MD5
268 ssltunnel
258 sms
257 Digest-HMAC
255 GDTextUtil
252 DBD-ODBC
252 DBD-Pg
245 gmailarchiver
245 IO-Socket-SSL
240 Data-Dumper
239 Mail-Sendmail
232 IOC
225 OLE-Storage_Lite
223 keywordsearch
217 ExtUtils-MakeMaker
206 XML-SAX
205 reboot
200 chres
199 Convert-ASN1
196 App-Info
196 Event
194 CGIscriptor
189 linkcheck
187 Test-Harness
184 glynx
184 Verilog-Perl
181 XLinks
180 Bit-Vector
179 mod_perl
178 SOAP-Lite
176 Expect
174 XML-DOM
174 MARC-Detrans
174 DBD-Sybase
173 Mail-SpamAssassin
172 Excel-Template
172 check_ftp
172 Compress-Zlib-Perl
171 Parse-RecDescent
171 Carp-Clan
[~/perl/P100]$
Update 23 Dec 2004:
I have:
- removed LWP::Simple from the list of ignorable user agents at stvn's suggestion,
- updated the results listing, and
- removed a fantastically noisy debugging statement that I inadvertently left in. (Apologies to anyone who ran the script and got barraged with raw data.)
Markus
Wow, excluding the top 1% of downloaders. Brilliant.
Something else to think about: I'm not concerned about absolute rankings so much as developing strata, as in http://qa.perl.org/phalanx/distros.html. I see this sort of like those "Greatest Albums Of All Time" lists. Maybe you can argue about whether Let It Bleed should come before or after Abbey Road, but both belong in the top 10, well before Pleased To Meet Me or Sign O' The Times.
In my occasional noodlings on this topic, I've wondered what the dependency graph looks like. Which modules are most frequently used in other modules? (Including the recursion -- if A uses B and B uses C, D, and E, then the existence of A should increment the dependency count of C, D, and E, too.) Presumably, core modules would have the most links, but there is likely a second stratum of heavily used utility modules, and so on out to narrow, single-purpose modules for particular applications. (Though those could also be very popular and worthy of inclusion in a top 100 list.)
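A toy sketch of that recursive count, with a made-up %requires map standing in
for real prerequisite data:

use strict;
use warnings;

# Walk each distribution's transitive prerequisites, incrementing the
# count of every module reachable from it.
my %requires = (
    A => ['B'],
    B => [qw(C D E)],
);
my %count;
for my $dist (keys %requires) {
    my %seen;
    my @queue = @{ $requires{$dist} };
    while (my $dep = shift @queue) {
        next if $seen{$dep}++;
        $count{$dep}++;
        push @queue, @{ $requires{$dep} || [] };
    }
}
printf "%2d %s\n", $count{$_}, $_ for sort keys %count;
# prints: 1 B, 2 C, 2 D, 2 E -- A's existence bumped C, D, and E too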
-xdg
Re: Help update the Phalanx 100
by petdance (Parson) on Dec 21, 2004 at 05:56 UTC
Re: Help update the Phalanx 100
by DrHyde (Prior) on Dec 21, 2004 at 10:21 UTC
This sounds like a job for ... AWK!
As an aside, it would be interesting to see similar data for a non-US mirror, or indeed for several mirrors from different countries. Several modules are US-specific (e.g. Geo::Postalcode) and others are of little interest to USians.
Re: Help update the Phalanx 100
by stvn (Monsignor) on Dec 21, 2004 at 17:12 UTC
I was talking to dragonchild and he brought up a good point: many of these downloads could be just CPAN smoke testers. I then talked to Andy, and he was unsure whether there is any way to tell if they were CPAN smoke testers. Anyone out there have any ideas on how we might be able to tell?
Don't smoke testers send an automated result back to CPAN? If they do, one could decrement the count by one for every smoke-test result for a given module...
Re: Help update the Phalanx 100
by Maddingue (Sexton) on Dec 26, 2004 at 02:30 UTC
Here is a first list of the modules most frequently listed as prereqs in META.yml files (only about a third of the distributions on CPAN have one). The dirty scripts which produced this are available on demand.
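For the curious, the core of it looks something like this (a sketch only; the
unpacked-mirror layout and the use of YAML.pm are my assumptions, not the
actual dirty scripts):

use strict;
use warnings;
use File::Find;
use YAML qw(LoadFile);

# Tally the modules named in the "requires" section of every META.yml
# found under a directory of unpacked CPAN distributions.
my $root = shift @ARGV or die "usage: $0 <unpacked-cpan-dir>\n";
my %count;
find(sub {
    return unless $_ eq 'META.yml';
    my $meta = eval { LoadFile($File::Find::name) } or return;
    $count{$_}++ for keys %{ $meta->{requires} || {} };
}, $root);
printf "%5d %s\n", $count{$_}, $_
    for sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;

The resulting counts: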
87 DBI
63 URI
63 Kwiki
58 Params::Validate
56 Template
56 DateTime
55 LWP::UserAgent
52 POE
46 YAML
41 WWW::Mechanize
40 Class::Accessor
34 LWP
33 XML::LibXML
32 Class::DBI
32 URI::Escape
28 Error
28 HTML::Parser
27 Compress::Zlib
25 mod_perl
25 Parse::RecDescent
24 LWP::Simple
24 Test::Exception
24 Date::Parse
24 XML::Parser
24 Time::Piece
23 Tk
22 Email::Simple
22 HTTP::Request
22 HTML::Entities
22 HTML::Mason
21 File::Find::Rule
21 Test::Pod
20 IO::Scalar
20 Apache::Request
20 Digest::SHA1
20 Clone
19 CGI::Application
18 IO::Zlib
18 File::Slurp
17 XML::Simple
17 UNIVERSAL::require
17 Exception::Class
17 GD
16 DBD::SQLite
16 XML::RSS
15 Class::MethodMaker
15 Cache::Cache
15 Pod::Simple
15 Spiffy
15 Class::Data::Inheritable
15 XML::SAX
14 Image::Size
14 Maypole
14 Carp::Assert
14 HTML::TokeParser
14 Date::Calc
14 Test::Builder::Tester
13 Class::Accessor::Fast
13 Log::Log4perl
13 Object::props
12 HTML::Template
12 CGI::Session
12 DBD::mysql
12 Inline
12 Log::Dispatch
12 XML::XPath
12 HTTP::Cookies
11 SOAP::Lite
11 HTML::TreeBuilder
11 Email::Address
11 XML::LibXSLT
11 Net::LDAP
11 Tie::IxHash
11 Regexp::Common
11 Win32::OLE
11 Term::ReadKey
11 Archive::Tar
10 Net::DNS
10 IO::String
10 HTTP::Date
10 Module::Build
10 CGI::Builder
10 Apache
10 Date::Manip
10 load
10 CGI::Wiki
10 Archive::Zip
10 Wx
10 Apache::Cookie
10 Image::Magick
9 DateTime::Set
9 Gtk2
9 Class::Accessor::Chained
9 Apache::Test
9 IO::All
9 Algorithm::Diff
9 Mail::Sendmail
9 Test::Pod::Coverage
9 Email::MIME
9 Date::Format
9 Mail::Address
9 Math::Symbolic
8 Class::props
8 Email::Valid
8 HTML::TokeParser::Simple
8 Exporter::Lite
8 Apache::Session
8 MP3::Info
8 Locale::Maketext::Simple
7 Object::groups
7 File::Type
7 GraphViz
7 BerkeleyDB
7 Mail::Internet
7 Config::General
7 URI::URL
7 HTTP::Response
7 HTTP::Status
7 Module::Pluggable
7 Text::Template
Re: Help update the Phalanx 100
by gmpassos (Priest) on Dec 23, 2004 at 15:28 UTC
I think that this list of 100 modules should be generated by voting, not from a log. What's important is not to help the modules that are already in use, but to pick up modules that can bring new resources and possibilities to Perl.
For example, wxPerl brings a whole new world to Perl and really should be on the list, but the log won't show that, because most users get this module as binaries from SourceForge, since it needs too many things to compile your own binaries.
We also have a lot of new modules that are very interesting and should be on the list, as well as CORE modules that are not downloaded separately, like threads, which are not 100% ready yet.
Graciliano M. P.
"Creativity is the expression of liberty".
Re: Help update the Phalanx 100
by cosimo (Hermit) on Jan 04, 2005 at 13:20 UTC
Hi Andy,
I put the log entries in a Postgres DB, and now I'm running some SQL queries with a run-and-check-the-results approach. What I'm trying to do is apply my own "common sense" interpretation to that data and see what comes out. Tonight I should see some interesting results.
Put it in a database...
by Anonymous Monk on Jan 10, 2005 at 19:11 UTC
I worked on this over the holidays. I got something kludged together that generated results in less than 4 minutes, using only the IP IDs that had been seen on 100 or fewer GETs. I achieved this by munging the cpan-gets logfile and turning it into a database dump file suitable for import into SQLite or PostgreSQL. Since the problem is fundamentally about munging out the significant data and querying it, a structured query language seemed a natural fit. The issue of counting core modules and recursive dependencies is outside the domain of what I chose to focus on.
I've attached my script. It takes the name of the log file as a parameter and generates a phalanx.dump file in the current working directory. You can then use a command like:
cat phalanx.dump|sqlite3 phalanx.db
to create the database. It isn't particularly elegant, but it gets the job done.
Do you realize that 277,373 of the cpan-gets records are duplicates?
I broke each record out into separate vars for timestamp, IP address ID, module, module version, and user agent. While doing this for each record, I generated IDs for each unique module, module/version, and user-agent string. Finally, I generated a unique GET ID using all of the above as the hash key, i.e. something like:
while (...) {
    ...
    if ($g{$ts, $ip_id, $mod_id, $modver_id, $agent_id}) {
        $duplicates++;
    } else {
        $g{$ts, $ip_id, $mod_id, $modver_id, $agent_id} = ++$gid;
    }
}
After processing cpan-gets, printing the value of $duplicates gives 277373, so I highly suspect cpan-gets contains duplicate data. It isn't likely, after all, that the exact same version of a module was downloaded by the exact same IP and user agent at the exact same time...
Here's an SQL query to get the top 100:
SELECT Module.Name, COUNT(Gets.Module_id)
FROM Module
INNER JOIN Gets
    ON Module.ID = Gets.Module_id
INNER JOIN (
    SELECT IPAddr_id, COUNT(*) AS Tally
    FROM Gets
    GROUP BY IPAddr_id
    HAVING Tally <= 100
) AS gip
    ON Gets.IPAddr_id = gip.IPAddr_id
GROUP BY Module.Name
ORDER BY COUNT(Gets.Module_id) DESC;
I'd never used SQLite directly before, so I wasn't sure how best to twiddle the indexes to give it reasonable performance. I was very disappointed with the performance of queries using simple joins. Without indexes, the performance is awful. Using the sub-select makes it churn long enough for the power-saving setting on my laptop to kick into hibernation. Probably yanking the sub-select out into a temporary table and putting indexes on that would give better results, but I ran out of time. So, if you're using SQLite, you might want to run the simplified version that follows.
SELECT Module.Name, COUNT(Gets.Module_id)
FROM Module
INNER JOIN Gets
    ON Module.ID = Gets.Module_id
GROUP BY Module.Name
ORDER BY COUNT(Gets.Module_id) DESC;
The nice thing about putting the data into a database, is that it then becomes simple for people to tinker with and generate their own queries...
Code follows...
#!/usr/bin/perl -s -w
our (%month);
@month{qw{Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec}} = (1..12);

my ($id, $ts, $path, $agent, $ext, $module, $version);
my (%ip, %a, %m, %mv, %g);
my ($aid, $mid, $mvid, $gid);
$aid = $mid = $mvid = $gid = 0;
my $count = 0;
my $dups = 0;

while (<>) {
    next unless m/^(\S+) (\S+) (\S+) "(.+)"$/;
    ($id, $ts, $path, $agent) = ($1, $2, $3, $4);
    $agent =~ s/'/''/g;    # escape quotes for the SQL dump
    next unless $path =~ m/\.(tar\.gz|par|tgz|zip|pm|pm\.gz)$/;
    $ext = $1;

    # DD/MMM/YYYY:HH:mm:SS => YYYY/MM/DD HH:mm:SS
    $ts =~ s/(\d{2})\/(\w{3})\/(\d{4}):(.+)$/$3\/$month{$2}\/$1 $4/;

    # /authors/id/S/SP/SPACEMAN/WebJay.pm
    ($module)  = $path =~ m/([^\/]+)\Q.$ext\E$/;
    ($version) = $module =~ m/-([^-]+)$/;
    $module =~ s/-([^-]+)$//;
    $module =~ s/\.pm$//;
    $version ||= 'undef';

    $ip{$id}                  ||= $id;
    $a{$agent}                ||= ++$aid;
    $m{$module}               ||= ++$mid;
    $mv{$m{$module},$version} ||= ++$mvid;

    # Key on the looked-up IDs, not the running counters (the counters
    # only change when something new is seen):
    if ($g{$ts, $id, $m{$module}, $mv{$m{$module},$version}, $a{$agent}}) {
        $dups++;
    } else {
        $g{$ts, $id, $m{$module}, $mv{$m{$module},$version}, $a{$agent}} = ++$gid;
    }
    if ($count++ >= 1000) {
        $count = 0;
        print '.';
    }
}
print "$dups duplicates exist\n";

open OUT, '>phalanx.dump' or die "Can't write phalanx.dump: $!\n";
print OUT <<EOF;
BEGIN TRANSACTION;
CREATE TABLE IPAddr (ID INTEGER PRIMARY KEY);
EOF
print OUT <<"EOF" for sort {$a <=> $b} keys %ip;
INSERT INTO "IPAddr" VALUES($_);
EOF

print OUT "CREATE TABLE Agent (ID INTEGER PRIMARY KEY, Desc VARCHAR(256));\n";
print OUT <<"EOF" for sort {$a{$a} <=> $a{$b}} keys %a;
INSERT INTO "Agent" VALUES($a{$_}, '$_');
EOF

print OUT "CREATE TABLE Module (ID INTEGER PRIMARY KEY, Name VARCHAR(256));\n";
print OUT <<"EOF" for sort {$m{$a} <=> $m{$b}} keys %m;
INSERT INTO "Module" VALUES($m{$_}, '$_');
EOF

my $chr28 = chr(28);    # $; -- the default multidimensional-hash key separator
print OUT "CREATE TABLE ModVer (ID INTEGER PRIMARY KEY, Module_id INTEGER,\n Version VARCHAR(16));\n";
for my $key (sort {$mv{$a} <=> $mv{$b}} keys %mv) {
    my ($mid, $version) = split /$chr28/, $key;
    my ($mvid) = $mv{$key};
    print OUT <<"EOF";
INSERT INTO "ModVer" VALUES($mvid, $mid, '$version');
EOF
}

# $g{$ts,$id,$mid,$mvid,$aid} ||= ++$gid;
print OUT "CREATE TABLE Gets (ID INTEGER PRIMARY KEY, ts TIMESTAMP, IPAddr_id INTEGER,\n Module_id INTEGER, ModVer_id INTEGER, Agent_id INTEGER);\n";
for my $key (sort {$g{$a} <=> $g{$b}} keys %g) {
    my ($ts, $id, $mid, $mvid, $aid) = split /$chr28/, $key;
    my ($gid) = $g{$key};
    print OUT <<"EOF";
INSERT INTO "Gets" VALUES($gid, '$ts', $id, $mid, $mvid, $aid);
EOF
}
print OUT <<EOF;
CREATE INDEX fk_module_id ON Gets (Module_id);
COMMIT;
EOF

__END__
SELECT Module.Name, COUNT(Gets.Module_id)
FROM Module
INNER JOIN Gets
    ON Module.ID = Gets.Module_id
INNER JOIN (
    SELECT IPAddr_id, COUNT(*) AS Tally
    FROM Gets
    GROUP BY IPAddr_id
    HAVING Tally <= 100
) AS gip
    ON Gets.IPAddr_id = gip.IPAddr_id
GROUP BY Module.Name
ORDER BY COUNT(Gets.Module_id) ASC;

SELECT IPAddr_id, COUNT(*) FROM Gets GROUP BY IPAddr_id ORDER BY IPAddr_id ASC;