Syntactic Confectionery Delight | |
PerlMonks |
Put it in a database...by Anonymous Monk |
on Jan 10, 2005 at 19:11 UTC ( [id://421046]=note: print w/replies, xml ) | Need Help?? |
I worked on this over the holidays. I got something kludged together that generated results in less than 4 minutes using only the ip id's that had been used for 100 or fewer gets. I achieved this by munging the cpan-gets logfile and turning it into a database dump file suitable for import into SQLite or PostgreSQL. Since the problem is fundamentally about munging out the significant data and querying it, using a structured querying language seemed a natural fit. The issue of counting core modules and recursive dependencies is outside the domain of what I chose to focus on. I've attached my script. It takes the name of the log file as a parameter and generates a phalanx.dump file in the current working directory. You can then use a command like:
cat phalanx.dump|sqlite3 phalanx.db To create the database. It isn't particularly elegant, but it gets the job done. Do you realize that 277,373 of the cpan-gets records are duplicates? I broke each record out into separate vars for timestamp, ipaddr id, module, module version, and useragent. While doing this for each record, I generated id's for each unique module, module/version, and useragent string. Finally, I generated a unique get id using all of the above as the hash key. I.e. something like: while (...) { ... if ($g{$ts, $ip_id, $mod_id, $modver_id, $agent_id}) { $duplicates++; } else { $g{$ts, $ip_id, $mod_id, $modver_id, $agent_id} = ++$gid; } } After processing cpan-gets, printing the value of $duplicates gives: 277373. I highly suspect cpan-gets has duplicate data. It isn't likely after all that the exact same version of a module was downloaded by the exact same IP and UserAgent at the exact same time... Here's an SQL query to get the top 100: SELECT Module.Name, COUNT(Gets.Module_id) FROM Module INNER JOIN Gets ON Module.ID = Gets.Module_id INNER JOIN ( SELECT IPAddr_id, COUNT(*) as Tally FROM Gets GROUP BY IPAddr_id HAVING Tally <= 100 ) as gip ON Gets.IPAddr_id = gip.IPAddr_id GROUP BY Module.Name ORDER BY COUNT(Gets.Module_id) DESC; I'd never used SQLite directly before, so I wasn't sure how best to twiddle the indexes to give it reasonable performance. I was very disappointed with the performance of queries using simple joins. Without indexes, the performance is awful. Using the sub-select makes it churn long enough for the powersaving setting on my laptop to kick into hibernation. Probably yanking the sub-select out into a temporary table and putting indexes on that would give better results, but I ran out of time. So, if you're using SQLite, you might want to run the simpified version that follows. SELECT Module.Name, COUNT(Gets.Module_id) FROM Module INNER JOIN Gets ON Module.ID = Gets.Module_id GROUP BY Module.Name ORDER BY COUNT(Gets.Module_id) DESC; The nice thing about putting the data into a database, is that it then becomes simple for people to tinker with and generate their own queries... Code follows... #!/usr/bin/perl -s -w our (%month); @month{qw{Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec}} = (1..12); my ($id, $ts, $path, $agent, $ext, $module, $version); my (%ip, %a, %m, %mv, %g); my ($aid, $mid, $mvid, $gid); $aid=$mid=$mvid=$gid=0; my $count=0; my $dups=0; while (<>) { next unless m/^(\S+) (\S+) (\S+) "(.+)"$/; ($id,$ts,$path,$agent) = ($1,$2,$3,$4); $agent=~s/'/''/g; next unless $path =~ m/\.(tar\.gz|par|tgz|zip|pm|pm\.gz)$/; $ext = $1; # DD/MMM/YYYY:HH:mm:SS => YYYY/MM/DD HH:mm:SS $ts =~ s/(\d{2})\/(\w{3})\/(\d{4}):(.+)$/$3\/$month{$2}\/$1 $4/; # /authors/id/S/SP/SPACEMAN/WebJay.pm ($module) = $path =~ m/(^\/+)\Q.$ext\E$/; ($version) = $module =~ m/(?:-(^-+))$/; $module =~ s/(?:-(^-+))$//; $module =~ s/\.pm$//; $version ||= 'undef'; $ip{$id}||=$id; $a{$agent}||=++$aid; $m{$module}||=++$mid; $mv{$m{$module},$version}||=++$mvid; if ($g{$ts,$id,$mid,$mvid,$aid}) { $dups++; } else { $g{$ts,$id,$mid,$mvid,$aid}=++$gid; } if ($count++>=1000) { $count=0; print '.'; } } print "$dups duplicates exist\n"; open OUT, '>phalanx.db'; print OUT <<EOF; BEGIN TRANSACTION; CREATE TABLE IPAddr (ID INTEGER PRIMARY KEY); EOF print OUT <<"EOF" for sort {$a<=>$b} keys %ip; INSERT INTO "IPAddr" VALUES($_); EOF print OUT "CREATE TABLE Agent (ID INTEGER PRIMARY KEY, Desc VARCHAR(256));\n"; print OUT <<"EOF" for sort {$a{$a}<=>$a{$b}} keys %a; INSERT INTO "Agent" VALUES($a{$_}, '$_'); EOF print OUT "CREATE TABLE Module (ID INTEGER PRIMARY KEY, Name VARCHAR(256));\n"; print OUT <<"EOF" for sort {$m{$a}<=>$m{$b}} keys %m; INSERT INTO "Module" VALUES($m{$_}, '$_'); EOF my $chr28=chr(28); print OUT "CREATE TABLE ModVer (ID INTEGER PRIMARY KEY, Module_id INTEGER,\n Version VARCHAR(16));\n"; for my $key (sort {$mv{$a}<=>$mv{$b}} keys %mv) { my ($mid,$version) = split /$chr28/, $key; my ($mvid) = $mv{$key}; print OUT <<"EOF"; INSERT INTO "ModVer" VALUES($mvid, $mid, '$version'); EOF } # $g{$ts,$id,$mid,$mvid,$aid}||=++$gid; print OUT "CREATE TABLE Gets (ID INTEGER PRIMARY KEY, ts TIMESTAMP, IPAddr_id INTEGER,\n Module_id INTEGER, ModVer_id INTEGER, Agent_id INTEGER);\n"; for my $key (sort {$g{$a}<=>$g{$b}} keys %g) { my ($ts,$id,$mid,$mvid,$aid) = split /$chr28/, $key; my ($gid) = $g{$key}; print OUT <<"EOF"; INSERT INTO "Gets" VALUES($gid, '$ts', $id, $mid, $mvid, $aid); EOF } print OUT <<EOF; CREATE INDEX fk_module_id ON Gets (Module_id); COMMIT; EOF __END__ SELECT Module.Name, COUNT(Gets.Module_ID) FROM Module INNER JOIN Gets ON Module.ID = Gets.Module_id INNER JOIN ( SELECT IPAddr_id, COUNT(*) as Tally FROM Gets GROUP BY IPAddr_id HAVING Tally <= 100 ) as gip ON Gets.IPAddr_id = gip.IPAddr_id GROUP BY Module.Name ORDER BY COUNT(Gets.Module_ID) ASC; SELECT IPAddr_ID, COUNT(*) FROM Gets GROUP BY IPAddr_ID ORDER BY IPAddr_id ASC;
In Section
Meditations
|
|