PerlMonks  

The Phalanx 100 is a list of the "top 100" modules on CPAN, and by extension, those that should have the most attention paid to them by the Phalanx project.

The first time I generated the P100 was over a year ago, and the list has gone stale. Distributions have changed names (CGI::Kwiki is now Kwiki, for example). Some distros have come and some have gone. It's time for an update.

This time, YOU can help determine the P100. The source data, generated from logs from the main CPAN mirror at pair.com, is available for download at http://petdance.com/random/cpan-gets.gz. Write code that analyzes the data, and generates the top 100 modules.

What should your code do? It's up to you! Publish the code somewhere (use.perl.org, perlmonks, whatever) and let me see it. I'm not sure if I'll take someone's decisions directly, or use ideas, or how I'll do it, but the more working code I have to pick from, the better.

Also, the last time I created a P100, I omitted any modules that were in the core distribution. This time, I do want to include core modules, although I do want to have them noted somehow. Richard Clamp's Module::CoreList will be a great help with this.
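As a rough sketch of how core modules could be flagged, the snippet below asks Module::CoreList for the first perl release that shipped a module. The helper names `dist_to_module` and `core_since` are made up for illustration, and turning a distribution name into a module name by swapping hyphens for `::` is only a heuristic that will not hold for every distribution.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Module::CoreList;

# Guess a module name from a distribution name (Net-Telnet => Net::Telnet).
# Heuristic assumption: not every distribution is named after its main module.
sub dist_to_module {
    my ($dist) = @_;
    (my $module = $dist) =~ s/-/::/g;
    return $module;
}

# Return the first perl release that shipped the module, or undef if
# Module::CoreList has no record of it ever being in core.
sub core_since {
    my ($dist) = @_;
    return Module::CoreList->first_release( dist_to_module($dist) );
}

for my $dist (qw( File-Spec Net-Telnet )) {
    my $since = core_since($dist);
    printf "%-12s %s\n", $dist,
        defined $since ? "core since perl $since" : 'not core';
}
```

A ranking script could call something like `core_since` on each entry and print a marker next to the ones that come back defined, rather than dropping them from the list.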

Whatever you do, however you do it, I need to know about your code no later than January 10th, 2005. Email me at andy at petdance.com. There's going to be an article about the Phalanx project going up on perl.com soon after that, and I need to have an updated version of the P100 up (replacing http://qa.perl.org/phalanx/distros.html) by then.

About the data

I used the following code to analyze data from the Apache logs for the main CPAN mirror at pair.com from November 1st to December 15th, 2004.

    #!/usr/bin/perl

    use strict;
    use warnings;

    my %id;
    my $next_id = 10000;

    while (<>) {
        next unless m!^\S+ (\S+) .+ "GET ([^"]+) HTTP/\d\.\d" 200!;
        my ($ip,$path) = ($1,$2);
        study $path;

        # Skip directories
        next if $path =~ /\/$/;  # Directory
        next if $path =~ /\/\?/; # Directory with sort parms

        # Skip certain directories
        next if $path =~ /^\/(icons|misc|ports|src)\//;

        # Skip certain file extensions
        next if $path =~ /\.(rss|html|meta|readme)$/;

        # Skip CPAN & distro maintenance stuff
        next if $path =~ /CHECKSUMS$/;
        next if $path =~ /MIRRORING/;

        # Module list stuff
        next if $path =~ /\Q00whois./;
        next if $path =~ /\Q01mailrc./;
        next if $path =~ /\Q02packages.details/;
        next if $path =~ /\Q03modlist./;

        my $id = ($id{$ip} ||= ++$next_id);
        print "$id $path\n";
    }

This gives lines like this:

    16395 /authors/id/K/KE/KESTER/WWW-Yahoo-DrivingDirections-0.07.tar.gz
    10001 /authors/id/K/KW/KWOOLERY/Buzznet-API-0.01.tar.gz
    85576 /authors/id/J/JR/JROGERS/Net-Telnet-3.01.tar.gz
    85576 /authors/id/J/JR/JROGERS/Net-Telnet-3.02.tar.gz
    85576 /authors/id/J/JR/JROGERS/Net-Telnet-3.03.tar.gz

The 5-digit number is an ID number for a given IP address. I found that some IPs were routinely slurping down entire histories of modules, which will probably skew the statistics toward distributions with a lot of revisions.
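For anyone starting on the data, here is one possible way to split a line into its ID, distribution name and version. The sub name `parse_line` is made up for this sketch, and it assumes tarballs are named `Dist-Name-VERSION` with a common archive extension; files that don't fit that pattern are simply skipped.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split one "ID path" line into (id, distribution, version).
# Returns an empty list for anything that is not an author tarball.
sub parse_line {
    my ($line) = @_;
    my ($id, $path) = $line =~ /^(\d+)\s+(\S+)/ or return;

    # Keep only the file name, minus a recognised archive extension.
    my ($file) = $path =~ m{^/authors/id/.*/([^/]+)\.(?:tar\.gz|tgz|tar\.bz2|zip)$}
        or return;

    # Assumption: the version is the last hyphen-separated chunk and
    # starts with a digit (optionally prefixed with "v").
    my ($dist, $version) = $file =~ /^(.+)-v?(\d[^-]*)$/ or return;

    return ($id, $dist, $version);
}

my @fields = parse_line('85576 /authors/id/J/JR/JROGERS/Net-Telnet-3.01.tar.gz');
print "@fields\n";    # prints "85576 Net-Telnet 3.01"
```

Working with the distribution name rather than the full path means that Net-Telnet-3.01, 3.02 and 3.03 all count toward the same entry.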

How should these be accounted for in the analysis? I don't know. That's one of the reasons that I put this out for all to work on.
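One idea, offered only as a starting point and not as the method that will be used: count each ID at most once per distribution, so a client that fetches every release of something contributes a single hit. The sub name `rank_distributions` is made up for this sketch, and it assumes the same `Dist-Name-VERSION` file naming as the sample lines above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rank distributions by the number of distinct downloader IDs.
# Takes an array ref of "ID path" lines and a cutoff; returns a list
# of [ distribution, distinct_id_count ] pairs, highest count first.
sub rank_distributions {
    my ($lines, $top_n) = @_;

    my %seen;    # $seen{$dist}{$id} = 1
    for my $line (@$lines) {
        my ($id, $file) = $line =~ m{^(\d+)\s+/authors/id/\S*/([^/\s]+)$}
            or next;
        # Assumed naming scheme: Dist-Name-VERSION.ext
        my ($dist) = $file =~ /^(.+)-v?\d[^-]*$/ or next;
        $seen{$dist}{$id} = 1;
    }

    my %count;
    $count{$_} = keys %{ $seen{$_} } for keys %seen;

    # Sort by count, breaking ties alphabetically for stable output.
    my @ranked = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
    splice @ranked, $top_n if @ranked > $top_n;

    return map { [ $_, $count{$_} ] } @ranked;
}

my @sample = (
    '16395 /authors/id/J/JR/JROGERS/Net-Telnet-3.03.tar.gz',
    '10001 /authors/id/K/KW/KWOOLERY/Buzznet-API-0.01.tar.gz',
    '85576 /authors/id/J/JR/JROGERS/Net-Telnet-3.01.tar.gz',
    '85576 /authors/id/J/JR/JROGERS/Net-Telnet-3.02.tar.gz',
    '85576 /authors/id/J/JR/JROGERS/Net-Telnet-3.03.tar.gz',
);
printf "%-12s %d\n", @$_ for rank_distributions(\@sample, 100);
```

In the sample, ID 85576 pulls three Net-Telnet releases but is counted once, so Net-Telnet ends up with two distinct downloaders rather than four hits. This does nothing about a single machine hiding behind several IPs, or many users behind one proxy.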

I welcome your comments, suggestions and help on this.

xoxo,
Andy


In reply to Help update the Phalanx 100 by petdance
