Procrastination is a joy in the week before Christmas...

I've taken a quick and dirty cut at it (please, no deductions for bad style or inefficiency). Like stvn said, the regex parts could use some work, but, frankly, there are so many special cases of how people name and number stuff that it's hard to cover everything. (Can anyone post what PAUSE and CPAN use internally?) The basic logic I use is this:

  • After seeing the dates and UA's and running some basic counts on the number of hits by IP, I decided to simply drop as outliers any IPs in the top 1% of IPs by hitcount. The cutoff winds up being about 64 hits over the period studied.
  • Only process things ending in .tar.gz or .tgz to avoid lots of non-distribution stuff (scripts, readme, .sig, etc.)
  • Strip off the .tar.gz or .tgz; strip off the distribution version. (Ugly hack of a regex, given all the variations people use, but s/(-|_)[0-9][0-9a-z._]+$// seems to work decently.)
  • Strip a trailing .pm, more for style than anything else
  • Find the 1% cutoff of IPs
  • Count the number of distribution downloads (but only if the downloading IP isn't past the cutoff.)
  • Print the top 100
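As a minimal sketch of the two trickier steps above (name normalization and the IP cutoff), here is one way they might be isolated into small subs. The sub names `normalize_dist` and `ip_cutoff` and the sample filenames in the usage note are made up for illustration, and the cutoff is taken to be a hit count, which is how the list describes it.

```perl
use strict;
use warnings;

# Reduce an archive filename to a bare distribution name.
# Returns nothing (undef in scalar context) for non-archives.
sub normalize_dist {
    my ($file) = @_;
    return unless $file =~ s/\.(?:tar\.gz|tgz)$//;   # archives only
    $file =~ s/[-_][0-9][0-9a-z._]+$//;              # drop trailing version
    $file =~ s/\.pm$//;                              # drop trailing .pm
    return $file;
}

# Hit count of the IP that sits at the given fraction of the ranking
# (IPs sorted by hits, busiest first). IPs with fewer hits are kept.
sub ip_cutoff {
    my ($hits, $fraction) = @_;
    my @ranked = sort { $hits->{$b} <=> $hits->{$a} } keys %$hits;
    return $hits->{ $ranked[ int( $fraction * @ranked ) ] };
}
```

With these, a hypothetical `DBI-1.45.tar.gz` or `Net_SSLeay.pm-1.25.tar.gz` would reduce to `DBI` and `Net_SSLeay`, while `README` would be skipped. Note that the version regex wants at least two characters after the separator, so a name like `Foo-2.tar.gz` would keep its `-2`.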

There's probably some additional case-by-case cleanup that could be done (e.g., "perl" itself is in the top 100), but I think this is a decent start. Code and results follow.

#!/usr/bin/perl
use warnings;
use strict;
use Data::Hash::Totals;

my @log;
while (my $line = <>) {
    next unless $line =~ m!(\S+)\s+(\S+)\s+\S*/(\S+)\s!;
    my ($ip, $date, $dist) = ($1, $2, $3);
    next unless $dist =~ s/\.(tar\.gz|tgz)$//;   # distributions only
    $dist =~ s/(-|_)[0-9][0-9a-z._]+$//;         # strip the version
    $dist =~ s/\.pm$//;                          # strip a trailing .pm
    push @log, { ip => $ip, date => $date, dist => $dist };
}

# Count IP and find a cutoff for the 99%ile of downloaders
my %ip;
for my $line (@log) {
    $ip{$line->{ip}}++;
}
my $cut = int( 0.01 * keys %ip );
# Look up the hit count of the IP at the cutoff rank (not the IP itself)
my $cutoff = $ip{ [ sort { $ip{$b} <=> $ip{$a} } keys %ip ]->[$cut] };

# Tally distributions for everyone else
my %dist;
for my $line (@log) {
    $dist{$line->{dist}}++ if $ip{$line->{ip}} < $cutoff;
}

my %top100;
$top100{$_} = $dist{$_}
    for splice( @{[ sort { $dist{$b} <=> $dist{$a} } keys %dist ]}, 0, 100 );
print as_table(\%top100);

17596 Net_SSLeay
13732 DBD-mysql
11138 DBI
 8226 perl-ldap
 7542 Mail-SpamAssassin
 5528 GD
 5440 libwww-perl
 4557 HTML-Parser
 3865 Digest-SHA1
 3449 Digest
 3397 CGI
 3260 MIME-Base64
 2868 XML-Parser
 2786 Digest-MD5
 2635 DBD-Pg
 2630 MIME-tools
 2625 File-Scan
 2530 Compress-Zlib
 2236 URI
 2173 Net-DNS
 2136 Time-HiRes
 2130 Archive-Tar
 2001 Test-Simple
 1904 Tk
 1767 DateManip
 1743 Digest-HMAC
 1650 HTML-Tagset
 1629 MailTools
 1617 libnet
 1540 Gtk-Perl
 1476 DB_File
 1470 Archive-Zip
 1418 DBD-Oracle
 1400 Msql-Mysql-modules
 1286 Apache-ASP
 1286 HTML-Template
 1138 Template-Toolkit
 1134 IO-stringy
 1124 Apache-MP3
 1109 mod_perl
 1087 MD5
 1008 Storable
  998 Module-Build
  995 Crypt-CBC
  972 Net-Telnet
  952 CPAN
  918 XML-Writer
  916 Date-Calc
  908 IMAP-Admin
  900 TimeDate
  836 Convert-ASN1
  829 AppConfig
  817 IO-String
  800 GDGraph
  787 Net-SNMP
  783 MIME-Lite
  783 XML-Generator
  782 BerkeleyDB
  773 Curses
  763 AcePerl
  760 PathTools
  757 TermReadKey
  747 Crypt-SSLeay
  726 Convert-TNEF
  714 Zanas
  703 ExtUtils-MakeMaker
  691 IO-Socket-SSL
  662 HTML-Mason
  655 Test-Harness
  653 XML-Simple
  624 bioperl
  616 DBIx-SQLEngine
  608 IO-Zlib
  603 PodParser
  601 GDTextUtil
  599 PerlMagick
  597 Parallel-Pvm
  596 SOAP-Lite
  571 Authen-SASL
  557 AxKit-App-TABOO
  557 Spreadsheet-WriteExcel
  553 Bit-Vector
  553 Data-Dumper
  544 Parse-RecDescent
  542 App-Info
  533 perl
  529 DBD-ODBC
  528 Net-Server
  525 Authen-PAM
  520 Crypt-DES
  519 Config-Maker
  514 Bio-Das
  512 File-Tail
  505 Excel-Template
  502 Boulder
  502 XML-LibXML
  500 Mail-ClamAV
  498 IOC
  496 Event
  485 Apache-Session
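
The case-by-case cleanup mentioned earlier (e.g., dropping "perl" itself) could be bolted on as a small filter over the tallies before the top 100 is taken. This is only a sketch: the sub name `drop_skipped` and the contents of the skip list are assumptions, not something the script above does.

```perl
use strict;
use warnings;

# Hypothetical skip list: names that should not count as ordinary
# CPAN distributions. Extend by hand as odd cases turn up.
my %skip = map { $_ => 1 } qw(perl);

# Return a new hashref of tallies with the skipped names removed.
# The input hash is left untouched.
sub drop_skipped {
    my ($counts, $skip) = @_;
    my %kept = map  { $_ => $counts->{$_} }
               grep { !$skip->{$_} }
               keys %$counts;
    return \%kept;
}
```

Calling `drop_skipped(\%dist, \%skip)` ahead of the final sort would keep such entries from taking up a slot in the list.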

-xdg

Code posted by xdg on PerlMonks is public domain. It has no warranties, express or implied. Posted code may not have been tested. Use at your own risk.


In reply to Re: Help update the Phalanx 100 by xdg
in thread Help update the Phalanx 100 by petdance
