PerlMonks  

distr - show distribution of column values

by Corion (Patriarch)
on May 23, 2008 at 13:45 UTC ( [id://688145]=sourcecode )
Category: Utility Scripts
Author/Contact Info Corion
Description:

This program returns a quick tally of the different values in a column. My primary use for it is to find the most common date value in a file, so that I can rename the file after that date. It is also very convenient for getting a quick overview of the distribution of lengths, especially for numbers.

Currently, I'm "confident" that I'm picking the right value as the most common one if it occurs in more than 60% of the rows of the sample I'm taking. This has proven sufficient, but an estimator that determined the size of the sample, or expanded the sample as long as there was not enough confidence in the mode (the "modus"), would be better.

#!/usr/bin/perl -w
use strict;
use Getopt::Long;

GetOptions(
    'lines|n:i' => \my $lines,
    'column|c:i' => \my $column,
    'sep|s:s' => \my $separator,
    'transform|f:s' => \my $transform,
    'max|m' => \my $maximum_only,
);
$lines ||= 10000;
$column ||= 1;
$separator ||= ";"; # should come from the input, but...

$column--; # adjust from human to Perl

my %vals;
my @F;
my $line=0;

sub transform{ $_[0] };

if ($transform) {
    no warnings 'redefine';
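    eval <<CODE;
sub transform { $transform(\$_[0]) };
CODE
    # report a transform that does not compile instead of silently ignoring it
    die "Couldn't compile transform '$transform': $@" if $@;
};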

FILE: for my $file (@ARGV) {
    my $fh;
    if ($file =~ /\.gz|\.ebcdic/) {
        open $fh, '-|', 'gzcat', $file
            or die "Couldn't open '$file': $!";
    } else {
        open $fh, '<', $file
            or die "Couldn't open '$file': $!";
    };
    while (<$fh>) {
        chomp; # otherwise the last column carries the newline
        @F = split /$separator/o;
        $vals{ transform($F[ $column ])}++;

        last FILE if ++$line >= $lines; # stop after exactly $lines rows
    };
};

for (sort { $vals{$b} <=> $vals{$a} } keys %vals) {
    if ($maximum_only) {
        # $line is the number of rows actually read, which may be fewer than $lines
        if ($vals{$_} / $line > 0.6) {
            print "$_\n";
            last
        } else {
            die "No confidence in '$_': Only $vals{$_} out of $line values match\n";
        };
    } else {
        print "$_: $vals{$_}\n";
    };
};
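
The description mentions that an estimator which expands the sample until there is enough confidence in the mode would be better than a fixed sample. The following is only a rough sketch of that idea and is not part of the script above. The helper name `confident_mode`, its `batch`, `threshold` and `max_rows` options, and the choice of a normal-approximation lower bound at roughly 95% are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): grow the sample in
# batches until the most frequent value clearly exceeds the threshold.
# $next_value is a code ref returning the next column value, or undef at EOF.
sub confident_mode {
    my ($next_value, %opt) = @_;
    my $batch     = $opt{batch}     || 1000;     # rows added per round (assumed default)
    my $threshold = $opt{threshold} || 0.6;      # same 60% rule as the script
    my $max_rows  = $opt{max_rows}  || 100_000;  # hard cap on the sample (assumed default)

    my %count;
    my $rows = 0;
    while ($rows < $max_rows) {
        # Pull in one more batch of values
        my $read = 0;
        while ($read < $batch) {
            my $v = $next_value->();
            last unless defined $v;
            $count{$v}++;
            $read++;
        }
        $rows += $read;
        last unless $rows;    # empty input: nothing to report

        # Current candidate for the mode (ties broken alphabetically)
        my ($best) = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
        my $p = $count{$best} / $rows;

        if ($read < $batch) {
            # Input exhausted: the observed share is all we will ever know
            return $p > $threshold ? ($best, $rows) : (undef, $rows);
        }

        # Lower end of a rough 95% interval (normal approximation)
        my $lower = $p - 1.96 * sqrt($p * (1 - $p) / $rows);
        return ($best, $rows) if $lower > $threshold;
    }
    return (undef, $rows);    # cap reached without enough confidence
}

# Usage with made-up data: nine out of every ten rows carry the same date
my @sample = (('2008-05-23') x 9, '2008-05-22') x 10;
my ($mode, $seen) = confident_mode(sub { shift @sample }, batch => 50);
print defined $mode ? "$mode (after $seen rows)\n" : "no confident mode\n";
```

In the script itself, the code ref would read the next line from the current file handle and return the transformed column value. A clear majority then stops early, while a close call keeps reading up to the cap.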
