RFC: Array::GroupBy

After reading of bradcathey's plight in Ways to group elements of an AoH, I thought this task might be worth wrapping up in a CPAN module, so I wrote Array::GroupBy. I present it here for the consideration of the monks so that I may grow from your wisdom, creativity, and tomfoolery.

package Array::GroupBy;

$VERSION = '0.0.1';

use strict;
use warnings;

use Carp;
use List::Util qw( min max sum );
use Moose;
use Moose::Util::TypeConstraints;

# These get old value first, then added value
my %code_for;
{
    no warnings 'uninitialized';

    %code_for = (
        sum => {
            itemproc => sub { shift() + shift() }
        },
        min => {
            itemproc => sub { defined $_[0] ? min( shift, shift ) : $_
+[1] }
        },
        max => {
            itemproc => sub { defined $_[0] ? max( shift, shift ) : $_
+[1] }
        },
        avg => {
            listproc => sub { sum(@_) / scalar @_ }
        },
        count => {
            itemproc => sub { 1 + shift }
        },
    );
}

subtype 'Array::GroupBy::TC::GrouperHow'
    => as 'HashRef'
    => where {
        0 == grep {
            !defined
              || !(
                # Allow a bare code reference
                ref $_ eq 'CODE'
                # allow a hash ref with listproc/itemproc => code ref
                || (
                    ref $_ eq ref {}
                    && (   ref $_->{listproc} eq 'CODE'
                        || ref $_->{itemproc} eq 'CODE' )
                )
                # Allow the name of an item in %code_for
                || ( ref $_ eq q{} && exists $code_for{ lc $_ } )
              )
        } values %{$_};
    };

has 'group_by' => (
    is       => 'rw',
    isa      => 'ArrayRef',
    required => 1,
    trigger  => sub { croak 'empty group_by' if ! @{ $_[1] } },
);
has 'group_how' => (
    is      => 'rw',
    isa     => 'Array::GroupBy::TC::GrouperHow',
    trigger => \&_normalize_methods,
);

# This is called when group_how is set.
# It turns every value into a hash ref with listproc/itemproc
sub _normalize_methods {
    my ( $self, $group_spec ) = @_;

    foreach my $method ( values %{$group_spec} ) {
        if ( ref $method eq q{} && exists $code_for{ lc $method } ) {
            $method = $code_for{ lc $method };
        }
        elsif ( ref $method eq 'CODE' ) {
            $method = { itemproc => $method };
        }
    }

    return;
}

sub group {
    my $self = shift;

    my @group_keys = @{ $self->group_by() };

    # This is a H(oH)* where each level is a key from group_by
    # e.g., When group_by = [ 'first', 'last' ]
    # $grouped{'fred'}{'flintstone'} is a reference to a record with
    #                                those keys' aggregates
    my %grouped;

    # These are the output records.  They're also in the leaves of %gr
+ouped
    my @out;

    foreach my $item (@_) {

        # walk out to the right leaf for this record
        my $leaf = \%grouped;
        foreach my $gk (@group_keys) {
            $leaf = ( $leaf->{$item->{$gk}} ||= {} );
        }

        # if this is a new key set, put the new agg. record in @out
        if ( !keys %{$leaf} ) {
            push @out, $leaf;
        }

        $self->_summarize( $leaf, $item );
    }

    # Apply each list processor to their respective fields
    my %group_how = %{ $self->group_how() };
  FIELD:
    foreach my $field ( keys %group_how ) {
        my $listproc = $group_how{$field}->{listproc};
        next FIELD if !$listproc;
        foreach my $group (@out) {
            $group->{$field} = $listproc->( @{ $group->{$field} || [] 
+} );
        }
    }

    return @out;
}

sub _summarize {
    my ( $self, $leaf, $item ) = @_;

    my %group_how = %{ $self->group_how() };
    my %group_by = map { $_ => 1 } @{ $self->group_by() };

    # fields in either the time or the %group_how
    my @fields = keys %{{ map { $_ => 1 } keys %{$item}, keys %group_h
+ow }};

    foreach my $field ( @fields ) {
        if ( exists $group_by{$field} ) {
            $leaf->{$field} ||= $item->{$field};
        }
        elsif ( !exists $group_how{$field} ) {
            $leaf->{$field}++;
        }
        elsif ( exists $group_how{$field}->{itemproc} ) {
            my $proc = $group_how{$field}->{itemproc};
            $leaf->{$field} = $proc->( $leaf->{$field}, $item->{$field
+} );
        }
        elsif ( exists $group_how{$field}->{listproc} ) {
            push @{ $leaf->{$field} }, $item->{$field};
        }
        else {
            croak 'programmer error: impossible condition';
        }
    }

    return;
}

1;    # Magic true value required at end of module
__END__

=head1 NAME

Array::GroupBy - Group an array of records by some key (as with SQL GR
+OUP BY)

=head1 VERSION

This document describes Array::GroupBy version 0.0.1

=head1 SYNOPSIS

    use Array::GroupBy;

    my $grouper = Array::GroupBy->new(
        group_by  => [ 'key1', 'key2' ],
        group_how => {
            field1 => { listproc => \&list_aggregator },
            field2 => 'sum',
            field3 => \&item_aggregator,
            field4 => { itemproc => \&item_aggregator },
        },
    );
    my @grouped = $grouper->group( @AoH );

=head1 DESCRIPTION

Use Array::GroupBy to turn a list of records into records that have th
+eir
values aggregated according to some function, grouped by some key.
This is meant to be similar to how the
"GROUP BY" feature in SQL operates.  This documentation assumes some
familiarity with that feature.

=head1 METHODS

=head2 group

This is an instance method that takes a list of hashes and returns a
list of hashes with the field data grouped according to the configurat
+ion
in the object's attributes.

=head2 group_how

This is an accessor for the grouping definition.

With no arguments, this returns the grouping definition (a hash refere
+nce).

With a hash reference argument, this sets the grouping definition.

B<Note> that the definition you set is normalized, so it may not come 
+back
out the same way it went in.

=head3 Format

The format of C<group_how> is a hash where each key is a field name fr
+om the
input and the keys values define how that field is to be aggregated.
The aggregation value can be one of:

=over

=item A code reference

This will be called for each record in the input.
It's given two arguments, the old value of the field, and the value to
+ be
added into the aggregate.  The return value will be stored as the
aggregate value.

The aggregator should be ready to get an undef value for either argume
+nt.
The first call will always have undef as the first argument.  The seco
+nd
argument could be undef if that's the value of some field in the input
+.

As an example, this would select the longest field value:

   sub {
       no warnings 'uninitialized';

       if ( defined $_[0] && length $_[0] > length $_[1] ) {
           return $_[0];
       }
       else {
           return $_[1];
       }
   }

=item The name of a prewritten aggregator

These aggregators are already written and can be referenced by name.

=over

=item B<sum>

Sum of the input values.

=item B<min>

Minimum numeric value of the field.

=item B<max>

Maximum numeric value of the field.

=item B<avg>

The average (arithmetic mean) of input values.

=item B<count>

The number of records with the group_by key fields.

=back

=item A hash reference with a code reference

This aggregator definition looks like this:

   { listproc => \&code_reference }

This is to allow an aggregator that runs over the whole list of field 
+values
found rather than aggregating them one at a time.  In this case,
Array::GroupBy will build up an array of all the field values and pass
them to the aggregator for each group found after scanning the input r
+ecords.

You can also specify an item-by-item processor this way.

   { itemproc => \&item_processor }

=back

=head3 Defaults

The default behavior for any field that's not in the C<group_by> attri
+bute
and does not have any aggregator defined is 'count'.

Every aggregator is called for every record.  If you define an aggrega
+tor
for a field that does not appear in any record, it will get C<undef>
values on every call, and the field I<will> appear in the output recor
+ds.
This can be a convenient way of getting a "count(*)" field.

  my $grouper = Array::GroupBy->new(
      group_by  => [ 'name' ],
      group_how => { counter => 'count' }
  );

That aggregator would work similarly to this SQL query:

  SELECT name,count(*) AS counter FROM t GROUP BY name;

=head2 group_by

This is an accessor for the list of keys used for grouping.

With no arguments, this returns a reference to an array of keys.

Call it with a reference to an array of strings to set the attribute.

=head2 meta

This is a method added by Moose which provides access
to the current class's metaclass.

=head1 DIAGNOSTICS

=over

=item empty group_by

This is the error you get if you try to set the group_by attribute
to an empty array.

=item Attribute (I<x>) does not pass the type constraint

This error is generated by Moose, so it may vary with different versio
+ns
of Moose.  This is the error you get when you try to set attribute I<x
+>
to a value that's not allowed by the type constraints on it.

=back

=head1 DEPENDENCIES

=over

=item Moose

I used Moose 0.17 when writing this.  I expect any later version shoul
+d work.

=back

Every other module I used is a core module.

=head1 BUGS AND LIMITATIONS

No bugs have been reported.

Please report any bugs or feature requests to
C<bug-array-groupby@rt.cpan.org>, or through the web interface at
L<http://rt.cpan.org>.

=over

=item There's no way to aggregate one field multiple ways

For example, if you want the min I<and> max of some field, the interfa
+ce
doesn't allow a simple way to get them both.

=item If you catch the exception thrown by a mutator, the object may b
+e in
an odd state.

I'd expect attributes to hold their previous values when setting fails
+ this
way, but apparently they don't.  This behavior is dependent on Moose, 
+so
maybe it's different in later versions.

=back

=head1 AUTHOR

Kyle Hasselbacher  C<< <kyleha@gmail.com> >>

=head1 LICENCE AND COPYRIGHT

Copyright (c) 2009, Kyle Hasselbacher C<< <kyleha@gmail.com> >>.
All rights reserved.

This module is free software; you can redistribute it and/or
modify it under the same terms as Perl itself. See L<perlartistic>.

=head1 DISCLAIMER OF WARRANTY

BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WH
+EN
OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
PROVIDE THE SOFTWARE "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER
EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. TH
+E
ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH
YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL
NECESSARY SERVICING, REPAIR, OR CORRECTION.

IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE
LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL,
OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE
THE SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING
RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A
FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF
SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
SUCH DAMAGES.
[download]

Things that particularly concern me:

The name. Having written Array::GroupBy, I think it should be called Array::Grouper. Then again, maybe "array" isn't right since it operates more specifically on an AoH, and maybe the monks have an even better idea.
The documentation. In my brief search for a way to describe what GROUP BY does, one of the better "descriptions" said it's hard to describe and it's best to learn by example. I punted on that, and when writing the docs I often got the feeling I wasn't describing things very well. Any direction here would be appreciated.
My Moose use. This is the first thing I've done with Moose besides play.

Thank you for your thoughts!

Comment on RFC: Array::GroupBy Select or Download Code

Replies are listed 'Best First'.

Re: RFC: Array::GroupBy
by jdporter (Paladin) on Jan 06, 2009 at 04:37 UTC

Or maybe just use DBI with DBD::Sponge - "Create a DBI statement handle from Perl data".
That uses SQL::Statement::Syntax, which supports GROUP BY.

[reply]
[d/l]

Re^2: RFC: Array::GroupBy (DBD::Sponge)

by kyle (Abbot) on Jan 06, 2009 at 21:19 UTC

Can you show an example of how you'd use those modules to do what mine does? From the documentation I'm looking at, it appears DBD::Sponge is used for making some data appear to be from some SQL, not for doing any kind of transformation.

use DBI;

my $sponge = DBI->connect( 'dbi:Sponge:', '', '', { RaiseError => 1 } 
+);
my $sth = $sponge->prepare( 'SELECT name,count(*) FROM ??? GROUP BY na
+me',
                            { NAME => [ 'name', 'count' ],
                              rows => [ [ 'fred    ', 10 ],
                                        [ 'john    ', 20 ],
                                        [ 'jdporter', 30 ],
                                        [ 'jdporter', 40 ], ] } );
while ( my @row = $sth->fetchrow_array ) {
    print join ', ', @row;
    print "\n";
}
__END__
fred    , 10
john    , 20
jdporter, 30
jdporter, 40
[download]

[reply]
[d/l]

Re^3: RFC: Array::GroupBy (DBD::Sponge)

by jdporter (Paladin) on Jan 06, 2009 at 22:07 UTC

You're right, my bad. Despite allegedly supporting SQL::Statement::Syntax, in fact DBD::Sponge essentially ignores the statement and does a plain ol' SELECT *.... and the only use I can envision for that is as a bridge between your perl data and some black-box module which requires a statement handle.

The documentation of DBD::Sponge's prepare() is pretty unclear on this issue:

The $statement here is an arbitrary statement or name you want to provide as identity of your data. If you're using DBI::Profile it will appear in the profile data.

Generally it's expected that you are preparing a statement handle as if a select statement happened.

Anyway, sorry for the red herring.

Update: I checked into a couple of DBD modules which allow to access in-memory perl data structures as sources for DBI — DBD::AnyData and DBD::RAM (both, like SQL::Statement and other parts of the DBIverse, by jZed). Alas, neither supports GROUP BY!

Between the mind which plans and the hands which build, there must be a mediator... and this mediator must be the heart.

[reply]
[d/l]
[select]

Re^4: RFC: Array::GroupBy (DBD::Sponge)

by Porculus (Hermit) on Jan 09, 2009 at 23:45 UTC

Re: RFC: Array::GroupBy (Synopsis)
by lodin (Hermit) on Jan 06, 2009 at 10:58 UTC

It looks useful, but after glancing through your module I still don't know how to use it. I don't know how SQL GROUP BY works, and you say that's assumed, so I can't really complain. :-)

However, I think it's a good idea to have a full example in the synopsis. With full example I mean that it should compile and run, and some sensible output/results should also be included when applicable. Usually I get a good idea of what the code does and how I can use it by just looking at the input/transform/output triplet.

This is particularly important for this kind of module that transforms one data structure to another. You show the transform, but you don't show what goes in and what comes out. What is @AOH, and how does "key1" and "field1" relate to @AOH?

While the synopsis should just highlight what the module does, I often appreciate one or two slightly more elaborate (and possibly realistic) examples of the code at the end (under e.g. =head1 EXAMPLES).

Returning to including output in the synopsis; I often use __END__ to separate the code from the output (see examples at CPAN). I've written a test module for that which I've pondered for release. Basically, the synopsis becomes a part of the test suite. This makes sure that there are no typos in the synopsis and that it is up to date. If anyone shows any interest I might write up an RFC.

lodin

[reply]
[d/l]
[select]

Re^2: RFC: Array::GroupBy (Synopsis)

by kyle (Abbot) on Jan 06, 2009 at 16:30 UTC

I think an examples section is a good idea, but I wouldn't want a full working example in the synopsis. The data structures it's working on tend to be large on the page, and I want to keep the synopsis short. For example, here's the shortest of the functional test cases I have so far:

Read more... (1331 Bytes)

[reply]
[d/l]

Re: RFC: Array::GroupBy
by grinder (Bishop) on Jan 06, 2009 at 13:39 UTC

Interesting. What does using Moose buy you in this particular case? Or to put it another way, how much make-work code would you have to add if you didn't use it?

This is more than an idle question, because if I really needed your module for the very specific problem it solves, I think I'd be a bit upset if I discovered that it pulled in a comparatively huge OO framework. This is in no way meant as disrepect to Moose. It's more a question of ratios: the Moose tarball is 250Kb, not counting its twenty or so dependencies. Sledgehammer, meet egg.

Would it take twenty lines to implement this as a dependency-free module? A hundred lines? Would it be painful? I guess it comes down to whether one considers Moose as a reasonable building block for small, special-purpose modules.

• another intruder with the mooring in the heart of the Perl

[reply]

Re^2: RFC: Array::GroupBy

by jdporter (Paladin) on Jan 06, 2009 at 16:05 UTC

Sledgehammer, meet egg.

Ok.... But how many more such eggs will you need to see before you decide that this is one sledgehammer worth having in your toolbox?

[reply]

Re^3: RFC: Array::GroupBy

by grinder (Bishop) on Jan 06, 2009 at 16:40 UTC

Why, a dozen of course!

less flippant update: whether I think Moose is worthy of my toolbox (and I do) is beside the point. The question is whether you can yet decide that it is for everyone else.

• another intruder with the mooring in the heart of the Perl

[reply]

Re^2: RFC: Array::GroupBy (Moose)

by kyle (Abbot) on Jan 06, 2009 at 16:24 UTC

I'll start by saying that I expect Moose to be ubiquitous. If it's not now, it will be. Also, anyone installing something as obscure as Array::GroupBy is sure to have something as mainstream as Moose.

Anyway, what I got in this particular case was mostly Moose::Util::TypeConstraints. At object creation, and any time an attribute is set, the attributes are checked against the constraints I set. If I weren't using Moose, I'd be using Params::Validate. Aside from that, I get the usual OO framework stuff: a constructor and attribute accessors.

The make-work code I'm saved from writing is, again, mostly the input validation stuff. My version of Moose is a ways behind the current, so I wound up doing some of this myself in order to get the detail I wanted. I found it pretty tedious, and the results challenge comprehension. With a newer version of Moose, this was easier to do and easier to read. Without Moose (or some module), it could have been more work than writing the actual functionality of the module.

[reply]

Re^3: RFC: Array::GroupBy (Moose)

by grinder (Bishop) on Jan 06, 2009 at 16:59 UTC

You know, I'm happy to point Moose out to non-Perl hackers, it usually shuts them up about how much of a mess Perl's OO is. They tend to glance at an example and say "Oh this is nice! This actually makes sense". On the other hand "anyone installing something as obscure as Array::GroupBy is sure to have something as mainstream as Moose" sounds a lot like wishful thinking.

If you do release this to CPAN, I think Params::Validate as a prerequisite is a much more viable proposition. The fact that there's a very useful analog over in Moose land makes me think of someone saying "Hmm. I need to put my cup down somewhere. I know, I'll go out to my car and use the cup-holder."

• another intruder with the mooring in the heart of the Perl

[reply]

Re^4: RFC: Array::GroupBy (Moose)

by duckyd (Hermit) on Jan 07, 2009 at 00:49 UTC

Re^5: RFC: Array::GroupBy (Moose)

by kyle (Abbot) on Jan 07, 2009 at 02:20 UTC

Some notes below your chosen depth have not been shown here

Re^5: RFC: Array::GroupBy (Moose)

by Sartak (Hermit) on Jan 08, 2009 at 17:27 UTC

Re^3: RFC: Array::GroupBy (Moose)

by salva (Canon) on Jan 08, 2009 at 18:44 UTC

I'll start by saying that I expect Moose to be ubiquitous. If it's not now, it will be.

$ time perl -e 'package foo; use Moose'
real    0m1.052s
user    0m0.880s
sys    0m0.044s

$ time perl -e 'package foo'
real    0m0.008s
user    0m0.000s
sys    0m0.008s
[download]

BTW...

$ time perl -e 'package foo; use Params::Validate'
real    0m0.066s
user    0m0.028s
sys    0m0.008s
[download]

[reply]
[d/l]
[select]

Re^4: RFC: Array::GroupBy (Moose)

by stvn (Monsignor) on Jan 09, 2009 at 21:38 UTC

Re: RFC: Array::GroupBy
by Jenda (Abbot) on Jan 06, 2009 at 16:26 UTC

I do not see a reason to disallow empty group_by.

What about passing the whole records to the custom (code reference) agregators for some specially named group_how options? Eg.

 ...
 group_how => {
   '&foo' => sub {
    my ( $result, $rec) = @_;
    return $result + $rec->{a} * $rec->{b}
   },
 ...
[download]

Jenda
Support Denmark!
Defend the free world!

[reply]
[d/l]

Re: RFC: Array::GroupBy
by jdporter (Paladin) on Jan 07, 2009 at 16:14 UTC

Not to belittle your efforts in any way, but IMHO your solution is far too complicated and hard to comprehend, and I believe this is because it tries to do too much. IMHO, the grouping function should be separate from any processing/analysis of the groups. If you simply return the groups to the caller, she can do whatever processing on them in her own way.

Here is how I would probably approach the problem:

sub find_groups(&\@)
{
    my( $keymap, $rows_ar ) = @_;
    my %groups;
    for ( @$rows_ar )
    {
        push @{ $groups{ &$keymap } }, $_;
    }
    \%groups
}
[download]

count(*)

my $groups_hr = find_groups { $_->{'name'} } @rows;

for my $k ( sort keys %$groups_hr )
{
    print "$k\t" . @{$groups_hr->{$k}} . "\n";
}
[download]

find_groups { join $;, @{$_}{qw( name game score )} } @rows;
[download]

find_groups {
    $_->{'name'} eq 'foo' ? 1 :
    $_->{'game'} =~ /bar/ ? 2 :
    $_->{'score'} >= 500  ? 3 : 0
} @rows;
[download]

use List::Util qw( sum );

my $groups_hr = find_groups { $_->{'name'} } @rows;

for my $k ( sort keys %$groups_hr )
{
    my $avg = sum( map { $_->{'score'} } @{$groups_hr->{$k}} ) / @{$gr
+oups_hr->{$k}};
    print "$k\t$avg\n";
}
[download]

    map { $_->{'score'} } @rows
[download]

Between the mind which plans and the hands which build, there must be a mediator... and this mediator must be the heart.

[reply]
[d/l]
[select]

Re^2: RFC: Array::GroupBy

by Porculus (Hermit) on Jan 09, 2009 at 23:56 UTC

Not being perlish is surely the whole point! This module seems to be aimed at programmers who are specifically looking for SQLish behaviour. Which is exactly what it provides.

Related to which, I'd say definitely keep the GroupBy name: it serves as a nice warning to SQL-haters, and a nice beacon to SQL-lovers, as to the type of solution this module offers. :)

[reply]


Perl Monk, Perl Meditation
	PerlMonks