
From test harness to CPAN

by dwm042 (Priest)
on Apr 16, 2011 at 17:44 UTC ( [id://899752] : perlquestion )

dwm042 has asked for the wisdom of the Perl Monks concerning the following question:

I started blogging about sports recently, and while doing so I've been running into algorithms and procedures in the area of sports analytics (what used to be known as sabermetrics). Sports analytics received a huge shot in the arm with the publication of Michael Lewis's book "Moneyball", and these days, according to Mark Cuban, perhaps 2/3 of all US basketball teams have an analytics team. That said, some of us like to do this on an amateur basis. One important algorithm is Doug Drinen's simple rankings system. I've implemented this one five different ways now. A particularly clean version can be written in PDL.
sub srs_full_matrix {
    my $mov    = shift;
    my $played = shift;
    my $opt    = shift || {};    # options hash ref; defaults to empty

    my $epsilon = 0.001;
    my $maxiter = 1000000;
    my $debug   = 0;
    $epsilon = $opt->{epsilon} if ( $opt->{epsilon} );
    $maxiter = $opt->{maxiter} if ( $opt->{maxiter} );
    $debug   = $opt->{debug}   if ( $opt->{debug} );

    my $srs    = $mov->copy();
    my $oldsrs = $srs->copy();
    my $delta  = 10.0;
    my $iter   = 0;

    # Iterate until the largest rating change falls below epsilon.
    while ( $delta > $epsilon and $iter < $maxiter ) {
        my $wt   = 1.0 / sumover $played;
        my $prod = $srs x $played;
        my $sos  = $wt * $prod;
        $srs    = $mov + $sos;
        $delta  = max abs ( $srs - $oldsrs );
        $oldsrs = $srs->copy();
        $iter++;
    }

    print "iter = $iter\n" if $debug;
    print "epsilon = $epsilon\n" if $debug;
    printf "delta = %7.4f\n", $delta if $debug;

    # Normalize so the ratings average to zero.
    my $offset = sum $srs;
    $srs -= ( $offset / $mov->nelem );
    return $srs->slice(":,(0)");
}
I have this function wrapped in Module::Starter with a working test harness now, and getting something like this into CPAN is the reason I just asked for a PAUSE account. I just don't see this code as CPAN-ready yet.
1. There needs to be a non-pdl version.
2. The non-PDL version needs to accept very simple input and give a simple answer; input as an array of games, output as a hash of team names mapped to SRS values.
3. Error checking on input.
4. helper functions? yes or no? Is something like this okay packaged with the PDL version?
sub load_srs_data {
    my $games = shift;
    my %team;

    # Each game is a "visitor,visitor_score,home,home_score" string.
    for ( @$games ) {
        my ( $visitor, $visit_score, $home_team, $home_score ) =
            split "\,", $_;
        my $diff = $home_score - $visit_score;
        $team{$visitor}{games_played}++;
        $team{$home_team}{games_played}++;
        $team{$visitor}{points}   -= $diff;
        $team{$home_team}{points} += $diff;
        push @{$team{$visitor}{played}},   $home_team;
        push @{$team{$home_team}{played}}, $visitor;
    }

    my $total_team = scalar keys %team;
    my $played = zeroes $total_team, $total_team;
    my $mov    = zeroes $total_team;
    my %team_map;
    my $ii = 0;

    # Average margin of victory per team, indexed in sorted name order.
    for ( sort keys %team ) {
        my $team_diff = $team{$_}{points} / $team{$_}{games_played};
        $team_map{$_} = $ii;
        $mov->set( $ii, $team_diff );
        $ii++;
    }

    # Count how often each pair of teams met.
    for ( keys %team ) {
        my $i = $team_map{$_};
        for my $opp (@{$team{$_}{played}}) {
            my $j = $team_map{$opp};
            my $a = $played->at ( $i, $j );
            $a++;
            $played->set( $i, $j, $a );
        }
    }
    return \%team, \%team_map, $mov, $played;
}
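For items 1 and 2, a non-PDL version might look roughly like the sketch below. It assumes the same "visitor,score,home,score" record format as load_srs_data and repeats the same update (rating = average margin + average opponent rating) with plain hashes. The name simple_ranking and its option handling are hypothetical, not part of any existing module.

```perl
use strict;
use warnings;

# Hypothetical pure-Perl sketch: array ref of game strings in,
# hash ref of team name => rating out.
sub simple_ranking {
    my ( $games, %opt ) = @_;
    my $epsilon = $opt{epsilon} || 0.001;
    my $maxiter = $opt{maxiter} || 10_000;

    my ( %points, %opponents );
    for my $game (@$games) {
        my ( $visitor, $visit_score, $home, $home_score ) = split /,/, $game;
        my $diff = $home_score - $visit_score;
        $points{$home}    += $diff;
        $points{$visitor} -= $diff;
        push @{ $opponents{$home} },    $visitor;
        push @{ $opponents{$visitor} }, $home;
    }

    # Average margin of victory per team is the starting rating.
    my %mov = map { $_ => $points{$_} / scalar @{ $opponents{$_} } }
        keys %points;
    my %srs = %mov;

    for my $iter ( 1 .. $maxiter ) {
        my %next;
        my $delta = 0;
        for my $team ( keys %srs ) {
            # Strength of schedule: mean rating of the opponents faced.
            my $sos = 0;
            $sos += $srs{$_} for @{ $opponents{$team} };
            $sos /= scalar @{ $opponents{$team} };
            $next{$team} = $mov{$team} + $sos;
            my $change = abs( $next{$team} - $srs{$team} );
            $delta = $change if $change > $delta;
        }
        %srs = %next;
        last if $delta <= $epsilon;
    }

    # Shift the ratings so they average to zero.
    my $mean = 0;
    $mean += $_ for values %srs;
    $mean /= scalar keys %srs;
    $srs{$_} -= $mean for keys %srs;
    return \%srs;
}

my $rating = simple_ranking( [ 'B,10,A,20', 'C,7,B,14', 'A,21,C,14' ] );
printf "%-3s %6.2f\n", $_, $rating->{$_} for sort keys %$rating;
```

One caveat with this plain iteration: it may never settle when the schedule splits into two groups that only play each other (two teams meeting once is the smallest case), so the maxiter guard matters.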
What I'm trying to do is some "design on paper" before I upload..
and yes..
5. Name space. I was using Sports::Analytics::SRS but a simple ranking function isn't the "System". SimpleRanking spelled out with Analytics or SportsAnalytics as a predecessor seems more appropriate.
Sports isn't an initial name anywhere in CPAN that I can see.

Algorithm::SportsAnalytics::SimpleRanking, perhaps, with Algorithm::SportsAnalytics::SimpleRanking::PDL for a PDL implementation?

Replies are listed 'Best First'.
Re: From test harness to CPAN
by GrandFather (Saint) on Apr 16, 2011 at 21:10 UTC

    Taking things in backwards order:

    • 5/ What did you search for when you checked CPAN for prior art?

      Ask yourself what someone interested in this stuff is most likely to search for and choose a name space and a name that is a good fit for that search. Also make sure that the key words appear in the POD.

      So far as I can find, at present Sport doesn't feature on CPAN (seems a large omission), nor does any sport related analytics.

    • 4/ Absolutely use helper functions, but see comments later.
    • 3/ Absolutely use error checking on input, and anywhere else that will catch problems early rather than late. See comments below.
    • 2/ See comments below.
    • 1/ See comments below.

    After a quick glance at your code there are a few things I'd change. First off, make it OO. That way you don't need to pass back a mess of data that your client is just going to have to pass back in elsewhere. It also makes dealing with PDL/non-PDL handling tidier.

    Handle errors using exceptions (die and eval). That makes dealing with errors in an appropriate place much easier for your code and for the caller's code.
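
    As a rough illustration of the die/eval approach applied to input checking, the sketch below validates one game record and lets the caller decide what to do with a failure. The helper name parse_game and the rule that scores are non-negative integers are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: parse one "visitor,score,home,score" record,
# throwing an exception (die) on malformed input.
sub parse_game {
    my ($line) = @_;
    die "Game record is undefined\n" unless defined $line;

    my @field = split /,/, $line;
    die "Expected 4 fields, got " . scalar(@field) . " in '$line'\n"
        unless @field == 4;

    my ( $visitor, $visit_score, $home, $home_score ) = @field;
    for my $score ( $visit_score, $home_score ) {
        # Assumption for this sketch: scores are non-negative integers.
        die "Score '$score' is not a non-negative integer in '$line'\n"
            unless $score =~ /^\d+$/;
    }

    return {
        visitor     => $visitor,
        visit_score => $visit_score,
        home        => $home,
        home_score  => $home_score,
    };
}

# The caller traps the exception where it can do something useful with it.
for my $line ( 'Eagles,17,Giants,24', 'Eagles,17,Giants' ) {
    my $game = eval { parse_game($line) };
    if ($game) {
        print "Parsed: $game->{visitor} at $game->{home}\n";
    }
    else {
        print "Rejected: $@";
    }
}
```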

    With OO code helper functions are just something provided by the object.

    A partial implementation of your code as OO might look like:

    use strict;
    use warnings;

    package Sport::Analytics::SRS;

    return 1;    # module returns true for successful load

    sub new {
        my ($class, %params) = @_;

        # parameter validation here. die on failure.

        # Test for PDL
        $params{havePDL} = eval {
            require PDL;
            PDL->import ();
            1;
        };

        return bless \%params, $class;
    }

    sub srs_full_matrix {
        my ($self, %options) = @_;

        die "The module PDL must be available for srs_full_matrix use."
            if !$self->{havePDL};

        $options{epsilon} ||= 0.001;
        $options{maxiter} ||= 1000000;

        $self->{srs} = $self->{mov}->copy ();
        my $oldsrs = $self->{srs}->copy ();
        my $delta  = 10.0;
        my $iter   = 0;

        while ($delta > $options{epsilon} and $iter < $options{maxiter}) {
            my $wt   = 1.0 / sumover ($self->{played});
            my $prod = $self->{srs} x $self->{played};
            my $sos  = $wt * $prod;

            $self->{srs} = $self->{mov} + $sos;
            $delta  = max (abs ($self->{srs} - $oldsrs));
            $oldsrs = $self->{srs}->copy ();
            $iter++;
        }

        print "iter = $iter\n" if $options{debug};
        print "epsilon = $options{epsilon}\n" if $options{debug};
        printf "delta = %7.4f\n", $delta if $options{debug};

        my $offset = sum ($self->{srs});
        $self->{srs} -= ($offset / $self->{mov}->nelem);
        return $self->{srs}->slice (":,(0)");
    }

    sub loadData {
        my ($self, $games) = @_;

        for (@$games) {
            my ($visitor, $visit_score, $home_team, $home_score) =
                split "\,", $_;
            my $diff = $home_score - $visit_score;

            $self->{team}{$visitor}{games_played}++;
            $self->{team}{$home_team}{games_played}++;
            $self->{team}{$visitor}{points}   -= $diff;
            $self->{team}{$home_team}{points} += $diff;
            push @{$self->{team}{$visitor}{played}},   $home_team;
            push @{$self->{team}{$home_team}{played}}, $visitor;
        }

        my $total_team = scalar keys %{$self->{team}};

        $self->{played} = zeroes ($total_team, $total_team);
        $self->{mov}    = zeroes ($total_team);

        my $ii = 0;
        for (sort keys %{$self->{team}}) {
            my $team_diff =
                $self->{team}{$_}{points} / $self->{team}{$_}{games_played};

            $self->{team_map}{$_} = $ii;
            $self->{mov}->set ($ii, $team_diff);
            $ii++;
        }

        for (keys %{$self->{team}}) {
            my $i = $self->{team_map}{$_};

            for my $opp (@{$self->{team}{$_}{played}}) {
                my $j = $self->{team_map}{$opp};
                my $a = $self->{played}->at ($i, $j);

                $a++;
                $self->{played}->set ($i, $j, $a);
            }
        }
    }

    so the module might be used like:

    use Sport::Analytics::SRS;

    eval {
        my $analysis = Sport::Analytics::SRS->new ();
        $analysis->loadData ();
        $analysis->srs_full_matrix ();
        ...
    } or do {
        die "Analysis failed: $@\n";
    };
    True laziness is hard work

      There are some clever ideas in your post and I buy the idea that an object would make things easier. In a PDL implementation there will be a temptation to add features, because once you do the work of loadData, you'll want to have access to "more stuff". That's item one.

      Item two: I'm testing your idea, and the moment I start using the "pdl" command, the Perl interpreter doesn't appear to take well to the 'load it if we need it' idea.

      I'm thinking two objects, a simpler one for non-PDL use and a more complicated one (that could do more kinds of rankings) for PDL use. But those are my current thoughts.

      Using Sport as a base (the singular word as opposed to plural) is interesting as then people could write analysis objects for a particular sport.. Sport::Cricket or Sport::Baseball, etc.


        Most likely the 'load it if we need it' failed for a second object because PDL was already loaded. I was thinking more in terms of "use it if it's available" in any case. The following code implements that:

        use strict;
        use warnings;

        package TestPDL;

        # Try to load PDL once, when the package is first compiled and run.
        my $PDLLoaded = eval {
            require PDL;
            PDL->import ();
            1;
        };

        sub new {
            my ($class, %params) = @_;

            $params{havePDL} = $PDLLoaded;
            return bless \%params, $class;
        }

        package main;

        my $obj  = TestPDL->new ();
        my $obj2 = TestPDL->new ();

        print $obj2->{havePDL}
            ? "Have PDL available"
            : "PDL not present or object create failed";

        I'd write one basic object that does common stuff then derive from that to specialise for additional capabilities. Following that thought, I'd arrange the name space as Sport::Analytics::SRS for the basic module then Sport::Analytics::SRS::Cricket for the Cricket specialisation.

        This technique allows a PDL only implementation for tricky or computationally expensive stuff with a pure Perl implementation provided either in the module or in a derived class at some point in the future if there is a need for it.
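
        A minimal sketch of that base-plus-derived layout follows, with both packages in one file for brevity. The package names are the ones suggested above; the margin method and the cap option are purely hypothetical stand-ins for "common stuff" and a sport-specific override.

```perl
use strict;
use warnings;

package Sport::Analytics::SRS;    # base class: behaviour shared by all sports

sub new {
    my ( $class, %params ) = @_;
    return bless {%params}, $class;
}

# Hypothetical common method: how one game contributes to the margin.
sub margin {
    my ( $self, $home_score, $visit_score ) = @_;
    return $home_score - $visit_score;
}

package Sport::Analytics::SRS::Cricket;    # specialisation for one sport

# -norequire because the base class lives in this same file.
use parent -norequire, 'Sport::Analytics::SRS';

# Hypothetical override: cap the margin so blowouts count for less.
sub margin {
    my ( $self, $home_score, $visit_score ) = @_;
    my $raw = $self->SUPER::margin( $home_score, $visit_score );
    my $cap = $self->{cap} || 100;
    return $raw > $cap ? $cap : $raw < -$cap ? -$cap : $raw;
}

package main;

my $base    = Sport::Analytics::SRS->new;
my $cricket = Sport::Analytics::SRS::Cricket->new( cap => 50 );

print "base margin:    ", $base->margin( 300, 180 ),    "\n";
print "cricket margin: ", $cricket->margin( 300, 180 ), "\n";
```

        A PDL-backed class could be derived the same way, overriding only the computationally heavy method.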

        True laziness is hard work
Re: From test harness to CPAN
by anonymized user 468275 (Curate) on Apr 18, 2011 at 14:16 UTC
    I think it would be more useful to raise the level of abstraction of the data beyond such a fixed usage. For example, the module could have a default config file with the possibility to specify an override file in the new method. The config file would specify all possible column ids each having characterisation columns such as action-to-be-performed, whether derived, whether input and whether included in output. Furthermore, actions for a column could be defined or overridden by including a sub-hash of code references in the instance variable that can also be modified via the "new" method.
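
    One way to read the last suggestion is sketched below: a table of per-column code references with defaults, which the caller can override through new. The package name ColumnDemo, the column names, and the derive method are hypothetical, and the config-file part is left out.

```perl
use strict;
use warnings;

package ColumnDemo;    # hypothetical name, for illustration only

# Default action per derived column; each takes one game record (hash ref).
my %default_action = (
    margin => sub { $_[0]{home_score} - $_[0]{visit_score} },
    total  => sub { $_[0]{home_score} + $_[0]{visit_score} },
);

sub new {
    my ( $class, %params ) = @_;

    # Caller-supplied code refs replace or extend the defaults.
    my %action = ( %default_action, %{ $params{actions} || {} } );
    return bless { actions => \%action }, $class;
}

sub derive {
    my ( $self, $column, $game ) = @_;
    my $code = $self->{actions}{$column}
        or die "No action defined for column '$column'\n";
    return $code->($game);
}

package main;

my $game = { home_score => 24, visit_score => 17 };

my $plain  = ColumnDemo->new;
my $winner = ColumnDemo->new(
    actions => {
        # Override: record only who won, not by how much.
        margin => sub { $_[0]{home_score} <=> $_[0]{visit_score} },
    },
);

print "default margin:    ", $plain->derive( margin => $game ),  "\n";
print "overridden margin: ", $winner->derive( margin => $game ), "\n";
```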

    One world, one people

Re: From test harness to CPAN
by jgamble (Pilgrim) on Apr 19, 2011 at 22:13 UTC

    I'm not convinced that sabermetrics is a supplanted terminology -- the links that you provided actually make use of the term, although since its origins are in baseball that may explain why other sports want a different terminology. And, of course, there already exists Baseball-Sabermetrics.

    Having essentially two top-level names (Algorithm::SportsAnalytics::TheModuleName) strikes me as wasteful in space. I'd go to the module-authors list and suggest a single top name like SportsAnalytics and see how it flies.

    If SportsAnalytics doesn't work, then maybe an offshoot of Statistics (e.g., Statistics::Baseball::Analytics)?

      Presumably some of these algorithms would be useful for things that aren't sports - competitive pigeon breeding, for example, or board games.
      Just as an FYI, I'm asking for Sport as a top name. The Namespace request form doesn't give me great hope that I'll be successful. Getting from my code above, parsed through GrandFather's suggestions, to something I could post and think was reasonable was an interesting journey.