Many machine learning and data analysis tasks involve calculating distances between items. The Mahalanobis distance is a popular choice because it is scale invariant and takes the correlations between variables into account.
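For reference, the usual textbook definition can be written as follows. The symbols are my own notation and do not come from the snippet: x is the point, \mu is the mean of the distribution, and \Sigma is its covariance matrix.

```latex
d_M(x, \mu) = \sqrt{(x - \mu)^{\top} \, \Sigma^{-1} \, (x - \mu)}
```

When \Sigma is the identity matrix, this reduces to the ordinary Euclidean distance. To measure the distance between two points x and y drawn from the same distribution, replace \mu with y.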

In this snippet, I show how to compute the Mahalanobis distance using the Perl Data Language (PDL). The function takes two or three piddles (see the comment below for a definition). The first piddle is a p-dimensional vector. The second piddle is either a p-dimensional vector (which requires the third input) or a matrix with N rows of p-dimensional vectors. If the second piddle is a matrix, the distance is computed between the first piddle and the center (mean) of the second. The third piddle, which is optional, is the covariance matrix of the distribution from which the other two piddles were drawn; if it is omitted, the covariance is estimated from the second piddle, which must then be a matrix. Note: to compute the covariance matrix, I use the snippet presented in Computing Covariance Matrices with PDL; the covariance() function called in the code below comes from that snippet.

What are Piddles?

They are a new data structure defined in the Perl Data Language. As indicated in RFC: Getting Started with PDL (the Perl Data Language):

Piddles are numerical arrays stored in column major order (meaning that the fastest varying dimension represents the columns, following computational convention, rather than the rows as mathematicians prefer). Even though piddles look like Perl arrays, they are not. Unlike Perl arrays, piddles are stored in consecutive memory locations, which makes it easy to pass them to the C and FORTRAN code that handles the element by element arithmetic. One more thing to note about piddles is that they are referenced with a leading $.
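To make the dimension ordering concrete, here is a minimal sketch, assuming PDL is installed. The variable names and values are only for illustration.

```perl
use strict;
use warnings;
use PDL;

# A 1-D piddle: a vector with 3 elements
my $v = pdl( [ 1, 2, 3 ] );

# A 2-D piddle built from 2 rows of 3 values each
my $m = pdl( [ [ 1, 2, 3 ], [ 4, 5, 6 ] ] );

# dims() lists the fastest varying dimension first,
# so the column count (3) comes before the row count (2)
print join( ' x ', $v->dims ), "\n";
print join( ' x ', $m->dims ), "\n";

# Arithmetic is applied element by element
print $v * 2, "\n";
```

This ordering is why the snippet below uses getdim(1) to count the rows of a matrix.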

Cheers,

lin0

#!/usr/bin/perl
use warnings;
use strict;
use PDL;

# ================================
# mahalanobis: 
#
#   $distance = mahalanobis( $x, $y, $cov )
#
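#   computes the Mahalanobis distance from a point
#   $x to another point $y (from the same
#   distribution) or from a point $x to
#   the centre of a group of values $y
#
#   $cov is optional; if omitted, it is estimated
#   from $y with covariance(), which is not defined
#   in this snippet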
#
# ================================
sub mahalanobis {
    my ( $x, $y, $cov, $diff );
    if ( @_ < 3 ) {
        ( $x, $y ) = @_;
        $cov = covariance( $y );
    } else {
        ( $x, $y, $cov ) = @_;
    }
    
    if ( $y->getdim(1) > 1 ) {
        $diff = $x - average( $y->xchg(0,1) );
    } else {
        $diff = $x - $y;
    }
    
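    # quadratic form diff * inv(cov) * diff^T, returned as a 1x1 piddle
    my @dist = list( $diff x inv( $cov ) x transpose( $diff ) );
    
    # the Mahalanobis distance is the square root of that quadratic form
    return sqrt( $dist[0] );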
}
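
As a quick sanity check of the quadratic form used in the sub above, here is a small self-contained sketch. The two points and the diagonal covariance matrix are made-up values, chosen so the result is easy to verify by hand: the difference is (2, 0) and the variance along the first axis is 4, so the quadratic form works out to 2*2/4 = 1.

```perl
use strict;
use warnings;
use PDL;

# Hypothetical inputs: two 2-D points and a diagonal covariance matrix
my $x   = pdl( [ 2, 0 ] );
my $y   = pdl( [ 0, 0 ] );
my $cov = pdl( [ [ 4, 0 ], [ 0, 1 ] ] );

# Same computation as in the sub: diff * inv(cov) * diff^T
my $diff = $x - $y;
my ($squared) = list( $diff x inv( $cov ) x transpose( $diff ) );

# Plain Euclidean distance between the same points, for comparison
my $euclidean = sqrt( sum( $diff ** 2 ) );

printf "quadratic form:       %g\n", $squared;
printf "Mahalanobis distance: %g\n", sqrt( $squared );
printf "Euclidean distance:   %g\n", $euclidean;
```

The Euclidean distance between these points is 2, while the Mahalanobis distance is smaller because the first axis has the larger variance. That is the scale invariance mentioned at the top of the post.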

In reply to Computing the Mahalanobis distance with the Perl Data Language by lin0
