RFC: Presentation on Machine Learning with Perl

Greetings Fellow Monks,

Later this month, I will be giving a talk titled: Machine Learning Made Easy with Perl. A preliminary outline is:

Data gathering with Finance::YahooQuote
Data munging with Perl
Data visualization with PGPLOT and TriD
Data clustering with FCM implemented using PDL
Results visualization with PGPLOT

Data classification with SVM using a LIBSVM binary called with IPC::Open3

Data classification with Radial Basis Function Networks implemented using PDL

In each part, I plan to discuss the problem, the strategy to solve it, the choice of machine learning technique and the main configuration issues the participants need to understand to successfully deploy machine learning applications. I will also show snippets of the code used. For example:

For data gathering using Finance::YahooQuote:

#!/usr/bin/perl
use strict;
use warnings;

use Finance::YahooQuote;

my @symbols = ("IBM","DELL","GOOG","YHOO","MSFT","ORCL","SAP","COGN","BOBJ");
my @columns = (
    "Last Trade (Price Only)", "Last Trade Date", "Last Trade Time",
    "Day's Range", "52-week Range", "EPS Est. Next Year",
    "P/E Ratio", "PEG Ratio", "Dividend Yield",
);

my $arrptr = getcustomquote(\@symbols, \@columns);

my $i = 0;
foreach my $symbol (@symbols){
    my @quotes = @{$arrptr->[$i++]};
    print "$symbol\t@quotes\n";
}
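For the data munging step in the outline, a minimal sketch of turning the fields of one quote row into numbers might look like the following. It assumes (the post does not confirm this) that range columns such as "Day's Range" arrive as strings of the form "low - high" and that missing values arrive as "N/A". The helper names clean_number and parse_range are hypothetical and not part of Finance::YahooQuote.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a field into a number, or undef when it is
# missing. Assumes missing values look like "N/A" or an empty string.
# It returns an explicit undef so that it always yields exactly one value,
# even when called in list context.
sub clean_number {
    my ($field) = @_;
    return undef unless defined $field;
    $field =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
    $field =~ s/,//g;            # drop thousands separators
    return $field =~ /^[-+]?(?:\d+\.?\d*|\.\d+)$/ ? $field + 0 : undef;
}

# Hypothetical helper: split "low - high" into two numbers.
sub parse_range {
    my ($field) = @_;
    return ( undef, undef ) unless defined $field;
    my ( $low, $high ) = split /\s+-\s+/, $field, 2;
    return ( clean_number($low), clean_number($high) );
}

my ( $low, $high ) = parse_range("101.25 - 104.80");
printf "low=%.2f high=%.2f\n", $low, $high;    # prints "low=101.25 high=104.80"
```

Returning undef for missing fields, rather than 0, keeps gaps visible so that a later step can decide whether to drop or fill them.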

For the FCM:

use strict;
use warnings;

use PDL;
use PDL::NiceSlice;
# ================================
# fcm
# ( $performance_index, $prototypes, $current_partition_matrix) = 
#   fcm( $patterns, $partition_matrix, $fuzzification_factor,
#        $tolerance, $max_iter )
# ================================
sub fcm {
#
# fuzzy c means implementation
#
    my ( $patterns, $current_partition_matrix, $fuzzification_factor,
         $tolerance, $max_iter ) = @_;
    my ( $number_of_patterns, $number_of_clusters ) =
        $current_partition_matrix->dims();
    my ( $prototypes, $performance_index );
    my $iter = 0;
    while (1) {
        # computing each prototype
        my $temporal_partition_matrix =
            $current_partition_matrix ** $fuzzification_factor;
        my $temp_prototypes =
            ( $temporal_partition_matrix x $patterns )->xchg(1,0)
            / sumover($temporal_partition_matrix);
        $prototypes = $temp_prototypes->xchg(1,0);

        # copying partition matrix
        my $previous_partition_matrix = $current_partition_matrix->copy;

        # updating the partition matrix
        my $dist = zeroes($number_of_patterns, $number_of_clusters);
        for my $j (0..$number_of_clusters - 1){
            my $diff = $patterns - $prototypes(:,$j)->dummy(1, $number_of_patterns);
            $dist(:,$j) .= (sumover( $diff ** 2 )) ** 0.5;
        }

        my $temp_variable = $dist ** (-2/($fuzzification_factor - 1));
        $current_partition_matrix =
            $temp_variable / sumover($temp_variable->xchg(1,0));

        #
        # Performance Index calculation
        #
        $temporal_partition_matrix =
            $current_partition_matrix ** $fuzzification_factor;
        $performance_index = sum($temporal_partition_matrix * ( $dist ** 2 ));

        # checking stop conditions
        my $diff_partition_matrix =
            $current_partition_matrix - $previous_partition_matrix;
        $iter++;
        if ( ($diff_partition_matrix->max < $tolerance) || ($iter > $max_iter) ) {
            last;
        }
        print "iter = $iter\n";
    }
    return ( $performance_index, $prototypes, $current_partition_matrix );
}
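The partition-matrix update inside fcm follows the usual FCM membership rule: the membership of a pattern in a cluster is proportional to its distance to that cluster's prototype raised to the power -2/(m - 1), normalized so that the memberships of each pattern sum to 1. As a rough illustration only, here is a pure-Perl sketch of that rule for a single pattern. The name membership_row is hypothetical, and the sketch assumes every distance is nonzero.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: memberships of ONE pattern in each cluster, given its
# distances to the cluster prototypes and the fuzzification factor m > 1.
# Assumes every distance is nonzero (a pattern sitting exactly on a
# prototype would need special handling).
sub membership_row {
    my ( $distances, $m ) = @_;
    my @weights = map { $_ ** ( -2 / ( $m - 1 ) ) } @$distances;
    my $total   = sum(@weights);
    return map { $_ / $total } @weights;
}

# A pattern at distance 1 from the first prototype and 3 from the second.
my @u = membership_row( [ 1, 3 ], 2 );
printf "%.2f %.2f\n", @u;    # prints "0.90 0.10"
```

With m = 2 the weights are 1 and 1/9, so the closer prototype gets nine times the membership of the farther one. Larger values of m flatten the memberships toward equal shares.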

I expect the audience to be mainly Perl savvy people. However, the talk is open to all the people attending the conference. Therefore, some people in the audience might not be familiar with Perl.

The talk is scheduled to last 45 minutes. I plan to cover each part in about 10 minutes to leave between 5 and 10 minutes for questions and answers. I do not plan to explain the snippets in detail because I do not have enough time. However, I will make the code available for all those interested. My questions for you Fellow Monks are:

If you were attending this session, would you expect me to describe the code in detail?
Do you think it is a good strategy to concentrate on the machine learning part rather than on the Perl part?
What suggestions do you have in terms of points that I should (or should not) cover?
Any other suggestions or thoughts?

Thank you,

lin0

Update: Fixed typo in header of FCM sub

Re: RFC: Presentation on Machine Learning with Perl by Trizor (Pilgrim) on Jul 05, 2007 at 09:08 UTC
Provide a reason for using Perl versus something else, and for the modules you chose (I know several don't have alternatives). Also make sure that the FCM algorithm gets across despite any language barriers that may exist in your audience. I suggest showing a flowchart of the algorithm before the Perl implementation and then highlighting some of the stages within the Perl. Also check your function header for the fcm function; I don't think it is accurate.

Regarding the SVM part, try to explain SVM better than the Wikipedia article. I just couldn't grok it, so I don't have much else to say. Perhaps explain why you're using IPC::Open3 to talk to a library and not XS or Inline?

The third part seems rather easy to understand if one has a basic knowledge of ANNs and how they're represented mathematically. The one major inconsistency I find is that you talk a lot about doing things with data, but what data will you use? Will it be the stock market data mined at the beginning of Part I, for consistency, or will you use simpler data later on to allow the points to shine through?
Re^2: RFC: Presentation on Machine Learning with Perl by lin0 (Curate) on Jul 05, 2007 at 20:55 UTC
Hi Trizor,

Thank you very much for your feedback. I really appreciate it! I will address your comments one by one. Please let me know if I miss something ;-)

"Provide a reason for using Perl versus something else, and the modules you chose (I know several don't have alternatives)."

About Perl, I want to show that Perl is a valid alternative for machine learning. I do not claim that Perl is the best option for every application in which you might want to use machine learning. However, I claim that Perl can shine in different aspects, which is related to your second comment. The modules were selected to show different ways in which you can use Perl for machine learning (they represent only one of the many ways to do things using Perl):

For data gathering, visualization, and analysis (Part I): It really is easy to mine the web for data using Perl. Once you have the data, you can easily transform them into a format that facilitates further analysis. Perl also allows you to quickly plot the data to facilitate collaboration with the problem domain expert. The choice of Fuzzy C-Means (FCM) for data analysis has to do with my expertise in using it to make sense of data ;-) Writing an FCM implementation in Perl was one of the first things I did when learning Perl. So I am really proud of it :-)

For Decision Support Systems (Part II): Here, instead of using one of the CPAN modules for Support Vector Machines (SVM), I decided to call an SVM binary using IPC::Open3. The main reason for doing so is that I want to show that you can easily call applications written in other languages from Perl. This is just another way of using Perl for machine learning: you do the data gathering and preparation in Perl and then you call an application written in another language. The data for this part consists of image data and clinical records of patients with scoliosis who participated in a study we did at my University.
Note: the data is not publicly available because we do not have ethics approval to release it. Our ethics approval covers only data analysis in our lab.

For Pattern Recognition (Part III): The choice of writing my own radial basis function neural network code has to do with the fact that I like to learn by doing. Again, I translated some old code of mine to Perl. The data for this part comes from Environment Canada. The problem we wanted to solve was to classify storm cells into one of four possible classes: Hail, Rain, Tornado, Wind. Note: this data is not publicly available. It belongs to Environment Canada.

"Also make sure that the FCM algorithm gets across despite any possible language barriers that may exist in your audience. I suggest showing a flowchart of the algorithm before the Perl implementation and then highlighting some of the stages within the Perl. Also check your function header for the fcm function, I don't think it is accurate."

Explaining the FCM should not be that hard, considering that I have several years of experience presenting my research with it to general and scientific audiences. Regarding the function header, you are right. I will fix it as soon as I can.

"Regarding the SVM part, try to explain SVM better than the Wikipedia article. I just couldn't grok it so I don't have much else to say. Perhaps explain why you're using IPC::Open3 to talk to a library and not XS or Inline?"

I will do my best! I like to explain the SVM by comparing it with a neural network classifier on a two-class classification problem. In particular, I like to stress that while the outputs of the neural network classifier are obtained using any plane that separates the two classes, the outputs of the SVM are obtained using the plane that maximizes the separation between the classes. Regarding the use of IPC::Open3, I already explained that when answering your first set of comments.
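The contrast between "any separating plane" and "the plane that maximizes the separation" can be made concrete with a small sketch. For a separating line w.x + bias = 0, the geometric margin is the smallest value of |w.x + bias| / ||w|| over the training points, and an SVM looks for the separating line that makes this margin as large as possible. The points and the two candidate lines below are made up for illustration, and the sub name margin is hypothetical.

```perl
use strict;
use warnings;
use List::Util qw(min sum);

# Geometric margin of the line w.x + bias = 0 with respect to a set of
# points: the distance from the line to the closest point.
sub margin {
    my ( $w, $bias, $points ) = @_;
    my $norm = sqrt( sum( map { $_ ** 2 } @$w ) );
    return min(
        map {
            my $p = $_;
            abs( sum( map { $w->[$_] * $p->[$_] } 0 .. $#$w ) + $bias ) / $norm
        } @$points
    );
}

# Made-up data: two points per class, separable along the first coordinate.
my @points = ( [ 0, 1 ], [ 1, 0 ], [ 3, 1 ], [ 4, 0 ] );

# Both vertical lines separate the classes, but x = 2 leaves more room.
printf "%.1f %.1f\n",
    margin( [ 1, 0 ], -1.5, \@points ),    # line x = 1.5
    margin( [ 1, 0 ], -2,   \@points );    # line x = 2
# prints "0.5 1.0"
```

Both lines classify the made-up points correctly, yet the second one sits twice as far from its nearest point. That gap is the quantity an SVM optimizes.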
"The third part seems rather easy to understand if one has a basic knowledge of ANNs and how they're represented mathematically, the one major inconsistency I find is you talk a lot about doing things with Data, but what data will you use? Will it be the stock market data mined in the beginning of Part I for consistency or will you use simpler data later on to allow the points to shine through?"

As I mentioned above, the data for Parts II and III are different from that in Part I. For Part II, I will use clinical data. For Part III, I will use weather data. In my experience, the data in Part II is the most complex, followed by the data in Part III. The data in Part I is the simplest of the three.

Again, Trizor, thank you for your comments.

Cheers,

lin0
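As a rough illustration of the Part II approach (prepare the data in Perl, then hand it to a program written in another language), here is a minimal IPC::Open3 sketch. It is not the code from the talk. The sub name run_command is hypothetical, and the child process is the Perl interpreter itself ($^X), used as a stand-in for a classifier binary so that the example runs anywhere.

```perl
use strict;
use warnings;
use IPC::Open3;

# Hypothetical helper: run an external command and capture its output.
# Passing undef as the third argument makes the child's STDERR share the
# STDOUT handle, so a single read loop collects everything.
sub run_command {
    my (@cmd) = @_;
    my $pid = open3( my $to_child, my $from_child, undef, @cmd );
    close $to_child;                 # the child gets no input
    my @lines = <$from_child>;
    waitpid( $pid, 0 );
    my $exit_code = $? >> 8;
    chomp @lines;
    return ( $exit_code, @lines );
}

# Stand-in for a classifier binary: the Perl interpreter prints one
# predicted label per line.
my ( $exit_code, @labels ) = run_command( $^X, '-e', 'print "1\n-1\n1\n"' );
print "exit=$exit_code labels=@labels\n";
```

With a real binary, only the command list passed to run_command would change. Checking the exit code before trusting the output is what makes this kind of glue code dependable.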
Re: RFC: Presentation on Machine Learning with Perl by bibliophile (Prior) on Jul 05, 2007 at 14:17 UTC
Ok... this isn't directly relevant to your presentation, but it did twig a thought... I read a lot of online newspapers, subscribe to a lot of (too many!) RSS feeds, and have a huge list of sites I try to keep up with. In my perfect world, I'd have a system that could do a content / context scan of all this raw data, and present me with just the stuff I'm particularly interested in. I'd write the Parse::MeaningFromText and Mind::Read::MyInterests, but (what with all the reading I'm doing) I just don't have the time.... :-)
Re^2: RFC: Presentation on Machine Learning with Perl by lin0 (Curate) on Jul 05, 2007 at 21:06 UTC
Hi bibliophile,

It is a very good thought, indeed. However, you would need to think carefully and extensively about what kind of features the articles you are interested in have in common. You could use some sort of data clustering (FCM, maybe?) to help you with this task. You would then need to find a way to extract those features consistently. Finally, you could use a classifier to filter the raw data and present you only with the stuff you are interested in. When you design the classifier, try to incorporate a confidence index that tells you how reliable the results are. In this way, you could play with the outputs until you are happy with the results. Does it make sense?

Cheers,

lin0
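One possible reading of the "confidence index" idea, sketched under assumptions: given non-negative scores for each class (however the classifier produces them), report the best class together with the gap between the two largest normalized scores. The sub name, the class names, and the scores are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: pick the best class from non-negative scores and
# report a confidence index in [0, 1]: the gap between the two largest
# scores after normalizing them to sum to 1.
sub classify_with_confidence {
    my (%score) = @_;
    my $total = 0;
    $total += $_ for values %score;
    return ( undef, 0 ) unless $total > 0;

    # Highest score first; ties broken alphabetically for repeatability.
    my @ranked = sort { $score{$b} <=> $score{$a} || $a cmp $b } keys %score;
    my $best   = $score{ $ranked[0] } / $total;
    my $second = @ranked > 1 ? $score{ $ranked[1] } / $total : 0;
    return ( $ranked[0], $best - $second );
}

my ( $label, $confidence ) =
    classify_with_confidence( interesting => 0.7, boring => 0.2, spam => 0.1 );
printf "%s (confidence %.2f)\n", $label, $confidence;
# prints "interesting (confidence 0.50)"
```

A filter could then show only articles whose confidence exceeds a threshold, and the threshold becomes the knob to play with until the results look right.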
Re^3: RFC: Presentation on Machine Learning with Perl by bibliophile (Prior) on Jul 06, 2007 at 15:10 UTC
It does make sense... at least as far as my (quite limited) knowledge of ML goes :-) One of my always-backburnered thoughts was to build a neural-net-backed "observer" that would watch my browsing habits for a few months, noting things like how long I spend on a particular page, whether I follow links from it, etc., and from that be able to make predictions on stuff I might be interested in. One of these days^H^H^H^Hyears....
Re: RFC: Presentation on Machine Learning with Perl by mattr (Curate) on Jul 06, 2007 at 11:16 UTC
Hi there,

Sounds quite interesting. Any chance of videotaping it? You could put it on your site, YouTube (though low res), Zudeo, Democracy Player, etc. I enjoyed watching video on my iPod of sessions I missed at YAPC::Asia, although the code projected on the screen was too small to see. Certainly a bunch of videos on interesting subjects in Perl could be a great way to introduce people to it.

By the way, I did a survey of natural language parsing programs in Perl a while back just as an initial dip into it, but never actually had an opportunity to use those tools. I don't remember if it is Perl (a lot are Java but I think this is not), but have you used the UK program GATE, and are you going to talk about that sort of thing (head parsing, automatic categorization/chunking, extraction of key noun phrases, etc.)? I am not familiar with the apps you are talking about but am quite interested in how to easily incorporate machine learning into my Perl systems.

Matt
Re^2: RFC: Presentation on Machine Learning with Perl by lin0 (Curate) on Jul 06, 2007 at 13:32 UTC
"Hi there,"

Hi mattr,

"Sounds quite interesting. Any chance of videotaping it? You could put it on your site, YouTube (though low res), or Zudeo, or democracy player, etc. I enjoyed watching video on my iPod of sessions I missed at YAPC::Asia, although the code projected on the screen was too small to see. Certainly a bunch of videos on interesting subjects in Perl could be a great way to introduce people to it."

I would have to ask the organizers. Thanks for bringing that to my attention.

"By the way I did a survey of natural language parsing programs in Perl a while back just as an initial dip into it, but never actually had an opportunity to use those tools. I don't remember if it is Perl (a lot are Java but I think this is not) but have you used the uk program GATE and are you going to talk about that sort of thing (head parsing, automatic categorization/chunking, extraction of key noun phrases, etc.)? I am not familiar with the apps you are talking about but am quite interested in how to easily incorporate machine learning into my Perl systems."

I am not familiar with GATE. However, it looks to me like it is written in Java; at least that is what their SourceForge page says ;-) About natural language processing, I am not going to talk about it in this presentation. I decided to focus only on problems I have already worked on. In any case, natural language processing is certainly something I am interested in. So, I will be more than happy to read about any lead you have in that area.

Thank you,

lin0
Re^3: RFC: Presentation on Machine Learning with Perl by mattr (Curate) on Jul 07, 2007 at 13:13 UTC
Hi there,

I have not got a lot of experience with these but will provide you a few links here. Some (Lingua:: modules and Wordnet modules at least) are CPAN modules.

Article on Lingua::LinkParser.
List of various Perl modules for parsing including Sensenet, Wordnet, Duluth.
Stanford NLP resource list.

Hope these are a good starting point for you. A lot of NLP is in Java, but there are Perl resources too.

Matt
Re: RFC: Presentation on Machine Learning with Perl by toma (Vicar) on Jul 10, 2007 at 08:07 UTC
This sounds like an interesting talk and I hope to see it. It will be a great talk if I leave excited about what I can do with the tools, how I can get started, how to scale up from my starting point, and what to do if I get stuck.

Thanks for this opportunity to make requests about your talk. Here are some ideas, both general and specific. Please feel free to use or ignore any of them.

When I see a talk like this, I want to learn things that will help me get started more quickly if I decide to do something similar. I like to learn the approach for a minimal working example (like the synopsis). Then I am ready to hear a story about what you did to scale it up to solve a real-world problem. Typically I wouldn't really care about your exact example problem; I just want to explore the boundaries. How well does it scale?

I want inside information that isn't in the documentation. For example, when you have a problem, how do you get good help? Is the code problematic on some platforms? What tools, libraries and skills are needed to make the system work? Do APIs break at each release, or are they stable?

For example, I think that PGPLOT is great and I use it, but it can be hard to build and get working. If you really need the features, use it, but if you don't, there are much easier alternatives. It has the classic build problem of a large number of options and many dependencies, and I haven't figured out how to make it simple to use on simple problems. It would be of great value to me if you could explain an easy way to build and use PGPLOT on Windows, Linux and OSX ;-). This is the kind of inside information that makes attending the conference a good investment.

Perl is great for gathering huge amounts of data. The challenge quickly becomes solving problems with the dataset, for example those caused by network and server outages or other annoyances.

It would be good to hear about SVM and what sort of problems it is good for. For example, how much data do you need? How much more data is needed for each new feature? What if your data isn't perfectly clean? What types of data are usually used? When would I want to use Perl with LIBSVM instead of R or Matlab?

Thanks - I hope to attend your talk. It should work perfectly the first time!

- toma
Re: RFC: Presentation on Machine Learning with Perl by Tabari (Monk) on Jul 09, 2007 at 09:53 UTC
Wish I could attend. Since you want to cover a lot of ground, I would stress the fact that the entire project is doable in Perl, i.e. that the different kinds of libraries you need are all available and easy to use and install. So I would go for the combination of architecture and high-level (HL) solution in Perl. If the people are Perl-savvy enough, they should be able to read the code details; otherwise, they won't follow them anyhow.

Tabari
