Data Mining with Perl

by atlantageek (Monk)
on Nov 23, 2001 at 05:16 UTC (#127042, perlmeditation)

I have long had an interest in Data Mining, and though I have never worked in the field I do read the occasional book on the topic. One thing that seems kind of weird to me, though, is that Data Mining is done mostly with specialized software. I understand this to a certain extent; however, the whole point of Data Mining is to look at data in different ways. This would suggest that when starting a Data Mining project you should have a very flexible (not specialized) tool to do the initial research. A solid spreadsheet package seems to be a good starting point for any basic research, but it seems that Perl would be the next most useful tool. This should be even more true with the availability of unstructured data over the internet.

So my question is: has anyone used Perl heavily in Data Mining? I am interested in personal as well as corporate-level use.

If you don't think that Perl is a good fit, please explain why.
I always wanted to be somebody... I guess I should have been more specific.

Replies are listed 'Best First'.
Re: Data Mining with Perl :: Use the right tool for the job!
by jeroenes (Priest) on Nov 23, 2001 at 13:29 UTC
    Be warned. As a neuroscientist, I'm in the data analysis business more than the mining variant. However, the enormous data flow that my experiments generate, and their nature, cause my analysis to mimic mining a bit, IMHO. With this disclaimer in mind:

    Perl indeed is not an analysis tool per se. It is, however, hard to dismiss, both for its ability to handle various formats and for the short development time of your scripts. You will end up using different tools alongside each other:

    Use the right tool for the job!

    This is essential. Always think it through properly before you do something with a certain tool. Can this tool do the job? How much time will I have to spend learning the tool? How much time will I spend coding? How much time will I spend crunching numbers (or swapping memory space ;-)? Of course, don't spend more than an appropriate amount of time figuring this out.

    Which tools you want to use really depends on what you're going to do. For web grabbing, text manipulation, file manipulation and reporting, Perl is the tool you need. If you really have to work with matrices of data (so more variables per item or more items per variable than you can handle easily), I would seriously stay away from spreadsheets. They are pretty inflexible when it comes to restating your computations or recalculating your reports/graphs. Believe me, I started that way. I didn't know how fast I had to turn Excel down in favour of Turbo Pascal. Which is a pale toolkit compared to Perl.
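    A minimal sketch of the text-manipulation-plus-reporting kind of job meant here, using only core Perl. The record layout ("region,product,amount"), the sample values, and the sub name summarize are all invented for illustration; real data will need its own parsing rules.

```perl
use strict;
use warnings;

# Hypothetical input: one "region,product,amount" record per line.
my @lines = (
    "north,widget,12.50",
    "south,widget,7.25",
    "north,gadget,3.00",
    "  south , gadget , 10.75 ",   # stray whitespace, as real data often has
    "north,widget,not-a-number",   # malformed amount, skipped below
);

# Sum the amounts per region, skipping records that do not parse.
sub summarize {
    my @records = @_;
    my %total;
    for my $line (@records) {
        my @fields = split /,/, $line;
        s/^\s+|\s+$//g for @fields;          # trim each field in place
        next unless @fields == 3;
        my ($region, $product, $amount) = @fields;
        next unless $amount =~ /^-?\d+(?:\.\d+)?$/;
        $total{$region} += $amount;
    }
    return \%total;
}

# Report: one line per region, sorted by region name.
my $total = summarize(@lines);
printf "%-8s %8.2f\n", $_, $total->{$_} for sort keys %$total;
```

    Changing the question (totals per product, counts instead of sums) is a one-line edit to the hash key or the accumulator, which is the flexibility a spreadsheet makes awkward.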

    Perl has PDL for basic matrix manipulation. If you want to go further, you will end up with either Matlab or S-plus. Both have a very nice computation language, with extensive statistical tools. Moreover, it's really easy to write your own statistics and to plot the results. I'm a Matlab user myself, but S-plus is equally fit as far as I have heard.

    They both have open-source equivalents, Octave and 'R'. I don't know about R, but Octave is a decent clone when it comes to basic Matlab stuff; for many toolboxes and for nice graphs, though, you'll have to stick with Matlab. On the other hand, someone was posting an Inline::Octave proposal on the Inline mailing list. This could be very interesting. When I start 'R' I've got myself a nice window, but I can't tell you anything about its functionality.

    While I was writing the second paragraph, I got an idea, ran it through a Perl/Matlab/Origin cycle and was pretty excited with the results. (Origin is my favourite graphing program.) You see, I use quite a few tools in parallel myself.

    Feel free to /msg me if you want to know some more details.


    "We are not alone"(FZ)

      I gotta throw in a mention of Scilab at this point. Another free clone of Matlab, to the point where the code you write is almost completely compatible (a quick dose of Perl makes it completely compatible). Debian-compatible license, and a built-in tutorial. Lovely. Runs under X11, etc.

      I didn't believe in evil until I dated it.

Re: Data Mining with Perl
by jepri (Parson) on Nov 23, 2001 at 06:37 UTC
    Perl lacks the heavy duty statistical analysis tools that are needed to do really good data mining.

    There is no good reason why they couldn't be written, apart from the obvious "I don't have the time/need". At the moment Perl may be great at transforming data, but it ain't very hot at analysing it.

    Another reason may be that the sort of calculations needed for analysing stats are often heavy-duty floating point and other number ops, which Perl is only OK at (rather than being great at).
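    To make the "they could be written" point concrete, here is a minimal pure-Perl sketch of three basic routines: mean, sample standard deviation, and Pearson correlation. The sub names and the sample data are made up for illustration, and this says nothing about speed on large data sets, which is the concern raised above.

```perl
use strict;
use warnings;

# Arithmetic mean of a list of numbers.
sub mean {
    my @x = @_;
    die "mean: empty list\n" unless @x;
    my $sum = 0;
    $sum += $_ for @x;
    return $sum / @x;
}

# Sample standard deviation (divides by n - 1).
sub stddev {
    my @x = @_;
    die "stddev: need at least two values\n" if @x < 2;
    my $m  = mean(@x);
    my $ss = 0;
    $ss += ($_ - $m) ** 2 for @x;
    return sqrt($ss / (@x - 1));
}

# Pearson correlation coefficient of two equal-length array refs.
sub pearson {
    my ($x, $y) = @_;
    die "pearson: length mismatch\n" unless @$x == @$y;
    my ($mean_x, $mean_y) = (mean(@$x), mean(@$y));
    my ($sxy, $sxx, $syy) = (0, 0, 0);
    for my $i (0 .. $#$x) {
        my ($dx, $dy) = ($x->[$i] - $mean_x, $y->[$i] - $mean_y);
        $sxy += $dx * $dy;
        $sxx += $dx * $dx;
        $syy += $dy * $dy;
    }
    die "pearson: zero variance\n" if $sxx == 0 || $syy == 0;
    return $sxy / sqrt($sxx * $syy);
}

# Made-up sample data, for illustration only.
my @x = (1, 2, 3, 4, 5);
my @y = (2, 4, 6, 8, 10);
printf "mean=%.2f sd=%.4f r=%.4f\n", mean(@x), stddev(@x), pearson(\@x, \@y);
```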

    I didn't believe in evil until I dated it.

      Perl lacks the heavy duty statistical analysis tools that are needed to do really good data mining.

      I knew of one commercial data mining system that was written in Smalltalk, which also lacks heavy duty statistical analysis tools. That didn't stop people from building them.

      P.S. I'm agreeing with jepri here, in case that's unclear.

        Some Friday afternoon doggerel, by way of apology to dws for completely and utterly misinterpreting what he meant:

        I am the very model of a modern perlmonks posting troll,
        with arguments both scathing and totally irrational,
        I never stop to think the true intent of things I hear,
        All this cyber culture is just way too much for me to bear.

        Except that doesn't scan. Bugger.

        I didn't believe in evil until I dated it.
