Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Re^2: Module for intelligently analyzing and merging spreadsheet data

by nysus (Parson)
on Feb 10, 2019 at 23:45 UTC ( [id://1229751]=note: print w/replies, xml ) Need Help??


in reply to Re: Module for intelligently analyzing and merging spreadsheet data
in thread Module for intelligently analyzing and merging spreadsheet data

Thanks, yeah. I was actually looking at some of the Bayesian modules to see if they could help me determine if the spreadsheet had a header row. I couldn't quite figure out how to do it, though. And there seems to be two different Bayesian modules,the one you mentioned and this one: AI::NaiveBayes

I'm not sure which to use or if they do anything different.

Here is the code I was experimenting with:

my $classifier = AI::NaiveBayes->train( { attributes => { phone => 1, 'last name' => 1, 'fname' => 1, mobile => 1 }, labels => ['has header'] }, ); # Classify a feature vector my $result = $classifier->classify({fname => 5}); # $result is now a AI::NaiveBayes::Classification object my @predictors = $result->find_predictors;

I wasn't sure where to go from there or how to better train it.

$PM = "Perl Monk's";
$MCF = "Most Clueless Friar Abbot Bishop Pontiff Deacon Curate Priest Vicar";
$nysus = $PM . ' ' . $MCF;
Click here if you love Perl Monks

Replies are listed 'Best First'.
Re^3: Module for intelligently analyzing and merging spreadsheet data
by bliako (Monsignor) on Feb 11, 2019 at 10:37 UTC

    That's a good idea kschwab. I can explain some things but I have no experience with either module.

    In the train() mode, you need to pass on a lot of data cases, each in the form of an array of hashrefs, each as the one you already have in your post:

    { attributes => { phone => 1, 'last name' => 1, 'fname' => 1, mobile => 1 }, labels => ['has header'] },

    which means that, in this case predictor "phone" has a weight of 1, "last name" the same, etc. And you, the human, classified this case as "has header".

    What does a weight mean? Let's say here in your case it is the number of times it occured in your single data case. Each data case will have its own weights for each predictor. Weight can be other things or a combination, for example: number of times it occurs, whether it is capitalised, whether it is at the beginning of a sentence etc.etc.

    And on you continue with your next data case. etc. Ideally you should represent all labels, "has header" and I guess, "has no header". All these in a single hash array (of the hashrefs mentioned above) to be given as parameter to train()

    Then it's time to classify some unknown cases. Using the couplet:

    my $result = $classifier->classify({phone => 3, fname => 0, ...}); my $best_category = $result->best_category;

    $best_category will be one of "has header", "has no header" for that particular data case you classify(). The classifier $result can tell you also what influence each field/predictor has using my $predictors = $result->find_predictors; (see AI::NaiveBayes::Classification)

    The trick is to find some predictors that you think differentiate the two labels. For example one has far fewer "phone" and the other has a lot. Then a weight for each of the predictors has to be calculated by you, or naively put the number of occurences in each data case you have. Just to start. I am not sure of predictors with zero weight for that particular data case have to be mentioned in train() or will be inferred and set to zero if at least one data case mentions them and others do not. I think they will be inferred if absent from particular data case but present in at least one other data case.

    Forgot to mention that a data case can belong to many labels! That's why you have that arrayref in labels => [...] (note: data case = data row = a single observation)

    Code taken from AI::NaiveBayes

    bw, bliako

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1229751]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others romping around the Monastery: (5)
As of 2024-04-24 01:10 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found