Re^3: Module for intelligently analyzing and merging spreadsheet data

by bliako (Monsignor)
on Feb 11, 2019 at 10:37 UTC


in reply to Re^2: Module for intelligently analyzing and merging spreadsheet data
in thread Module for intelligently analyzing and merging spreadsheet data

That's a good idea, kschwab. I can explain some things, but I have no hands-on experience with either module.

In train() mode, you need to pass in a lot of data cases, in the form of an array of hashrefs, each shaped like the one you already have in your post:

{ attributes => { phone => 1, 'last name' => 1, 'fname' => 1, mobile => 1 }, labels => ['has header'] },

which means that, in this data case, the predictor "phone" has a weight of 1, "last name" the same, and so on. And you, the human, classified this case as "has header".

What does a weight mean? Let's say here, in your case, it is the number of times the predictor occurred in that single data case. Each data case will have its own weight for each predictor. A weight can also be other things, or a combination of things, for example: the number of times a word occurs, whether it is capitalised, whether it is at the beginning of a sentence, etc.
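For instance, computing occurrence counts as weights from a spreadsheet row could look like this (pure Perl; the row contents here are invented for illustration):

```perl
use strict;
use warnings;

# a hypothetical first row of a spreadsheet
my @row = ('phone', 'last name', 'fname', 'mobile', 'phone');

# weight = number of times each cell value occurs in this row
my %attributes;
$attributes{lc $_}++ for @row;

# %attributes now holds:
# ( 'phone' => 2, 'last name' => 1, 'fname' => 1, 'mobile' => 1 )
```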

And so you continue with your next data case, and the next, etc. Ideally you should represent all labels: "has header" and, I guess, "has no header". All of these go into a single array (of the hashrefs mentioned above) to be passed as the parameter list to train().
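Putting that together, a train() call covering both labels might look like the following sketch. The calling convention follows the AI::NaiveBayes synopsis; the data cases themselves are invented:

```perl
use AI::NaiveBayes;

my $classifier = AI::NaiveBayes->train(
    # cases you, the human, labelled "has header"
    { attributes => { phone => 1, 'last name' => 1, fname => 1, mobile => 1 },
      labels     => ['has header'] },
    { attributes => { email => 1, address => 1, phone => 1 },
      labels     => ['has header'] },
    # and cases you labelled "has no header"
    { attributes => { '555-1234' => 1, smith => 1, john => 1 },
      labels     => ['has no header'] },
);
```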

Then it's time to classify some unknown cases. Using the couplet:

my $result = $classifier->classify({phone => 3, fname => 0, ...});
my $best_category = $result->best_category;

$best_category will be one of "has header" or "has no header" for the particular data case you classify(). The $result object can also tell you what influence each field/predictor had, via my $predictors = $result->find_predictors; (see AI::NaiveBayes::Classification).
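Spelled out, the classification step could look like this (again following the AI::NaiveBayes synopsis; the attribute values are made up, and I dump the find_predictors result rather than guess at its exact structure — check AI::NaiveBayes::Classification for that):

```perl
use Data::Dumper;

# classify an unseen data case; keys are your predictors,
# values are the weights you computed for this case
my $result = $classifier->classify({ phone => 3, fname => 0, email => 1 });

# the winning label for this case
my $best_category = $result->best_category;
print "best: $best_category\n";

# inspect which predictors influenced the decision
print Dumper($result->find_predictors);
```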

The trick is to find some predictors that you think differentiate the two labels; for example, one label has far fewer occurrences of "phone" and the other has a lot. Then a weight for each predictor has to be calculated by you, or, naively, just use the number of occurrences in each data case — just to start. I am not sure whether predictors with zero weight for a particular data case have to be mentioned explicitly in train(), or whether they will be inferred and set to zero when at least one data case mentions them and others do not. I think they will be inferred if absent from a particular data case but present in at least one other.

Forgot to mention that a data case can belong to many labels! That's why labels => [...] takes an arrayref. (Note: data case = data row = a single observation.)
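So a single training case can carry more than one label; for example (the second label here, 'contact sheet', is a made-up illustration, not something from your data):

```perl
# one data case, two labels at once
{ attributes => { phone => 2, fname => 1 },
  labels     => ['has header', 'contact sheet'] },  # 'contact sheet' is hypothetical
```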

Code taken from AI::NaiveBayes

bw, bliako
