PerlMonks  

help with AI::Categorizer

by downer (Monk)
on Nov 09, 2007 at 16:48 UTC ( [id://649937]=perlquestion )

downer has asked for the wisdom of the Perl Monks concerning the following question:

AI::Categorizer has the potential to be a very nice tool for my research: a classifier with a Perl interface that integrates easily with my other scripts is enticing. However, the documentation is very poor. I know I could probably stumble around for a while and figure it out, but that seems like a waste of time. I am asking whether any monks out there have used this module before and, if so, whether you could provide an example of working code that covers specifying the initial documents, the training process, and the classification of new documents. Thanks!

Replies are listed 'Best First'.
Re: help with AI::Categorizer
by planetscape (Chancellor) on Nov 09, 2007 at 22:21 UTC

    Never used the module myself, sorry. But I did what I'd do in your place: I typed "AI::Categorizer" into Google's CodeSearch, and browsed the results. SyndromeSurveillance.pm looked interesting.

    HTH,

    planetscape
Re: help with AI::Categorizer
by jdporter (Paladin) on Nov 10, 2007 at 03:34 UTC

    I know about AI::Categorizer from discussions on the perl.ai mailing list. So I would do a Google Groups search: perl.ai AI::Categorizer. If that doesn't help much, you could try actually signing up on that mailing list and asking there. Ken Williams may still be listening on that frequency. Heck - you could even try contacting Ken directly.

    A word spoken in Mind will reach its own level, in the objective world, by its own weight
Re: help with AI::Categorizer
by randyk (Parson) on Nov 10, 2007 at 21:15 UTC
    Here's an example which uses the CPAN subject categories for the training set, and then classifies modules according to which category they probably best fit into:
    use strict;
    use warnings;
    require AI::Categorizer;
    require AI::Categorizer::Learner::NaiveBayes;
    require AI::Categorizer::Document;
    require AI::Categorizer::KnowledgeSet;
    require Lingua::StopWords;

    # set up features:
    # - give different weights to subjects and bodies
    # - use stop words
    my %features = (content_weights => {subject => 2, body => 1},
                    stopwords       => Lingua::StopWords::getStopWords('en'),
                    stemming        => 'porter',
                   );

    # this is the raw data to train with, which associates
    # numerical categories with subjects and bodies
    my $chaps = {
        6  => {subject => q{Data Type Utilities},
               body    => q{Date Time Math List Tree Algorithm Sort},
              },
        10 => {subject => q{File Names Systems Locking},
               body    => q{Directory Dir Stat cwd},
              },
        12 => {subject => q{Opt Arg Param Proc},
               body    => q{Option Argument Argv Config Getopt},
              },
        14 => {subject => q{Security and Encryption},
               body    => q{Authentication Crypt Digest PGP Des},
              },
        15 => {subject => q{World Wide Web HTML HTTP CGI},
               body    => q{WWW Apache MIME Kwiki URI URL},
              },
        17 => {subject => q{Archiving and Compression},
               body    => q{tar gzip gz zip bzip},
              },
        18 => {subject => q{Images Pixmaps Bitmaps},
               body    => q{Chart Graphic},
              },
        19 => {subject => q{Mail and Usenet News},
               body    => q{Sendmail NNTP SMTP IMAP POP3 MIME},
              },
    };

    # create documents from $chaps to train with
    my $docs;
    foreach my $cat (keys %$chaps) {
        $docs->{$cat} = {categories => [$cat],
                         content    => {subject => $chaps->{$cat}->{subject},
                                        body    => $chaps->{$cat}->{body},
                                       },
                        };
    }

    my $c = AI::Categorizer->new(
        knowledge_set => AI::Categorizer::KnowledgeSet->new(name => 'CSL'),
        verbose       => 1,
    );

    while (my ($name, $data) = each %$docs) {
        $c->knowledge_set->make_document(name => $name, %$data, %features);
    }

    my $learner = $c->learner;
    $learner->train;

    # this is a test data set to categorize,
    # based on the training done above
    my $test_set = {
        'Math::Complex' => {content => {subject => q{Math},
                                        body    => q{Complex number data type} } },
        'Archive::Zip'  => {content => {subject => q{Compression},
                                        body    => q{Interface to ZIP archive files} } },
        'Apache2::URI'  => {content => {subject => q{Apache},
                                        body    => q{Perl API for manipulating URIs} } },
        'MIME::Lite'    => {content => {subject => q{Mail},
                                        body    => q{Create MIME/SMTP mails w/attachements} } },
    };

    # see what category each element of $test_set gets put into,
    # using a threshold score of 0.9
    my $threshold = 0.9;
    while (my ($name, $data) = each %$test_set) {
        my $doc = AI::Categorizer::Document->new(name    => $name,
                                                 content => $data->{content},
                                                 %features);
        my $r = $learner->categorize($doc);
        $r->threshold($threshold);
        my $b = $r->best_category;
        next unless $r->in_category($b);
        printf("%s is in category %d, with score %.3f\n",
               $name, $b, $r->scores($b));
    }
    This produces:

    Archive::Zip is in category 17, with score 0.998
    Apache2::URI is in category 15, with score 0.917
    MIME::Lite is in category 19, with score 1.000
    Math::Complex is in category 6, with score 0.997

      This is quite an old post, but maybe somebody can help with my question.

      I'm trying to use the threshold method as shown by randyk in the example above, but it is simply not working: no matter what value I pass, the method ignores it. The AI::Categorizer::Hypothesis object $r does have a threshold attribute with a defined value, but the documentation does not make clear how that value gets set.

      Does anyone know how to define a threshold? I'm getting some results with lower scores that I don't want to work with.

      Alceu Rodrigues de Freitas Junior
      ---------------------------------
      "You have enemies? Good. That means you've stood up for something, sometime in your life." - Sir Winston Churchill
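
      One workaround for the threshold question, which does not depend on how the Hypothesis object stores its threshold, is to compare the scores yourself. The example above already reads a score with $r->scores($b), so the same value can be checked against a cutoff of your own. What follows is only a minimal sketch: categories_above is a hypothetical helper (not part of AI::Categorizer), the scores are made-up numbers standing in for what $r->scores(...) would return, and treating a score equal to the cutoff as passing is an arbitrary choice.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of AI::Categorizer: given a hash ref of
# category => score and a cutoff, return the categories whose score is at
# or above the cutoff, highest score first (ties broken by category name).
sub categories_above {
    my ($scores, $cutoff) = @_;
    return sort { $scores->{$b} <=> $scores->{$a} || $a cmp $b }
           grep { $scores->{$_} >= $cutoff }
           keys %$scores;
}

# Made-up scores standing in for values read from a hypothesis object
my %scores = (6 => 0.12, 15 => 0.917, 17 => 0.35);

my @kept = categories_above(\%scores, 0.9);
print "kept: @kept\n";
```

      In the loop from the example, the equivalent check would be to skip a document unless $r->scores($b) is at least $threshold, instead of relying on $r->in_category($b).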

Node Type: perlquestion [id://649937]
Approved by Corion