PerlMonks  

Re: Guessing/Ordering Partial Data

by mattr (Curate)
on Apr 14, 2005 at 07:29 UTC ( [id://447667] )


in reply to Guessing/Ordering Partial Data

I'd like to point you somewhere and then offer my own swing at this.

One approach is to build a reverse (inverted) index, which maps each word to the entries that contain it. You might like to check out an old favorite article of mine, Building a Vector Space Search Engine in Perl.

Lingua::Stem::Fr may also help improve accuracy by stemming the French words. You can also follow that article's suggestion of keeping a stop list of "bad words" and removing de, la, du, etc. from your dictionary.
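To make the reverse index idea concrete, here is a minimal sketch of one way it could look. It is not the code from the article: the sub names (tokens, build_index, lookup), the stop list contents, and the three sample locations are all my own assumptions for illustration, and stemming is left out entirely.

```perl
use strict;
use warnings;

# Hypothetical stop list; extend it to suit your own data.
my %stop = map { $_ => 1 } qw(de la du le les des);

# Split a string into lowercase words, dropping stop words.
sub tokens {
    my ($text) = @_;
    return grep { !$stop{$_} } map { lc $_ } $text =~ /(\w+)/g;
}

# Build the index: word => { location => 1, ... }.
sub build_index {
    my @locations = @_;
    my %index;
    for my $loc (@locations) {
        $index{$_}{$loc} = 1 for tokens($loc);
    }
    return \%index;
}

# Score each location by how many of the query words it contains.
sub lookup {
    my ($index, $query) = @_;
    my %score;
    for my $word (tokens($query)) {
        $score{$_}++ for keys %{ $index->{$word} || {} };
    }
    return \%score;
}

my $index = build_index(
    'Place De La Gare - Rennes',
    'Place De La Gare - Nevers',
    'Place Thiers - Nancy',
);
my $score = lookup($index, 'gare de rennes');
print "$_: $score->{$_}\n" for sort keys %$score;
```

Because the index is keyed on whole words, a query only matches complete words, and "de" never reaches the index at all. That is the trade-off against substring grepping.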

But in your guesses you seem to want phrase matching, and a word-based index does not directly support that. There are more sophisticated algorithms, but if you want phrases I'd say brute force (grepping and keeping track of hits) is best for this case. It is not difficult algorithmically, and for only a hundred items it will not be slow if you loop through the list just once for each search phrase. Note that a hash key can have spaces in it.

That said, here is my shot at it. My strategy is simple, with the added attraction that it keeps score, shows only the highest scoring hits, and lets you search for phrases (at least it seems to work that way so far). If you want to take the phrases from the command line, have a look at @ARGV.

#!/cygdrive/c/Perl/bin/perl
# http://www.perlmonks.org/?node_id=447234
use strict;
use warnings;

# Read the locations from DATA, skipping blank lines and exact duplicates
# so that a repeated entry is not scored twice. No lowercasing is needed
# because the match below is case-insensitive.
my @loc;
my %seen;
while (<DATA>) {
    chomp;
    next if $_ eq '' || $seen{$_}++;
    push @loc, $_;
}
#print "Available locations:\n" . join("\n", sort @loc);

#my @phrases = ("Place de la Gare", "Rennes");
my @phrases = ("gare", "er", "n");

# One point per phrase found in a location. \Q...\E makes the phrase
# match literally even if it contains regex metacharacters.
my %score;
foreach my $phrase (@phrases) {
    $score{$_}++ for grep { /\Q$phrase\E/i } @loc;
}

# Group locations by score and remember the highest score seen.
my @hits;
my $hiscore = 0;
foreach my $hit (keys %score) {
    my $s = $score{$hit};
    $hiscore = $s if $s > $hiscore;
    push @{ $hits[$s] }, $hit;
}

# just print highest scoring ones
print "Top scoring matches with a score of $hiscore:\n";
foreach my $toploc (@{ $hits[$hiscore] || [] }) {
    print "$toploc\n";
}

__DATA__
Place De La Gare - Angers
Place De La Gare - Nevers
Place Mohammed V - Oujda
Place De La Gare - Rennes
Place de la Gare - Quimper
Place Thiers - Nancy
Place De La Gare - Grenoble
Place Du Chateau - Galerie Marchande Du Rer
Place De La Gare - Angers
Place De La Gare 1 - Bannes Grenoble
Place De La Gare - Nevers
Place De La Gare - Rennes
Place De La Gare bannes
Place de la Gare
Place de la Gare - Bergerac
Place de la Gare - Moutiers
Place de la Gare - Libourne
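As for the @ARGV hint, a rough sketch of a command-line variant might look like the following. The sub name top_matches, the short inline location list, and the fallback phrases are assumptions of mine to keep the example self-contained; in practice you would feed it the real list.

```perl
use strict;
use warnings;

# Return the best score followed by the locations that reached it.
sub top_matches {
    my ($locations, @phrases) = @_;
    my %score;
    for my $phrase (@phrases) {
        # Literal, case-insensitive substring match; one point per phrase.
        $score{$_}++ for grep { /\Q$phrase\E/i } @$locations;
    }
    my $best = 0;
    for my $s (values %score) {
        $best = $s if $s > $best;
    }
    my @top = grep { $score{$_} == $best } keys %score;
    return ($best, sort @top);
}

# Hypothetical sample data, assumed to contain no duplicates.
my @locations = (
    'Place De La Gare - Rennes',
    'Place De La Gare - Nevers',
    'Place Thiers - Nancy',
);

# Phrases come from the command line; the fallback is for demonstration only.
my @phrases = @ARGV ? @ARGV : ('Place de la Gare', 'Rennes');

my ($best, @top) = top_matches(\@locations, @phrases);
print "Top scoring matches with a score of $best:\n";
print "$_\n" for @top;
```

Quote multi-word phrases on the command line so the shell passes each one as a single argument, e.g. `perl guess.pl "Place de la Gare" Rennes` (the script name is made up).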
