Re: Guessing/Ordering Partial Data

I'd like to point you somewhere and then offer my own swing at this.

One approach is to build a reverse index (usually called an inverted index), which maps each word to the entries that contain it. You might like to check out an old favorite article of mine, Building a Vector Space Search Engine in Perl.
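To make the reverse-index idea concrete, here is a minimal sketch of my own (it is not code from the article). The sub name build_index and the two sample locations are made up for the example; it simply maps each lowercased word to the locations that contain it.

```perl
use strict;
use warnings;

# Hypothetical helper: build a word => [locations] index.
sub build_index {
    my @locations = @_;
    my %index;
    for my $loc (@locations) {
        my %seen;
        # Split on anything that is not a letter or digit.
        for my $word (grep { $_ ne '' } split /[^[:alnum:]]+/, lc $loc) {
            next if $seen{$word}++;    # index each word once per location
            push @{ $index{$word} }, $loc;
        }
    }
    return \%index;
}

my $index = build_index(
    'Place De La Gare - Rennes',
    'Place Thiers - Nancy',
);

# Looking up a word is now a single hash access.
print "$_\n" for @{ $index->{gare} || [] };
```

A lookup for a word that is not in the index falls back to an empty list, so nothing is printed for it.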

Lingua::Stem::Fr may also help improve accuracy by reducing French words to their stems. You can also use the above article's suggestion of keeping a stop-word list and removing de, la, du, etc. from your dictionary.
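A rough sketch of the stop-word filtering, using only core Perl. The stop list and the sub name content_words are my own assumptions for illustration; stemming is left out here.

```perl
use strict;
use warnings;

# Assumed stop list; extend it to suit your data.
my %stop = map { $_ => 1 } qw(de la du le les des);

# Hypothetical helper: lowercase a string and return its non-stop words.
sub content_words {
    my ($text) = @_;
    return grep { $_ ne '' && !$stop{$_} } split /[^[:alnum:]]+/, lc $text;
}

print join(' ', content_words('Place De La Gare - Rennes')), "\n";
```

Running the filter before you build your dictionary keeps the very common little words from matching nearly every entry.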

But in your guesses you seem to want phrase matching, which that approach does not directly support. There are more sophisticated algorithms, but if you want phrases I'd say brute force is best for this case: grep for each phrase and keep track of the hits. It is not difficult algorithmically, and for only a hundred items it will not be slow if you loop through the list just once per phrase. Note that a hash key can have spaces in it, so a whole phrase or location can serve as a key.
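On the hash-key point, a tiny illustration (the locations and counts are made up): the whole line, spaces and all, works as a key, so the tally can be kept per location.

```perl
use strict;
use warnings;

my %score;
$score{'Place de la Gare - Rennes'}++;    # the full string is the key
$score{'Place de la Gare - Rennes'}++;
$score{'Place Thiers - Nancy'}++;

# List the locations, highest score first.
print "$score{$_}  $_\n" for sort { $score{$b} <=> $score{$a} } keys %score;
```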

That said, here is my shot at it. My strategy is simple, and it has the added attraction of keeping score, showing only the highest-scoring hits, and letting you search for phrases (at least it seems to work that way so far). If you want to take the phrases from the command line, take a look at @ARGV.
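A quick aside on the @ARGV route before the script itself: a minimal sketch of the one change it would need. The helper name phrases_from is hypothetical; it uses the command-line arguments when there are any and falls back to the defaults otherwise.

```perl
use strict;
use warnings;

# Hypothetical helper: use command-line arguments when given, else defaults.
sub phrases_from {
    my ($args, @defaults) = @_;
    return @$args ? @$args : @defaults;
}

my @phrases = phrases_from(\@ARGV, 'gare', 'er', 'n');
print "Searching for: @phrases\n";
```

Quote a multi-word phrase on the command line so the shell passes it as a single argument, e.g. "Place de la Gare".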

#!/cygdrive/c/Perl/bin/perl

# http://www.perlmonks.org/?node_id=447234

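use strict;
use warnings;

# Read the candidate locations, one per line, from the __DATA__ section.
my @loc;
while (<DATA>) {
    chomp;
    push @loc, $_;    # case is left alone; the match below is case-insensitive
}

#print "Available locations:\n" . join("\n", sort @loc);

# Each location earns one point for every search phrase it contains.
my %score;
#my @phrases = ("Place de la Gare", "Rennes");
my @phrases = ("gare", "er", "n");

foreach my $phrase (@phrases) {
    # \Q...\E matches the phrase literally, even if it holds regex metacharacters
    my @matches = grep { /\Q$phrase\E/i } @loc;
    $score{$_}++ for @matches;
}

# Bucket the locations by score and remember the highest score seen.
my @hits;
my $hiscore = 0;
foreach my $hit (keys %score) {
    my $s = $score{$hit};
    $hiscore = $s if $s > $hiscore;
    push @{ $hits[$s] }, $hit;
}

# just print highest scoring ones

print "Top scoring matches with a score of $hiscore:\n";
foreach my $toploc (@{ $hits[$hiscore] || [] }) {
    print "$toploc\n";
}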

__DATA__
Place De La Gare - Angers
Place De La Gare - Nevers
Place Mohammed V -  Oujda
Place De La Gare - Rennes
Place de la Gare - Quimper
Place Thiers -  Nancy
Place De La Gare -  Grenoble
Place Du Chateau - Galerie Marchande Du Rer
Place De La Gare -  Angers
Place De La Gare 1 - Bannes Grenoble
Place De La Gare -  Nevers
Place De La Gare -  Rennes
Place De La Gare bannes
Place de la Gare
Place de la Gare - Bergerac
Place de la Gare - Moutiers
Place de la Gare - Libourne