Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

Finding similar data

by blackpitch (Acolyte)
on Jun 14, 2005 at 06:36 UTC ( [id://466405]=perlquestion: print w/replies, xml ) Need Help??

blackpitch has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to find similar objects in a data set.
For example, I have
sem
sema
cat
semok
hgt
cato
seme
cate
A function would take the data & show that, according to a certain criteria (first 3 letters in this case):
sem, sema, semok, seme are similar
cat, cato, cate are similar
The problem is I don't know what to call this method in order for me to search on CPAN for modules handling such a search method or even on google to find some algorithms.
Taking the first 3 letters is just an example. I want a general algorithm to find the data. To give a clearer example:
A user might search in a database for Jons & the results would be: Jons
Jones
Joneses
Jins

Replies are listed 'Best First'.
Re: Finding similar data
by kaif (Friar) on Jun 14, 2005 at 07:05 UTC
Re: Finding similar data
by davidj (Priest) on Jun 14, 2005 at 06:57 UTC
    You could use the substr method to get the first 3 letters and use a hash of arrays to store the matching strings. While probably not the most efficient, the following might be a good place for you to start:
    #!/usr/bin/perl use strict; use Data::Dumper; my (@words, $substr, %hash); @words = qw/sem sema cat semok hgt cato seme cate/; foreach (@words) { $substr = substr( $_, 0, 3 ); push( @{ $hash{$substr} }, $_ ); } print Dumper(\%hash); exit; $VAR1 = { 'cat' => [ 'cat', 'cato', 'cate' ], 'sem' => [ 'sem', 'sema', 'semok', 'seme' ], 'hgt' => [ 'hgt' ] };

    hope this helps,
    davidj
Re: Finding similar data
by monarch (Priest) on Jun 14, 2005 at 06:57 UTC
    I'm sure there's a module on CPAN.

    But if your query is as simple as only sorting to first 3 letters, then the following may do the trick:

    my @words = qw(sem sema cat semok hgt cato seme cate); my %similar = (); foreach my $word ( @words ) { push( @{$similar{substr($word,0,3)}}, $word ); } foreach my $prefix ( sort keys %similar ) { print( join( ', ', @{$similar{$prefix}} ) ); print( "\n" ); }
    prints:
    cat, cato, cate hgt sem, sema, semok, seme
Re: Finding similar data
by planetscape (Chancellor) on Jun 14, 2005 at 08:37 UTC
Re: Finding similar data
by TedPride (Priest) on Jun 14, 2005 at 07:41 UTC
    It's easy to sort the words into sets if it's you're just matching the first 3 characters. See above. But what if the criteria are more complex? What if the words are considered a match if a block of letters inside the words comprising at least x% of the total letters matches? You can't use keys and hashes for that. Does every word in a set have to match every other word in the set, or can the words be linked by "joiner" words that may match two or more words that don't match each other? Both situations create interesting algorithmic problems. Probably the best thing to do is separate your match criteria from your set criteria by having match be a sub:
    my $match = sub { return substr($_[0], 0, 3) eq substr($_[1], 0, 3); } +; print $match->('aaaa','aaab');
    And then passing it to your set sub:
    myset($match, @words); sub myset { my $match = shift; # Do whatever set algorithm here, # using $match to compare words }
    This allows you to easily test multiple match criteria without messy duplication of code. I'd have included set code too, but you weren't very specific as to what you really want in terms of match / set criteria. First three letters won't get you much of anywhere.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://466405]
Approved by jbrugger
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others admiring the Monastery: (3)
As of 2024-04-25 19:36 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found