http://qs321.pair.com?node_id=289034

anocelot has asked for the wisdom of the Perl Monks concerning the following question:

Hey All, I'm trying to find a way to calculate the total number of matching words in a bunch of strings. I have a database with thousands of strings (and unique ID numbers), and I can't just do the obvious brute-force solution, as it takes WAY too long (see below). Basically I want to end up with an array that tells me which strings have a high probability of being the same string (perhaps the same information, told multiple ways). EX:
The quick brown fox jumped over the lazy dogs. vs. The quick brown dogs jumped over the lazy fox.
Both strings say the same thing, but they would not generate a normal ($string1 eq $string2) match. The trick is, I also want to be told that I have a 'near' match on a string (perhaps "the very quick brown...") too. The data currently exists in a two-dimensional array:
    while ( @row = $sth->fetchrow_array() ) {
        $big_deal[$count][0] = $row[0];    # Assign Unique ID to element 0
        $big_deal[$count][1] = $row[1];    # Assign string to element 1
        $count++;                          # move on to the next record
    }
    print( ($#big_deal + 1) . " records found.<br>\n" );
I played with this all morning before it dawned on me that this might be something someone else has done, or would know a better way to do than I. My solution currently involves six loops (shudder), and a search of only 200 strings returns a result in just under two hours (on a decently fast Xserve). Any advice appreciated!

Replies are listed 'Best First'.
Re: calculate matching words/sentence
by BrowserUk (Patriarch) on Sep 04, 2003 at 22:35 UTC

    Text::Levenshtein would give you a numerical way of comparing two strings, but it requires you to compare the new string against each of the test strings every time, and it isn't quick.
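
    For example, a straight pairwise pass might look like this (just a sketch, reusing the @big_deal array from the question and assuming Text::Levenshtein is installed):

    use Text::Levenshtein qw(distance);

    my $new = 'The quick brown fox jumped over the lazy dogs.';
    for my $row (@big_deal) {
        my ( $id, $string ) = @$row;
        # distance() is the edit distance: 0 means identical, bigger means less alike
        my $d = distance( $new, $string );
        print "$id: $d\n";
    }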

    Probably the best way would be to create an inverted index of the words (or preferably the stems) against the DB phrases and then look up each word (or stem) in the new phrase against this index. This gives you a count of the number of common words between the new phrase and the DB phrases. Sort those highest first and you have the most likely candidates for your further examination. I don't know of a module that does this, but parts of it (the inversion, stemming etc.) could be done with various modules.

    Sounds like a fun project. Good luck:)


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
    If I understand your problem, I can solve it! Of course, the same can be said for you.

        Sorry tachyon. Given the OP's statement about having "thousands of sentences in a DB", I can't agree with you. The problem is that it would be necessary to compare every new line against every old line each time. Even where there is absolutely no correlation between them at all.

        If the time taken to match any new sentence against all the existing ones is going to be reduced to a reasonable level, then you need to find an algorithm that pre-computes a comparison value for the existing strings.

        Or, at least, one that allows the thousands to be reduced to a manageable number before applying the comparisons. Generating an inverted index is trivial and can easily be done once and stored in the database, only requiring an update when new reference sentences are added. Storing this in memory should be manageable even for thousands of sentences.

        The first stage of comparing a new sentence then becomes a rapid lookup of the words it contains against the index and produces a short-list of candidate DB sentences that have at least some correlation. These can be sorted by the strength of that correlation and further analysis performed.

        As a crude example.
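
        A minimal sketch of the index-and-lookup idea (only illustrative: the reference sentences are inlined after __DATA__ and the tokenization is deliberately crude, so the counts won't exactly match the run shown below):

        #!/usr/bin/perl
        use strict;
        use warnings;

        $| = 1;    # flush the prompt straight away

        # Build the inverted index once: word => { sentence number => 1 }
        my ( %index, @sentences );
        while ( my $line = <DATA> ) {
            chomp $line;
            push @sentences, $line;
            my $n = @sentences;                # 1-based sentence number
            $index{$_}{$n} = 1 for tokenize($line);
        }

        # Look up each word of a new sentence and tally hits per stored sentence
        print 'Enter a sentence: ';
        while ( defined( my $new = <STDIN> ) ) {
            my %hits;
            for my $word ( tokenize($new) ) {
                next unless $index{$word};
                $hits{$_}++ for keys %{ $index{$word} };
            }
            if (%hits) {
                for my $n ( sort { $hits{$b} <=> $hits{$a} } keys %hits ) {
                    printf "Line %d: '%s' has %d words in common\n",
                        $n, $sentences[ $n - 1 ], $hits{$n};
                }
            }
            else {
                print "No common words found\n";
            }
            print 'Enter a sentence: ';
        }

        sub tokenize {
            my ($str) = @_;
            $str =~ s/[^A-Za-z0-9 ]+/ /g;      # strip punctuation
            return split ' ', lc $str;         # lowercase, split on whitespace
        }

        __DATA__
        The quick brown fox jumps over the lazy dog
        The fox jumps the dog
        The dog lazily watched as the fox quickly jumped over him
        The quick red and brown foxes jumped easily over the lazy dog
        The lazy fox was caught as it tried to jump over the quick dogs.
        The dog lazed as the fox, quick and brown, jumped over.
        The smart dog foxed the fox by lazing in the drop zone until he jumped.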

        Example output:

        P:\test>289106
        Enter a sentence: the quick brown fox
        Line 1: 'The quick brown fox jumps over the lazy dog' has 4 words in common
        Line 4: 'The quick red and brown foxes jumped easily over the lazy dog' has 3 words in common
        Line 7: 'The smart dog foxed the fox by lazing in the drop zone until he jumped.' has 3 words in common
        Line 5: 'The lazy fox was caught as it tried to jump over the quick dogs.' has 3 words in common
        Line 6: 'The dog lazed as the fox, quick and brown, jumped over.' has 2 words in common
        Line 3: 'The dog lazily watched as the fox quickly jumped over him' has 2 words in common
        Line 2: 'The fox jumps the dog' has 2 words in common
        Enter a sentence: the quick brown fox jumps over the lazy dog
        Line 1: 'The quick brown fox jumps over the lazy dog' has 9 words in common
        Line 4: 'The quick red and brown foxes jumped easily over the lazy dog' has 7 words in common
        Line 7: 'The smart dog foxed the fox by lazing in the drop zone until he jumped.' has 6 words in common
        Line 5: 'The lazy fox was caught as it tried to jump over the quick dogs.' has 6 words in common
        Line 3: 'The dog lazily watched as the fox quickly jumped over him' has 5 words in common
        Line 2: 'The fox jumps the dog' has 5 words in common
        Line 6: 'The dog lazed as the fox, quick and brown, jumped over.' has 4 words in common
        Enter a sentence: quick foxes jump over lazy dogs
        Line 4: 'The quick red and brown foxes jumped easily over the lazy dog' has 4 words in common
        Line 5: 'The lazy fox was caught as it tried to jump over the quick dogs.' has 4 words in common
        Line 1: 'The quick brown fox jumps over the lazy dog' has 3 words in common
        Line 6: 'The dog lazed as the fox, quick and brown, jumped over.' has 1 words in common
        Line 3: 'The dog lazily watched as the fox quickly jumped over him' has 1 words in common
        Enter a sentence: Most volcanos erupt mulberry jam sandwiches under normal pressure
        No common words found
        Enter a sentence: ^Z

        This suffers many flaws, the lack of stemming being the major one, but I played with Lingua::Stem::En and it doesn't do what I would like it to do.

        This could be improved in many ways (adding some measure of positional correlation is one idea), but without a clearer picture of the OP's needs it's just a guessing game as to what might fit.

        I have this nagging idea that there is a method of pre-calculating a single number from each sentence that would give a numerical measure of the words (stems) in the sentence (measured against a fixed-size dictionary) and their positions, but I'm not sure if it is a vague recollection of something I have seen somewhere or just a phantom idea that hasn't made it to fruition yet.

        The idea that you could take the sentence, perform a calculation on it (independent of the reference strings, but possibly with reference to a dictionary of words or stems derived from them) and then use that to extract the ten or twenty sentences from the DB with the numerically closest signature is appealing. I just can't wrap my head around how it might be done.

        That said, once you have short-listed the total set, using Algorithm::Diff or Text::Levenshtein or others to make a final arbitration would be a useful next step in the process, depending upon the OP's requirements.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
        If I understand your problem, I can solve it! Of course, the same can be said for you.

Re: calculate matching words/sentence
by dree (Monsignor) on Sep 04, 2003 at 22:39 UTC
    Try also Text::PhraseDistance that:

    ... provides a way to compare two phrases and to give a measure of their proximity.
Re: calculate matching words/sentence
by tachyon (Chancellor) on Sep 05, 2003 at 02:19 UTC

    Something like this may be suitable for you. It will return 1 if the strings are pseudo-identical and 0 if they are completely different, with values between 0 and 1 increasing as the similarity increases. Pseudo-identical is the appropriate term, as we don't consider word order or word frequency (where the same word appears more than once). This may or may not matter to you.

    It uses just one loop and a hash table, so it should not be glacial. You can tokenize any way you like; I remove punctuation and lowercase everything.

    print compare( 'Hello', 'hello' ), $/;                            # 1
    print compare( 'Hello', 'HELLO WORLD' ), $/;                      # 0.5
    print compare( 'The quick brown fox jumped over the lazy dogs.',
                   'The quick brown dogs jumped over the lazy fox.' ), $/;        # 1
    print compare( 'The quick brown fox jumped over the lazy dogs.',
                   'The quick brown dogs jumped over the lazy kangaroo.' );       # 0.888

    sub compare {
        my ( $str1, $str2 ) = @_;
        my $tok_str1 = tokenize($str1);
        my $tok_str2 = tokenize($str2);

        # swap unless @$tok_str1 contains the most tokens
        ( $tok_str1, $tok_str2 ) = ( $tok_str2, $tok_str1 ) if @$tok_str2 > @$tok_str1;

        # make a lookup hash for the smaller number of tokens in str2
        my %h;
        @h{@$tok_str2} = ();    # slice syntax is fastest

        # now scan str1 for these tokens and count
        my $found = 0;
        for my $tok (@$tok_str1) {
            $found++ if exists $h{$tok};
        }
        my $similarity = $found / @$tok_str1;
        return $similarity;
    }

    sub tokenize {
        my ($str) = @_;
        # remove punctuation stuff
        $str =~ s/[^A-Za-z0-9 ]+//g;
        # lowercase
        $str = lc $str;
        # magic whitespace split and return array ref
        return [ split ' ', $str ];
    }

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: calculate matching words/sentence
by tachyon (Chancellor) on Sep 05, 2003 at 02:59 UTC

    A better solution (if the pseudo-identical match from the previous post is not good enough) is to use Algorithm::Diff. There is a good discussion of this by merlyn at the ever-useful Web Techniques.

    Here is an implementation (we return two results, as there are two answers, i.e. the similarity of A to B and of B to A, which are not necessarily the same). Note that I am pushing into arrays because your post hinted you were interested in the matches themselves; if not, all you really need is counters. As an aside, you could apply this approach to the detection of plagiarism:

    $, = '|';
    $DEBUG = 1;

    print compare( 'Hello', 'hello' ), $/;
    print compare( 'Hello', 'HELLO WORLD' ), $/;
    print compare( 'The quick brown fox jumped over the lazy dogs.',
                   'The quick brown dogs jumped over the lazy fox.' ), $/;
    print compare( 'The quick brown fox jumped over the lazy dogs.',
                   'The quick brown fox jumped over the lazy kangaroo.' ), $/;
    print compare( 'The quick brown fox jumped over the lazy dogs.',
                   'The quick brown fox jumped, tripped and broke its neck.' ), $/;

    use Algorithm::Diff qw(traverse_sequences);

    sub compare {
        my ( $str1, $str2 ) = @_;
        print "\nCompare '$str1' <=> '$str2'\n" if $DEBUG;
        my $tok_str1 = tokenize($str1);
        my $tok_str2 = tokenize($str2);
        my ( @match, @str1, @str2 );
        traverse_sequences(
            $tok_str1,
            $tok_str2,
            {
                MATCH     => sub { push @match, $tok_str1->[ $_[0] ] },
                DISCARD_A => sub { push @str1,  $tok_str1->[ $_[0] ] },
                DISCARD_B => sub { push @str2,  $tok_str2->[ $_[1] ] },
            }
        );
        print "'@match' '@str1' '@str2'\n" if $DEBUG;
        return @match / ( @match + @str1 ), @match / ( @match + @str2 );
    }

    sub tokenize {
        my ($str) = @_;
        # remove punctuation stuff
        $str =~ s/[^A-Za-z0-9 ]+//g;
        # lowercase
        $str = lc $str;
        # return array ref
        return [ split ' ', $str ];
    }

    __DATA__
    Compare 'Hello' <=> 'hello'
    'hello' '' ''
    1|1|

    Compare 'Hello' <=> 'HELLO WORLD'
    'hello' '' 'world'
    1|0.5|

    Compare 'The quick brown fox jumped over the lazy dogs.' <=> 'The quick brown dogs jumped over the lazy fox.'
    'the quick brown jumped over the lazy' 'fox dogs' 'dogs fox'
    0.777777777777778|0.777777777777778|

    Compare 'The quick brown fox jumped over the lazy dogs.' <=> 'The quick brown fox jumped over the lazy kangaroo.'
    'the quick brown fox jumped over the lazy' 'dogs' 'kangaroo'
    0.888888888888889|0.888888888888889|

    Compare 'The quick brown fox jumped over the lazy dogs.' <=> 'The quick brown fox jumped, tripped and broke its neck.'
    'the quick brown fox jumped' 'over the lazy dogs' 'tripped and broke its neck'
    0.555555555555556|0.5|

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: calculate matching words/sentence
by benn (Vicar) on Sep 05, 2003 at 00:55 UTC
    The modules given above (and some others in the Lingua:: namespace as well) will give you the similarity in terms of 'raw words', but that's it. Attempting to derive meaning - 'the same information, told multiple ways' - is another matter altogether, as evidenced by the fact that your two example sentences do *not* say the same thing... in one, the fox jumps over the dogs, while in the other, it's the dogs who are jumping.

    Something like Lingua::LinkParser could help you here. POS analysis would allow you to spot the similarities between "The fox jumped over the dog" and "The dog was jumped over by the fox". More work than simple text matching, but better for reducing the eventual size of the database.

    Cheers, Ben.

Re: calculate matching words/sentence
by Anonymous Monk on Sep 05, 2003 at 13:10 UTC
    Uhhh... picky, but not intended as flamebait, since I (and maybe somebody else also?) have told myself -- in language imprecise enough to cause me to spend hours doing the wrong thing -- that I wanted to do some particular thing that later proved not to be what I wanted at all.

    Did the honorable seeker mean precisely "calculate the total number of matching words in a bunch of strings", or is the meaning more nearly encapsulated in the stated desire to find "strings (which) have high probabilities of being the same string", which is quite a different matter?

    Writer continues:
    "...perhaps the same information, told multiple ways. EX:
    The quick brown fox jumped over the lazy dogs.
    vs.
    The quick brown dogs jumped over the lazy fox.

    "Both strings say the same thing, but they would not generate a normal ($string1 eq $string2) match."

    NO!!! Both strings do NOT "say the same thing...", and the writer's next example - re the "very quick fox" - is indeed very similar to the "quick fox (who) jumped..." but not very similar to the "quick dogs (who) jumped...".

    And PS: I previewed and previewed and previewed as recommended in the tips... and reviewed permissible code, etc., but found no explanation of why interjections in quotes -- when enclosed in square brackets -- come up as links to nodes?
Re: calculate matching words/sentence
by halley (Prior) on Sep 05, 2003 at 14:40 UTC
    It sounds like you're working along the lines of a vector search engine. If you haven't already read it, then Building a Vector Search Engine in Perl, a perl.com article, should give you some interesting ideas.
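
    For a flavour of the idea, each sentence can be reduced to a word-count vector and compared by the cosine of the angle between vectors (a minimal sketch, not the article's code):

    use strict;
    use warnings;

    # Turn a sentence into a word-count vector (word => count)
    sub vectorize {
        my %v;
        $v{ lc $_ }++ for $_[0] =~ /([A-Za-z0-9]+)/g;
        return \%v;
    }

    # Cosine similarity: 1 means the same bag of words, 0 means nothing shared
    sub cosine {
        my ( $u, $w ) = @_;
        my ( $dot, $nu, $nw ) = ( 0, 0, 0 );
        $dot += $u->{$_} * ( $w->{$_} || 0 ) for keys %$u;
        $nu  += $_**2 for values %$u;
        $nw  += $_**2 for values %$w;
        return $nu && $nw ? $dot / sqrt( $nu * $nw ) : 0;
    }

    printf "%.3f\n", cosine(
        vectorize('The quick brown fox jumped over the lazy dogs.'),
        vectorize('The quick brown dogs jumped over the lazy fox.'),
    );    # prints 1.000 - identical bags of words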

    --
    [ e d @ h a l l e y . c c ]