comment on

Sorry tachyon. Given the OP's statement about having "thousands of sentences in a DB", I can't agree with you. The problem is that it would be necessary to compare every new line against every old line each time. Even where there is absolutely no correlation between them at all.

If the time taken to match any new sentence against all the existing is going to be a reduced to a reasonable level, then you need to find an algorithm that pre-computes a comparison value for the existing strings.

Or, at least allows the thousands to be reduced to a managable number before applying the comparisons. Generating an inverted index is trivial and can easily be done once and stored in the database, only requiring update when new reference sentences are added. Storing this in memory should be manageable even for thousands of sentences.

The first stage of comparing a new sentence then becomes a rapid lookup of the words it contains against the index and produces a short-list of candidate DB sentences that have at least some correlation. These can be sorted by the strength of that correlation and further analysis performed.

As a crude example.

#! perl -slw
use strict;
use Data::Dumper;

# Read an array of the DB sentences and index them.
my @lines;
my %lookup;
while( my $line = <DATA> ) {
    push @{ $lookup{ $_ } }, $. for split ' ', $line;
    $lines[ $. ] = $line;
}
close DATA;
chomp @lines;


# Process each newline by
while( 1 ) {
    printf "\nEnter a sentence: "; my $newline = <STDIN>;
    last unless $newline;
    my @words = split ' ', $newline;

    # building a hash with the sentence (line) number as key
    # and the number of words it has in common
    # with the new sentence as the value

    my %occurs;
    $occurs{ $_ }++ for map{
        my $aref = $lookup{ $_ };
        defined $aref
        ? @$aref
        : ()
    } @words;

    print "No common words found" and next unless keys %occurs;

    # Process (display) only thise lines that have some correlation
    # sort by the strength of the correlation.

    print "Line $_: '$lines[ $_ ]' has $occurs{ $_ } words in common",
        for sort{ $occurs{ $b } <=> $occurs{ $a } } keys %occurs
}
__DATA__
The quick brown fox jumps over the lazy dog
The fox jumps the dog
The dog lazily watched as the fox quickly jumped over him
The quick red and brown foxes jumped easily over the lazy dog
The lazy fox was caught as it tried to jump over the quick dogs.
The dog lazed as the fox, quick and brown, jumped over.
The smart dog foxed the fox by lazing in the drop zone until he jumped
+.
[download]

Example output ('scuse the mess that autowrap makes of this!).

P:\test>289106

Enter a sentence: the quick brown fox
Line 1: 'The quick brown fox jumps over the lazy dog' has 4 words in c
+ommon
Line 4: 'The quick red and brown foxes jumped easily over the lazy dog
+' has 3 words in common
Line 7: 'The smart dog foxed the fox by lazing in the drop zone until 
+he jumped.' has 3 words in common
Line 5: 'The lazy fox was caught as it tried to jump over the quick do
+gs.' has 3 words in common
Line 6: 'The dog lazed as the fox, quick and brown, jumped over.' has 
+2 words in common
Line 3: 'The dog lazily watched as the fox quickly jumped over him' ha
+s 2 words in common
Line 2: 'The fox jumps the dog' has 2 words in common

Enter a sentence: the quick brown fox jumps over the lazy dog
Line 1: 'The quick brown fox jumps over the lazy dog' has 9 words in c
+ommon
Line 4: 'The quick red and brown foxes jumped easily over the lazy dog
+' has 7 words in common
Line 7: 'The smart dog foxed the fox by lazing in the drop zone until 
+he jumped.' has 6 words in common
Line 5: 'The lazy fox was caught as it tried to jump over the quick do
+gs.' has 6 words in common
Line 3: 'The dog lazily watched as the fox quickly jumped over him' ha
+s 5 words in common
Line 2: 'The fox jumps the dog' has 5 words in common
Line 6: 'The dog lazed as the fox, quick and brown, jumped over.' has 
+4 words in common

Enter a sentence: quick foxes jump over lazy dogs
Line 4: 'The quick red and brown foxes jumped easily over the lazy dog
+' has 4 words in common
Line 5: 'The lazy fox was caught as it tried to jump over the quick do
+gs.' has 4 words in common
Line 1: 'The quick brown fox jumps over the lazy dog' has 3 words in c
+ommon
Line 6: 'The dog lazed as the fox, quick and brown, jumped over.' has 
+1 words in common
Line 3: 'The dog lazily watched as the fox quickly jumped over him' ha
+s 1 words in common

Enter a sentence: Most volcanos erupt mulberry jam sandwiches under no
+rmal pressure
No common words found

Enter a sentence: ^Z
[download]

This suffers many flaws, the lack of stemming is the major one but I played with Lingua::Stem::En and it doesn't do what I would like it to do.

This could be improved in many ways, adding some measure of the positional correlation is one idea, but without a clearer picture of the OPs needs, it's just a guessing game as to what might fit.

I have this nagging idea that there is a method of pre-calculating a single number from each sentence that would give a numerical measure of the words(stems) in the sentence (measured against a fixed size dictionary) and their position, but I not sure if it is a vague recollection of something I have seen somewhere or just a fantom idea that hasn't made it to fruition yet.

The idea that you could take the sentence, perform a calculation on it (independant of the reference strings, but possibly in reference to a dictionary of words or stems derived from them) and then use that to extract the 10 or twenty sentences from the DB with the numerically closest signature is appealling. I just can't wrap my head around how it might be done.

That said, once you have short-listed the total set, using Algorithm::Diff or Text::Levenshtein or others to make a final arbitration would be a useful next step in the process, depending upon the OPs reqs.

Examine what is said, not who speaks.

"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
If I understand your problem, I can solve it! Of course, the same can be said for you.

In reply to Re: Re: Re: calculate matching words/sentence by BrowserUk
in thread calculate matching words/sentence by anocelot

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


The stupid question is the question not asked
	PerlMonks