comment on

A better solution (if pseudo identical is not good enough as in the previous post) is to use Algorithm::Diff There is a good discussion of this from merlyn at The ever useful Web Techniques

Here is an implementation (we return 2 results as there are two answers ie similarity of A to B and B to A - not necessarily the same). Note I am pushing into arrays as your post hinted you were interested in the matches, if not then all you really need is counters. As an aside you could apply this approach to the detection of plagurism:

$, = '|';
$DEBUG = 1;

print compare( 'Hello', 'hello' ), $/;
print compare( 'Hello', 'HELLO WORLD' ), $/;
print compare( 'The quick brown fox jumped over the lazy dogs.', 
               'The quick brown dogs jumped over the lazy fox.' ), $/;
print compare( 'The quick brown fox jumped over the lazy dogs.', 
               'The quick brown fox jumped over the lazy kangaroo.' ),
+ $/;
print compare( 'The quick brown fox jumped over the lazy dogs.', 
               'The quick brown fox jumped, tripped and broke its neck
+.' ), $/;


use Algorithm::Diff qw(traverse_sequences);

sub compare {
    my ( $str1, $str2 ) = @_;
    print "\nCompare '$str1' <=> '$str2'\n" if $DEBUG;
    my $tok_str1 = tokenize($str1);
    my $tok_str2 = tokenize($str2);
    my (@match,@str1, @str2);
    traverse_sequences( $tok_str1, $tok_str2, {
        MATCH => sub { push @match, $tok_str1->[$_[0]] },
        DISCARD_A => sub { push @str1, $tok_str1->[$_[0]] },
        DISCARD_B => sub { push @str2, $tok_str2->[$_[1]] },
    });
    print "'@match' '@str1' '@str2'\n" if $DEBUG;
  return @match/(@match+@str1), @match/(@match+@str2);
}



sub tokenize {
    my ($str) = @_;
    # remove punctuation stuff
    $str =~ s/[^A-Za-z0-9 ]+//g;
    # lowercase
    $str = lc $str;
    # return array ref
  return [split ' ', $str];
}

__DATA__
Compare 'Hello' <=> 'hello'
'hello' '' ''
1|1|

Compare 'Hello' <=> 'HELLO WORLD'
'hello' '' 'world'
1|0.5|

Compare 'The quick brown fox jumped over the lazy dogs.' <=> 'The quic
+k brown dogs jumped over the lazy fox.'
'the quick brown jumped over the lazy' 'fox dogs' 'dogs fox'
0.777777777777778|0.777777777777778|

Compare 'The quick brown fox jumped over the lazy dogs.' <=> 'The quic
+k brown fox jumped over the lazy kangaroo.'
'the quick brown fox jumped over the lazy' 'dogs' 'kangaroo'
0.888888888888889|0.888888888888889|

Compare 'The quick brown fox jumped over the lazy dogs.' <=> 'The quic
+k brown fox jumped, tripped and broke its neck.'
'the quick brown fox jumped' 'over the lazy dogs' 'tripped and broke i
+ts neck'
0.555555555555556|0.5|
[download]

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

In reply to Re: calculate matching words/sentence by tachyon
in thread calculate matching words/sentence by anocelot

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


Come for the quick hacks, stay for the epiphanies.
	PerlMonks