A better solution (if pseudo identical is not good enough as in the previous post) is to use Algorithm::Diff There is a good discussion of this from merlyn at The ever useful Web Techniques
Here is an implementation (we return 2 results as there are two answers ie similarity of A to B and B to A - not necessarily the same). Note I am pushing into arrays as your post hinted you were interested in the matches, if not then all you really need is counters. As an aside you could apply this approach to the detection of plagurism:
$, = '|';
$DEBUG = 1;
print compare( 'Hello', 'hello' ), $/;
print compare( 'Hello', 'HELLO WORLD' ), $/;
print compare( 'The quick brown fox jumped over the lazy dogs.',
'The quick brown dogs jumped over the lazy fox.' ), $/;
print compare( 'The quick brown fox jumped over the lazy dogs.',
'The quick brown fox jumped over the lazy kangaroo.' ),
+ $/;
print compare( 'The quick brown fox jumped over the lazy dogs.',
'The quick brown fox jumped, tripped and broke its neck
+.' ), $/;
use Algorithm::Diff qw(traverse_sequences);
sub compare {
my ( $str1, $str2 ) = @_;
print "\nCompare '$str1' <=> '$str2'\n" if $DEBUG;
my $tok_str1 = tokenize($str1);
my $tok_str2 = tokenize($str2);
my (@match,@str1, @str2);
traverse_sequences( $tok_str1, $tok_str2, {
MATCH => sub { push @match, $tok_str1->[$_[0]] },
DISCARD_A => sub { push @str1, $tok_str1->[$_[0]] },
DISCARD_B => sub { push @str2, $tok_str2->[$_[1]] },
});
print "'@match' '@str1' '@str2'\n" if $DEBUG;
return @match/(@match+@str1), @match/(@match+@str2);
}
sub tokenize {
my ($str) = @_;
# remove punctuation stuff
$str =~ s/[^A-Za-z0-9 ]+//g;
# lowercase
$str = lc $str;
# return array ref
return [split ' ', $str];
}
__DATA__
Compare 'Hello' <=> 'hello'
'hello' '' ''
1|1|
Compare 'Hello' <=> 'HELLO WORLD'
'hello' '' 'world'
1|0.5|
Compare 'The quick brown fox jumped over the lazy dogs.' <=> 'The quic
+k brown dogs jumped over the lazy fox.'
'the quick brown jumped over the lazy' 'fox dogs' 'dogs fox'
0.777777777777778|0.777777777777778|
Compare 'The quick brown fox jumped over the lazy dogs.' <=> 'The quic
+k brown fox jumped over the lazy kangaroo.'
'the quick brown fox jumped over the lazy' 'dogs' 'kangaroo'
0.888888888888889|0.888888888888889|
Compare 'The quick brown fox jumped over the lazy dogs.' <=> 'The quic
+k brown fox jumped, tripped and broke its neck.'
'the quick brown fox jumped' 'over the lazy dogs' 'tripped and broke i
+ts neck'
0.555555555555556|0.5|
cheers
tachyon
s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
-
Are you posting in the right place? Check out Where do I post X? to know for sure.
-
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big>
<blockquote> <br /> <dd>
<dl> <dt> <em> <font>
<h1> <h2> <h3> <h4>
<h5> <h6> <hr /> <i>
<li> <nbsp> <ol> <p>
<small> <strike> <strong>
<sub> <sup> <table>
<td> <th> <tr> <tt>
<u> <ul>
-
Snippets of code should be wrapped in
<code> tags not
<pre> tags. In fact, <pre>
tags should generally be avoided. If they must
be used, extreme care should be
taken to ensure that their contents do not
have long lines (<70 chars), in order to prevent
horizontal scrolling (and possible janitor
intervention).
-
Want more info? How to link
or How to display code and escape characters
are good places to start.