### comment on

 Need Help??

I have been meaning to release this for a while. It does pretty much exactly what you want:

```package Algorithm::HowSimilar;

use 5.006;
use strict;
use warnings;
use Algorithm::Diff qw(traverse_sequences);
use Carp;
require Exporter;
use vars qw( @ISA @EXPORT_OK \$VERSION );
our @ISA = qw(Exporter);
@EXPORT_OK = qw( compare );
\$VERSION = '0.01';

sub compare {
my \$is_array = ref \$_[0] eq 'ARRAY' ? 1 : 0;
my \$i = 0;
if ( \$is_array ) {
my \$seq1 = \$_[0];
my \$seq2 = \$_[1];
my (@match,@d1, @d2) = ((),(),());
traverse_sequences( \$seq1, \$seq2, {
MATCH     => sub { push @match, \$seq1->[\$_[0]] },
DISCARD_A => sub { push @d1, \$seq1->[\$_[0]] },
DISCARD_B => sub { push @d2, \$seq2->[\$_[1]] },
});
my \$m1 = @match/(@match+@d1);
my \$m2 = @match/(@match+@d2);
my \$mav = (\$m1+\$m2)/2;
return \$mav, \$m1, \$m2, \@match, \@d1, \@d2;
}
else {
my ( \$seq1, \$seq2 );
if ( \$_[2] and ref \$_[2] eq 'CODE' ) {
local \$_ = \$_[0]; \$seq1 = &{\$_[2]};
local \$_ = \$_[1]; \$seq2 = &{\$_[2]};
carp "Did not get an array ref from callback!\n"
unless ref \$seq1 eq 'ARRAY' and ref \$seq2 eq 'ARRAY';
}
else {
\$seq1 = _tokenize(\$_[0]);
\$seq2 = _tokenize(\$_[1]);
}
my (\$match,\$d1, \$d2) = ('','','');
traverse_sequences( \$seq1, \$seq2, {
MATCH     => sub { \$match .= \$seq1->[\$_[0]] },
DISCARD_A => sub { \$d1 .= \$seq1->[\$_[0]] },
DISCARD_B => sub { \$d2 .= \$seq2->[\$_[1]] },
});
my \$m1 = length(\$match)/(length(\$match)+length(\$d1));
my \$m2 = length(\$match)/(length(\$match)+length(\$d2));
my \$mav = (\$m1+\$m2)/2;
return \$mav, \$m1, \$m2, \$match, \$d1, \$d2;
}

}

sub _tokenize { return [split //, \$_[0]] }

1;
__END__

Algorithm::HowSimilar - Perl extension for quantifying similarites bet
+ween things

use Algorithm::HowSimilar qw(compare);
@res = compare( \$str1, \$str2, sub { s/\s+//g; [split //] } );
@res = compare( \@ary1, \@ary2 );

This module leverages Algorithm::Diff to let you compare the degree of
+ sameness
of array or strings. It returns a result set that defines exactly how
+similar
these things are.

=head2 compare( ARG1, ARG2, OPTIONAL_CALLBACK )

You can call compare with either two strings compare( \$str1, \$str2 ):

my ( \$av_similarity,
\$sim_str1_to_str2,
\$sim_str2_to_str1,
\$matches,
\$in_str1_but_not_str2,
\$in_str2_but_not_str1
) = compare( 'this is a string-a', 'this is a string bbb' );

Note that the mathematical similarities of one string to another will
+be
different unless the strings have the same length. The first result re
+turned
is the average similarity. Totally dissimilar strings will return 0. I
+dentical
strings will return 1. The degree of similarity therefore ranges from
+0-1 and
is reported as the biggest float your OS/Perl can manage.

You can also compare two array refs compare( \@ary1, \@ary2 ):

my ( \$av_similarity,
\$sim_ary1_to_ary2,
\$sim_ary2_to_ary1,
\$ref_ary_matches,
\$ref_ary_in_ary1_but_not_ary2,
\$ref_ary_in_ary2_but_not_ary1
) = compare( [ 1,2,3,4 ], [ 3,4,5,6,7 ] );

When called with two string you can specify an optional callback that
+changes
the default tokenization of strings (a simple split on null) to whatev
+er you
need. The strings are passed to you callback in \$_ and the sub is expe
+cted to
return an array ref. So for example to ignore all
whitespace you could:

@res = compare( 'this is a string',
'this is a string ',
sub { s/\s+//g; [split //] }
);

You already get the intersection of the strings or arrays. You can get
+ the
union like this:

@res = compare( \$str1, \$str2 );
\$intersection = \$res[3];
\$union = \$res[3].\$res[4].\$res[5];
@res = compare( \@ary1, \@ary2 );
@intersection = @{\$res[3]};
@union = ( @{\$res[3]}, @{\$res[4]}, @{\$res[5]} );

None by default.

Dr James Freeman <james.freeman@id3.org.uk>

L<perl>.

=cut

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"\$'\$`\$\"\$\&"&ee&&y&srve&&d&&print

In reply to Re: Module for comparing text by tachyon
in thread Module for comparing text by Anonymous Monk

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

• Are you posting in the right place? Check out Where do I post X? to know for sure.
• Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
• Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
• Want more info? How to link or or How to display code and escape characters are good places to start.

Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others scrutinizing the Monastery: (3)
As of 2021-12-03 12:58 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
R or B?

Results (29 votes). Check out past polls.

Notices?