Give Text::Similarity::Overlaps a try. For example:
#!/usr/bin/perl -l
use strict;
use warnings;
use Text::Similarity::Overlaps;

my %opt = (
    verbose => 1,
    Text::Similarity::NORMALIZE => 1,
);
my $mod = Text::Similarity::Overlaps->new( \%opt );
die "Could not create a Text::Similarity::Overlaps object" unless defined $mod;

# Compare a file against itself.
my $file1 = "/usr/lib/perl5/5.10.0/i386-linux-thread-multi/Encode.pm";
my $file2 = "/usr/lib/perl5/5.10.0/i386-linux-thread-multi/Encode.pm";

# getSimilarity() takes file names and reads the files itself,
# so there is no need to open or close them here.
my $score = $mod->getSimilarity( $file1, $file2 );
print "The similarity of $file1 and $file2 is: $score";
It'll take a few minutes, but it comes back with a score. In this case, the result for two identical files was:
0.999615754082613
For two completely different files:
#!/usr/bin/perl -l
use strict;
# Suppress 'uninitialized' warnings coming from the modules loaded below
# (warnings::anywhere is a separate CPAN module), then re-enable them here.
no warnings::anywhere qw(uninitialized);
use Text::Similarity::Overlaps;
use warnings qw(uninitialized);

my %opt = (
    verbose => 1,
    Text::Similarity::NORMALIZE => 1,
);
my $mod = Text::Similarity::Overlaps->new( \%opt );
die "Could not create a Text::Similarity::Overlaps object" unless defined $mod;

# Compare two unrelated files.
my $file1 = "/usr/lib/perl5/5.10.0/i386-linux-thread-multi/Encode.pm";
my $file2 = "/usr/local/lib/perl5/site_perl/5.10.0/POE.pm";

# getSimilarity() takes file names and reads the files itself,
# so there is no need to open or close them here.
my $score = $mod->getSimilarity( $file1, $file2 );
print "The similarity of the two files is: $score";
The similarity score for two completely different files came back at:
0.345969033635878
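If you're wondering why two unrelated files still score well above zero, here is a rough sketch of the general idea behind a normalized word-overlap score. To be clear, this is my own simplified illustration and not the algorithm Text::Similarity::Overlaps actually uses: the sub name word_overlap_score is made up, and I'm assuming a plain bag-of-words match combined into an F-measure. Any two Perl modules share a lot of common words (use, my, sub, return and so on), so some overlap is expected.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Simplified illustration only: NOT the algorithm used by
# Text::Similarity::Overlaps. Scores two strings by shared words.
sub word_overlap_score {
    my ( $text1, $text2 ) = @_;

    # Lowercase and split into word tokens.
    my @words1 = map { lc $_ } ( $text1 =~ /\w+/g );
    my @words2 = map { lc $_ } ( $text2 =~ /\w+/g );
    return 0 unless @words1 && @words2;

    # Count how often each word occurs in the first text.
    my %count;
    $count{$_}++ for @words1;

    # A word in the second text can match at most as many
    # times as it occurs in the first text.
    my $shared = 0;
    for my $word (@words2) {
        if ( $count{$word} ) {
            $count{$word}--;
            $shared++;
        }
    }
    return 0 unless $shared;

    # Combine precision and recall into an F-measure between 0 and 1.
    my $precision = $shared / @words2;
    my $recall    = $shared / @words1;
    return 2 * $precision * $recall / ( $precision + $recall );
}

# Identical input scores 1; partial overlap scores in between.
print word_overlap_score( "use strict; use warnings;", "use strict; use warnings;" ), "\n";    # prints 1
print word_overlap_score( "use strict; use warnings;", "use strict; print 1;" ), "\n";         # prints 0.5
```

The real module will give different numbers, so treat this only as intuition for what a score between 0 and 1 means.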