Similarity measurement

by kennedy (Initiate)
on Apr 18, 2015 at 13:47 UTC ( [id://1123881] )

kennedy has asked for the wisdom of the Perl Monks concerning the following question:

I have a question: how do we measure the overall similarity of two text files based on their word content and report the similarity as a percentage? For example: if the files are exactly the same, the similarity should be 100%; if they are completely different (i.e. no single word in one file can be found in the other), the similarity should be 0%.

Replies are listed 'Best First'.
Re: Similarity measurement
by hdb (Monsignor) on Apr 18, 2015 at 14:02 UTC

    It really depends on the objective. You could count the occurrences of each word and compare frequencies, but the frequencies could be identical even though the text files are completely different: a single 'not' can change the meaning of a text into its opposite. The 'diff' tool is useful for comparing text files, but it works mainly on lines. You could instead treat each text file as a sequence of words and apply a longest common subsequence algorithm, as in Algorithm::Diff.
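
    For instance, a minimal sketch of the Algorithm::Diff route (file names taken from @ARGV, and a made-up words() helper for splitting a file into lowercased words) might look like this:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Algorithm::Diff qw(LCS);

        # Hypothetical helper: slurp a file and split it into lowercased words.
        sub words {
            my ($file) = @_;
            open my $fh, '<', $file or die "Cannot open $file: $!";
            local $/;
            return [ grep { length } map { lc } split /\W+/, <$fh> ];
        }

        my ($w1, $w2) = map { words($_) } @ARGV[0, 1];

        # Longest common subsequence of words, scored against the average
        # word count: identical files give 100%, disjoint files give 0%.
        my @lcs = LCS($w1, $w2);
        my $avg = (@$w1 + @$w2) / 2;
        printf "Similarity: %.1f%%\n", $avg ? 100 * @lcs / $avg : 100;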

Re: Similarity measurement
by ww (Archbishop) on Apr 19, 2015 at 03:34 UTC

    As hdb told you, "(i)t really depends...."

    Here are just a few more ways that "it...depends."

    1. Are two files with variant counts of a given word "exactly the same?"
    2. Does a pair of files in which one has a certain word capitalized and the other has that word in all lower case satisfy your spec?
    3. How do you wish to categorize a pair of files in which two different forms of a particular word occur; for example, when one file has a word hyphenated (at the end of a line) and the other has the same word in a position where it is not hyphenated?

    Despite the wisdom offered by others (above), I suspect this is do-able.

    I suspect one might tackle this problem by using sorted arrays of the words in the two files and testing for matches by position (which will require the test to skip over every instance after the first of any word that appears multiple times in one file or the other, or that appears multiple times in both but a different number of times).

    Another possibility which might be worth exploring would be to use hashes to count instances of each word in each file (and perhaps cast those to a second set of sorted arrays where each array element has the word and count (key and value pairs) from the hash) and then...?

    Well, one could use a regex to compare (by position) the word (key) elements in the second set of sorted arrays and decide -- accounting for case or count or both or neither -- whether you'll accept a pair as "exactly the same" or not.

    The arithmetic for determining the similarity percentage is left as an exercise for the OP [   :-)   ] ... or, someone with better brains or more free time than I have at the moment.
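
    For what it's worth, here is a minimal sketch of that hash-of-counts arithmetic (the word_counts() helper is made up, file names come from @ARGV, and case is simply folded): identical files score 100%, files with no word in common score 0%.

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Hypothetical helper: count how often each (lowercased) word
        # appears in a file.
        sub word_counts {
            my ($file) = @_;
            open my $fh, '<', $file or die "Cannot open $file: $!";
            my %count;
            while (<$fh>) {
                $count{ lc $_ }++ for grep { length } split /\W+/;
            }
            return \%count;
        }

        my ($c1, $c2) = map { word_counts($_) } @ARGV[0, 1];

        # Overlap of the two count hashes: instances present in both files
        # divided by instances present in either file.
        my ($shared, $total) = (0, 0);
        my %all = (%$c1, %$c2);
        for my $word (keys %all) {
            my ($x, $y) = ($c1->{$word} // 0, $c2->{$word} // 0);
            $shared += $x < $y ? $x : $y;
            $total  += $x > $y ? $x : $y;
        }
        printf "Similarity: %.1f%%\n", $total ? 100 * $shared / $total : 100;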

    UPDATE: 0740 EDT 20150419:

    Found in C:\Perl\lib\pods\perlfaq4.pod
      How do I test whether two arrays or hashes are equal?
        With Perl 5.10 and later, the smart match operator can give you the
        answer with the least amount of work:
    
            use 5.010;
    
            if( @array1 ~~ @array2 ) {
                say "The arrays are the same";
            }
    
            if( %hash1 ~~ %hash2 ) {    # doesn't check values!  <- !!!
                say "The hash keys are the same";
            }
    
        ....

    Sometimes, a fresh dawn and fresh coffee are helpful in finding the obvious.

    Update2 1230 EDT 20150421: See File Similarity Concept (re [id://1123881]) for a proof of concept

Re: Similarity measurement
by Khen1950fx (Canon) on Apr 18, 2015 at 22:33 UTC
    Give Text::Similarity::Overlaps a try. For example:
    #!/usr/bin/perl -l
    use strict;
    use warnings;

    use Text::Similarity::Overlaps;

    my %opt = (
        verbose                     => 1,
        Text::Similarity::NORMALIZE => 1,
    );
    my $mod = Text::Similarity::Overlaps->new( \%opt );
    die "constructor failed" unless defined $mod;

    # getSimilarity() takes the two file names directly.
    my $file1 = "/usr/lib/perl5/5.10.0/i386-linux-thread-multi/Encode.pm";
    my $file2 = "/usr/lib/perl5/5.10.0/i386-linux-thread-multi/Encode.pm";

    my $score = $mod->getSimilarity( $file1, $file2 );
    print "The similarity of $file1 and $file2 is: $score";
    It'll take a few minutes, but it comes back with a score. In this case, the result for two files that are exactly the same was 0.999615754082613.

    For two completely different files:

    #!/usr/bin/perl -l
    use strict;
    use warnings;

    use Text::Similarity::Overlaps;

    my %opt = (
        verbose                     => 1,
        Text::Similarity::NORMALIZE => 1,
    );
    my $mod = Text::Similarity::Overlaps->new( \%opt );
    die "constructor failed" unless defined $mod;

    my $file1 = "/usr/lib/perl5/5.10.0/i386-linux-thread-multi/Encode.pm";
    my $file2 = "/usr/local/lib/perl5/site_perl/5.10.0/POE.pm";

    my $score = $mod->getSimilarity( $file1, $file2 );
    print "The similarity of the two files is: $score";
    The similarity score for two completely different files came back at 0.345969033635878.
Re: Similarity measurement
by Marshall (Canon) on Apr 18, 2015 at 23:48 UTC
    This is a very complicated problem. Very complicated.

    One technique is Text::Levenshtein, which calculates the Levenshtein edit distance between two strings.

    This is not easy even for a human to do. There are so many differences in language and syntax that it is almost impossible for a computer to do well.
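
    To make that concrete, here is a minimal sketch of the edit-distance idea (file names assumed on the command line; note that computing a Levenshtein distance over whole files is quadratic in their length, so this is only practical for short texts):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Text::Levenshtein qw(distance);

        # Slurp both files whole.
        my ($s1, $s2) = map {
            open my $fh, '<', $_ or die "Cannot open $_: $!";
            local $/;
            scalar <$fh>;
        } @ARGV[0, 1];

        # Edit distance relative to the longer text: 0 edits => 100%,
        # everything changed => 0%.
        my $max  = length($s1) > length($s2) ? length($s1) : length($s2);
        my $dist = distance($s1, $s2);
        printf "Similarity: %.1f%%\n", $max ? 100 * (1 - $dist / $max) : 100;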

Re: Similarity measurement
by FreeBeerReekingMonk (Deacon) on Apr 19, 2015 at 06:16 UTC

    comm comes to mind, but this is a Perl forum, so here is my shot at it, although it is flawed: permuted lines do not get registered as a difference, nor do extra repeated lines:

    perl -ne '$lines++;$common++ if ($seen{$_} .= @ARGV) =~ /10$/; END{printf("%.2f",$common/$lines*200)}' file1.txt file2.txt

    Not sure who to give credit to... here is the source: http://www.cyberciti.biz/faq/command-to-display-lines-common-in-files/
    It extends to 3 files if you use "=~/210$/".
    Another way would be to parse the output of diff with perl, but I'm not getting the correct syntax for that. Edit: Thanks! Here is my own spin:

    perl -ne '$lines++;$common++ if ($seen{$_} .= @ARGV) eq "10"; END{printf("%.2f",$common/$lines*200)}' file1.txt file2.txt
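
    In case the mechanics are not obvious, here is the same logic written out as a small commented script (a sketch, not a drop-in replacement):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # While the first file is being read, @ARGV still holds one remaining
        # name, so "1" is appended; while the second is read it holds none,
        # so "0" is appended. A line seen exactly once in each file therefore
        # ends up with the string "10".
        my %seen;
        my ( $lines, $common ) = ( 0, 0 );
        while (<>) {
            $lines++;
            $common++ if ( $seen{$_} .= @ARGV ) eq '10';
        }

        # Each common line is counted once but exists in both files,
        # hence the factor of 200 instead of 100.
        printf "%.2f\n", $lines ? $common / $lines * 200 : 0;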

      Not sure who to give credit to... here is the source

      Credit goes to mu, the source says.

      Cheers, Sören

      Créateur des bugs mobiles - let loose once, run everywhere.
      (hooked on the Perl Programming language)

Re: Similarity measurement
by oiskuu (Hermit) on Apr 20, 2015 at 05:39 UTC

    The question is vague as to the intent and scope of the problem. The best I can surmise is that this might relate to plagiarism detection. How literally do you mean "based on word content"?

    One rudimentary approach is to try compressing the text units first separately, then together (as a "solid archive"), for an estimate of entropy.

    #! /usr/bin/perl
    use strict;
    use warnings;

    # An in-memory PerlIO layer that simply counts the bytes written to it,
    # so the compressed output size can be measured without touching disk.
    package PerlIO::via::Count {
        sub PUSHED { bless [], shift }
        sub WRITE  { $_[0][0] += length $_[1]; length $_[1] }
        sub TELL   { $_[0][0] }
    }

    use IO::Compress::Lzma;

    sub zce {
        # $s compresses all inputs as one solid stream; $t starts a fresh
        # compression stream for each input (newStream below).
        my ($s, $t) = map {
            open my $cnt, ">:via(Count)", \my $null;
            [ $cnt, new IO::Compress::Lzma $cnt ]
        } 0, 0;
        for my $file (@_) {
            local $/ = \8192;
            open my $fh, '<', $file;
            while (<$fh>) {
                $s->[1]->write($_);
                $t->[1]->write($_);
            }
            $t->[1]->newStream;
        }
        $_ = ( $_->[1]->close, $_->[0]->tell ) for ($s, $t);
        printf "%3.0f%% %s\n", 100.0 * (1 - $s/$t) / (1 - 1/@_), join ' ', @_;
    }

    (@ARGV = grep -f, @ARGV) > 1 or exit;
    # zce(@ARGV[$_-1,$_]) for 1..$#ARGV;
    zce(@ARGV);
    Just something to toy with. There might be a better way to get at the compressed size.
