Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical

Calculating "similarity"

by Anonymous Monk
on Mar 02, 2003 at 01:02 UTC ( #239809=perlquestion: print w/replies, xml ) Need Help??

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

HI, I have some problems in doing the a perl program about similarity. See if anyone helps. thanks

Similiarity contains a formula to calculate liks this:

Similiarity = 2 x ( intersection/ total)

I tried to solve the problem, however i'm stuck in the middle. Since when i write the program, i need to run a stoplist in the program and fliter some words out from the stoplist before calculating the rest of the words in the files. The main point is to use one files and compare with the rest of the files.

However, when i was doing it, i do not know how to convert some command from hash to array or vice versa, therefore, i am stuck.

here's my script, i hope if anyone can help me.:

#! /usr/local/bin/perl -w use strict ; my $stopfile = 'stopwords'; my $base= shift @ARGV; my @files = @ARGV; my %stopwords=(); my %basefilterwords=(); my %filterwords=(); my @basewords; my @words; open STOP, "<$stopfile"; while (my $stopword =<STOP>) { chomp $stopword; $stopwords {$stopword} =1; } close STOP; open BASETEXT, "<$base"; while (my $line =<BASETEXT> ) { my @basewords = split /\W/, $line ; foreach my $baseword (@basewords) { if ($baseword ne '') { $baseword = lc $baseword ; } if ($stopwords{$baseword}) { } else { $basefilterwords{$baseword}=1; } } close BASETEXT; foreach my $file ( @ARGV ) { open TEXT, "<$file"; while (my $line =<TEXT> ) { my @words = split /\W/, $line ; foreach my $word (@words) { if ($word ne '') { $word = lc $word ; } if ($stopwords{$word}) { } else { $filterwords{$word}=1; } } close TEXT; } }
I just did until here, starting to fliter the words, then i am stuck in here since i do not know how to change the cammand into array.. here it is:
@D1 = map lc $_, $D1 =~ /(\w+)/g ; my @D2 = map lc $_, $D2 =~ /(\w+)/g ; my %D2 = () ; @D2{@D2} = (1) x scalar @D2 ; my $total = scalar @D1 + scalar @D1 ; my $intersection = 0 ; # count the number of words in common foreach my $word ( @D1 ) { ++$intersection if $D2{$word} ; } my $similarity = 2 * ( $intersection/$total ) ; print "\n$similarity\n\n" ;
I am sure that this part needs to have some changes, however, I really do not understand. I hope there has people can help me to solve it thanks.

Replies are listed 'Best First'.
Re: Calculating "similarity"
by parv (Vicar) on Mar 02, 2003 at 03:18 UTC

    ( Correction: the "Vector Space..." article is not actually directly related to the OP's problem but does deal w/ the "closeness of words" concept. In that regard, below is the corrected version of my earlier reply.

    UPDATE (Mar 3 2003): If anybody is still interested, i have rounded up some relevant things under "string munging". )

    See Vector Space Search Engine article which does some similar things.

    The String::Similarity and String::Approx modules may also be of interest. Below are the descriptions (from FreeBSD ports)...

    The "String::Similarity" calculates the similarity index of its two arguments. A value of '0' means that the strings are entirely different. A value of '1' means that the strings are identical. Everything else lies between 0 and 1 and describes the amount of similarity between the strings.

    String::Approx lets you match and substitute strings approximately. With this you can emulate errors: typing errors, spelling errors, closely related vocabularies (colour color), genetic mutations (GAG ACT), abbreviations (McScot, MacScot).

Re: Calculating "similarity"
by Hofmator (Curate) on Mar 02, 2003 at 19:54 UTC
    OK, let me rephrase your problem to see if I understood you correctly. I assume that you want to compare several (other) files to one base file. You want to calculate a similarity rating between those files based on the number of words that are same (=intersection) and the total number of words (=total) in each of the other files - this last point I'm not sure about, it could also be total number of words in base and other file together.

    Does this do what you want?

    #!/usr/local/bin/perl -w use strict ; my $stopfile = 'stopwords'; my $base= shift @ARGV; # read in stopwords open STOP, "<$stopfile" or die $!; chomp( my @stop= <STOP> ); close STOP; my %stopwords=(); # add the empty string '' to the stopwords as well @stopwords{@stop,''} = (); # read in basefile my %base_filterwords=(); open BASETEXT, "<$base" or die $!; while ( <BASETEXT> ) { @base_filterwords{ map { my $l = lc; exists $stopwords{$l}?():$l } +split /\W+/ } = (); } close BASETEXT; # read in other files my %other_filterwords=(); while ( <> ) { @other_filterwords{ map { my $l = lc; exists $stopwords{$l}?():$l } + split /\W+/ } = (); } continue { if (eof) { my $total = scalar keys %other_filterwords; my $intersect = 0; for my $key (keys %other_filterwords) { $intersect++ if exists $base_filterwords{$key}; } my $sim = 2*$intersect/$total; print "Similarity between $base and $ARGV = $sim\n"; print "\t@{[keys %other_filterwords]}\n"; print "\t@{[keys %base_filterwords]}\n"; # reset other filter words %other_filterwords = (); } }

    Update Fixed some small bugs in the code ...

    -- Hofmator

Re: Calculating "similarity"
by parv (Vicar) on Mar 02, 2003 at 04:50 UTC

    Could you please clarify...

    • what are the differences/similarities among @words & @basewords and %stopwords & %filterwords as each pair seem to have duplicates?
    • why is the $total set to twice the size of @D1 (near the end of OP)?

    ...only if you had commented the code... :(

      Basewords are the words shown in the file 1 if i am comparing five files.. and words are the rest of the words in the other 4 files. stopwords are the words shown in a stop list. indeed, i think i did it wrong and thats why i can't finish it.. can you be able to solve it?? i need to find the similarity of a file comparing with the rest of the four files...(if there are 5 files in total) . and the similarity formula are the formula used to find similiarity.. its a fixed formula.. thanks

        OK, you may need to adjust the definition of $total as indicated somewhere; otherwise following program works as i understood your problem...

        #! /usr/local/bin/perl -w use strict ; my $stopfile = 'stopwords'; my %stoplist; # fill stop word list assuming each word is on one line open STOP, "<$stopfile" or die "cannot open $stopfile: $!\n"; while ( defined (my $stop = <STOP>) ) { chomp $stop; $stoplist{$stop} = 1; } close STOP or die "cannot close $stopfile: $!\n"; # FIRST file contains the words to compare against, # get the target word list # my @target = @{ filter( \%stoplist , [ shift @ARGV ] ) }; # rest of the files contain words which we want # to compare against the target list # my @words = @{ filter( \%stoplist , \@ARGV ) }; # adjust as desired as i fail to see what is @D1 (in OP) and # why $total needs to be the twice the size of @D1 # # BELOW IS MY NOTION OF $total # my $total = scalar @target + scalar @words; my $similarity = 2 * ( scalar @{ intersect( \@target , \@words ) } / $total ); # display similarity upto 4 decimal places printf "\nsimilarity is: %0.4g\n\n", $similarity; # find intersection of two arrays: 1st contains all the interesting v +alues, # 2d both interesting & uninteresting sub intersect { my ($ref , $misc) = @_; my %intersection; foreach my $misc ( @{$misc} ) { foreach my $ref ( @{$ref} ) { next if $misc ne $ref; $intersection{$ref} = 1; } } return [ keys %intersection ]; } # given a stop word hash & file name array (consisting of input word +list), # return the word list that are not stop words sub filter { my ($stop , $files) = @_; my %filtered; foreach my $file ( @{$files} ) { open FH , "<$file" or die "cannot open $file to read: $!\n"; while ( defined (my $line = <FH>) ) { foreach my $word (@{ line2words( $line ) }) { next if $stop->{$word}; $filtered{$word} = 1; } } close FH or die "cannot close $file: $!\n"; } return [ keys %filtered ]; } # return words, lower cased, from a given line sub line2words { my $line = $_[0]; return [ map { lc $_ } grep { $_ ne '' } split /\W+/ , $line ]; }

        Update: Add missing die if cannot close STOP.

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://239809]
Approved by dws
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others meditating upon the Monastery: (3)
As of 2021-10-19 08:28 GMT
Find Nodes?
    Voting Booth?
    My first memorable Perl project was:

    Results (76 votes). Check out past polls.