
OK, you may need to adjust the definition of $total, as noted in the comments just above it; otherwise the following program works as I understood your problem...

#! /usr/local/bin/perl -w

use strict ;

my $stopfile = 'stopwords';
my %stoplist;

#  fill stop word list assuming each word is on one line
open my $stop_fh, '<', $stopfile
  or die "cannot open $stopfile: $!\n";

while ( defined (my $stop = <$stop_fh>) )
{
  chomp $stop;
  #  input words are lower cased in line2words(), so lower case here too
  $stoplist{lc $stop} = 1;
}

close $stop_fh or die "cannot close $stopfile: $!\n";

#  FIRST file contains the words to compare against,
#  get the target word list
#
my @target = @{ filter( \%stoplist , [ shift @ARGV ] ) };

#  rest of the files contain words which we want
#  to compare against the target list
#
my @words = @{ filter( \%stoplist , \@ARGV ) };

#  adjust as desired, as I fail to see what @D1 is (in the OP) and
#  why $total needs to be twice the size of @D1
#
#  BELOW IS MY NOTION OF $total
#
my $total = scalar @target + scalar @words;

#  guard against division by zero when no words survive the filtering
my $similarity =
  $total
  ? 2 * ( scalar @{ intersect( \@target , \@words ) }
          / $total
        )
  : 0;

#  display similarity to 4 decimal places
printf "\nsimilarity is: %.4f\n\n", $similarity;


#  find intersection of two arrays: 1st contains all the interesting
#  values, 2nd both interesting & uninteresting
sub intersect
{
  my ($ref , $misc) = @_;

  #  index the interesting values so each lookup is a single hash access
  my %wanted = map { $_ => 1 } @{$ref};

  my %intersection;

  foreach my $word ( @{$misc} )
  {
    $intersection{$word} = 1 if $wanted{$word};
  }

  return [ keys %intersection ];
}


#  given a stop word hash & file name array (consisting of input word
#  lists), return the list of words that are not stop words
sub filter
{
  my ($stop , $files) = @_;

  my %filtered;

  foreach my $file ( @{$files} )
  {
    open my $fh , '<' , $file
      or die "cannot open $file to read: $!\n";

    while ( defined (my $line = <$fh>) )
    {
      foreach my $word (@{ line2words( $line ) })
      {
        next if $stop->{$word};
        $filtered{$word} = 1;
      }
    }

    close $fh or die "cannot close $file: $!\n";
  }

  return [ keys %filtered ];
}

#  return words, lower cased, from a given line
sub line2words
{
  my $line = $_[0];

  return
    [ map { lc $_ }
        grep { $_ ne '' }
          split /\W+/ , $line
    ];
}

Update: Added the missing die for a failed close of the stop word file.
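
For anyone who wants to check the arithmetic without creating any files: with my notion of $total, the number computed above is 2 * (count of common words) / (size of first list + size of second list), which is the Dice coefficient. Below is a minimal sketch on in-memory lists. The sub name dice_similarity and the sample words are made up for illustration, and scoring two empty lists as 0 is my assumption, not something the OP specified.

```perl
#!/usr/bin/perl
use strict;
use warnings;

#  Dice-style similarity: 2 * common / (size_first + size_second).
#  Both lists are treated as sets, so duplicates are collapsed first.
#  (dice_similarity is a hypothetical name, not part of the program above)
sub dice_similarity
{
  my ($first , $second) = @_;

  my %in_first  = map { $_ => 1 } @{$first};
  my %in_second = map { $_ => 1 } @{$second};

  my $total = keys(%in_first) + keys(%in_second);

  #  assumption: two empty lists get a score of 0
  return 0 if $total == 0;

  #  grep in scalar context returns the number of matching words
  my $common = grep { $in_second{$_} } keys %in_first;

  return 2 * $common / $total;
}

#  2 common words (perl, code); list sizes 3 and 4; so 2 * 2 / 7
printf "similarity is: %.4f\n",
  dice_similarity( [qw(perl monk code)] , [qw(perl code camel llama)] );
```

Identical lists score 1 and lists with nothing in common score 0, so it is easy to sanity check against your own data before changing $total.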

In reply to Re: Re: Re: Calculating "similarity" by parv
in thread Calculating "similarity" by Anonymous Monk
