PerlMonks  

Hi, I am having some problems writing a Perl program that calculates similarity. I would appreciate any help. Thanks.

Similarity is calculated with a formula like this:

Similarity = 2 x (intersection / total)

I tried to solve the problem, but I am stuck in the middle. The program needs to read a stoplist and filter the stoplist words out of each file before doing the calculation on the remaining words. The main point is to take one file and compare it with each of the other files.
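As a minimal sketch of the filtering step described above: the stop list here is a hypothetical in-memory hash (the real script reads it from the 'stopwords' file), and `filter_words` is an illustrative name, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical stop list; the real script loads it from the 'stopwords' file.
my %stopwords = map { $_ => 1 } qw(the a of and);

# Return the distinct lower-cased words of $text that are not stop words.
sub filter_words {
    my ($text, $stop) = @_;
    my %seen;
    for my $word (map { lc $_ } $text =~ /(\w+)/g) {
        next if $stop->{$word};    # skip anything on the stop list
        $seen{$word} = 1;          # a hash key keeps each word only once
    }
    return sort keys %seen;
}

my @kept = filter_words('The cat and the hat', \%stopwords);
print "@kept\n";    # prints "cat hat"
```

Matching on `/(\w+)/g` instead of splitting on `/\W/` avoids the empty strings that split produces between adjacent non-word characters.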

However, I do not know how to convert a hash into an array or vice versa, so I am stuck.
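For reference, a small sketch of the two conversions in question, using made-up sample words. A hash slice turns an array into hash keys, and `keys` turns the hash back into an array.

```perl
use strict;
use warnings;

my @words = qw(apple pear apple fig);    # sample data, not from the script

# Array -> hash: a hash slice makes every word a key (duplicates collapse).
my %seen;
@seen{@words} = (1) x @words;

# Hash -> array: keys() returns the distinct words, in no particular order,
# so they are sorted here to get a stable result.
my @unique = sort keys %seen;

# The hash then gives a cheap membership test for another word list.
my @query  = qw(fig kiwi);
my @common = grep { $seen{$_} } @query;

print scalar(@unique), " unique words: @unique\n";    # 3 unique words: apple fig pear
print "in common: @common\n";                          # in common: fig
```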

Here is my script; I hope someone can help me:

#!/usr/local/bin/perl -w
use strict;

my $stopfile = 'stopwords';
my $base     = shift @ARGV;
my @files    = @ARGV;

my %stopwords       = ();
my %basefilterwords = ();
my %filterwords     = ();

# Load the stop list, one word per line.
open STOP, "<$stopfile" or die "Cannot open $stopfile: $!";
while (my $stopword = <STOP>) {
    chomp $stopword;
    $stopwords{$stopword} = 1;
}
close STOP;

# Collect the non-stop words of the base file.
open BASETEXT, "<$base" or die "Cannot open $base: $!";
while (my $line = <BASETEXT>) {
    my @basewords = split /\W/, $line;
    foreach my $baseword (@basewords) {
        next if $baseword eq '';    # split /\W/ can yield empty strings
        $baseword = lc $baseword;
        $basefilterwords{$baseword} = 1 unless $stopwords{$baseword};
    }
}    # this closing brace was missing in the original
close BASETEXT;

# Collect the non-stop words of the remaining files.
foreach my $file (@files) {
    open TEXT, "<$file" or die "Cannot open $file: $!";
    while (my $line = <TEXT>) {
        my @words = split /\W/, $line;
        foreach my $word (@words) {
            next if $word eq '';
            $word = lc $word;
            $filterwords{$word} = 1 unless $stopwords{$word};
        }
    }
    close TEXT;    # moved out of the read loop
}
I only got this far, filtering the words. I am stuck here because I do not know how to change the commands to use arrays. Here is the part in question:
# $D1 and $D2 are assumed to hold the text of the two documents.
my @D1 = map { lc $_ } $D1 =~ /(\w+)/g;
my @D2 = map { lc $_ } $D2 =~ /(\w+)/g;

my %D2 = ();
@D2{@D2} = (1) x scalar @D2;

# total word count of both documents (the original added @D1 twice)
my $total = scalar(@D1) + scalar(@D2);

# count the number of words in common
my $intersection = 0;
foreach my $word (@D1) {
    ++$intersection if $D2{$word};
}

my $similarity = 2 * ($intersection / $total);
print "\n$similarity\n\n";
I am sure that this part needs some changes, but I really do not understand how to make them. I hope someone can help me solve it. Thanks.
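One possible way to apply the formula directly to hashes of filtered words, sketched under two assumptions: "total" means the number of distinct words in the first set plus the number in the second, and the word sets shown are invented for illustration. The `similarity` function name is hypothetical.

```perl
use strict;
use warnings;

# Similarity = 2 * (intersection / total), with "total" assumed to be the
# number of distinct words in the first set plus the number in the second.
sub similarity {
    my ($base, $other) = @_;    # two hash references keyed by word
    my $total = keys(%$base) + keys(%$other);
    return 0 unless $total;     # avoid dividing by zero on empty input
    # grep in scalar context counts the base words also present in $other.
    my $intersection = grep { exists $other->{$_} } keys %$base;
    return 2 * $intersection / $total;
}

# Hypothetical word sets, as they might look after stop-word filtering.
my %base  = map { $_ => 1 } qw(perl hash array);
my %other = map { $_ => 1 } qw(perl hash list scalar);

printf "%.3f\n", similarity(\%base, \%other);    # prints 0.571
```

To compare one base file against several others, a fresh hash would be built for each file and passed to the function in turn, rather than accumulating every file's words into a single hash.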

In reply to Calculating "similarity" by Anonymous Monk
