Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked
 
PerlMonks  

Extracting common words

by Anonymous Monk
on Oct 22, 2015 at 01:19 UTC ( [id://1145607]=perlquestion: print w/replies, xml ) Need Help??

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to extract common words from two files, which are to be specified on the command line. The print out is to include the list of common words and the count of how many common words were found. Non-characters must be removed. Here is what I have so far, which is not working....

my $f1 = shift; my $f2 = shift; if (! defined($f1) or ! defined($f2)) { die "Need two text file names as arguments. \n"; } my %results; open my $file1, '<', $f1; while (my $line = <$file1>) { $line =~ s/[[:punct:]]//g; for my $word (split(/\s+/,$line)) { $word =~ s/[^A-Za-z0-9]//g; $results{lc $word} = 1; } } my @words2; my @storage; open my $file2, '<', $f2; while (my $line = <$file2>) { $line =~ s/[[:punct:]]/ /g; @words2 = grep { /\S/ } split(/ /,$line); for (my $i=0; $i<scalar @words2; $i++){ $words2[$i] = lc($words2[$i]); $words2[$i] =~ s/[^A-Za-z0-9]//g; push(@storage, $words2[$i]); if (grep {$_ eq $words2[$i]} @storage[0..$#storage-1]){ $results{$words2[$i]} = 1; }else{ $results{$words2[$i]}++; } } } my $counter = 0; foreach my $words (sort { $results{$b} <=> $results{$a} } keys %result +s) { if ($results{$words} > 1){ $counter = $counter+1; print $words, "\n\n" ; } } printf "Found %1.0f words in common\n", $counter;
Any help would be appreciated. Thank you.

Replies are listed 'Best First'.
Re: Extracting common words
by Athanasius (Archbishop) on Oct 22, 2015 at 03:41 UTC

    The first while loop is working correctly: it extracts the words from the first text file, removes non-alphanumeric characters, converts the words to lower case, and stores them as the keys in a hash. (Here, as often in Perl, the value stored with the key is irrelevant.) So far, so good. (Note: the line $line =~ s/[[:punct:]]//g; isn’t needed, because all punctuation characters are removed in the subsequent substitution: $word =~ s/[^A-Za-z0-9]//g;.)

    The second while loop is more complicated, and that’s where things fall apart. I don’t think I follow the intended logic here; but, rather than try to fix it, it will be simpler — and clearer — to re-think the algorithm.

    The easiest approach is to simply repeat the logic of the first while loop and create a second hash containing the words in the second text file. (It will help if you rename the first hash %results to something like %words1, then the second hash can be named %words2.) You now have only to find which keys are common to both hashes, and that will give you the desired result:

    my $counter = 0; for my $key (sort keys %words1) { if (exists $words2{$key}) { ++$counter; print $key, "\n\n"; } } print "Found $counter words in common\n";

    You may also benefit from studying the FAQs in Data: Hashes (Associative Arrays).

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      The easiest approach is to simply repeat the logic of the first while loop and create a second hash containing the words in the second text file. (It will help if you rename the first hash %results to something like %words1, then the second hash can be named %words2.) You now have only to find which keys are common to both hashes, and that will give you the desired result:

      This is good advice; there's also List::Compare::Functional, which may make it easier to get the intersection of the two, e.g. (untested):

      use feature qw/say/; use List::Compare::Functional qw/get_intersection/; # ... my @common = get_intersection( [keys %words1], [keys %words2] ); my $counter = scalar @common; # output common words say join "\n\n", @common; # output count; say "Found $counter words in common."

      EDIT: of course it's $counter, not $common; thanks Athanasius. (See what I meant about "untested"?)

Re: Extracting common words
by GrandFather (Saint) on Oct 22, 2015 at 04:03 UTC

    Reworking your code slightly to use "built in" files results are as expected:

    #!/usr/bin/perl use strict; use warnings; =pod Removed original file name code to make sample self contained. my $f1 = shift; my $f2 = shift; if (! defined($f1) or ! defined($f2)) { die "Need two text file names as arguments. \n"; } =cut my $file1Content = <<CONTENT; red green blue red orange CONTENT my $file2Content = <<CONTENT; yellow orange red grey purple CONTENT my %results; open my $file1, '<', \$file1Content; while (my $line = <$file1>) { $line =~ s/[[:punct:]]//g; for my $word (split(/\s+/, $line)) { $word =~ s/[^A-Za-z0-9]//g; $results{lc $word} = 1; } } my @words2; my @storage; open my $file2, '<', \$file2Content; while (my $line = <$file2>) { $line =~ s/[[:punct:]]/ /g; @words2 = grep {/\S/} split(/ /, $line); for (my $i = 0; $i < scalar @words2; $i++) { $words2[$i] = lc($words2[$i]); $words2[$i] =~ s/[^A-Za-z0-9]//g; push(@storage, $words2[$i]); if (grep {$_ eq $words2[$i]} @storage[0 .. $#storage - 1]) { $results{$words2[$i]} = 1; } else { $results{$words2[$i]}++; } } } my $counter = 0; foreach my $words (sort {$results{$b} <=> $results{$a}} keys %results) + { if ($results{$words} > 1) { $counter = $counter + 1; print $words, "\n\n"; } } printf "Found %1.0f words in common\n", $counter;

    Prints:

    orange red Found 2 words in common

    Maybe you can provide "file contents" that fail in the way you didn't describe?

    Of course, the code can be cleaned up a little:

    #!/usr/bin/perl use strict; use warnings; my $file1Content = <<CONTENT; red green blue red orange CONTENT my $file2Content = <<CONTENT; yellow orange red grey purple CONTENT my %group1; open my $file1, '<', \$file1Content; while (my $line = <$file1>) { my @words = map {lc} grep {$_} split /[\W\d]+/, $line; $group1{$_} = $_ for @words; } my %common; open my $file2, '<', \$file2Content; while (my $line = <$file2>) { my @words = map {lc} grep {/\S/} split /[\W\d]+/, $line; $common{$_} = $_ for grep {exists $group1{$_}} @words; } print "$_\n\n" for sort values %common; printf "Found %1.0f words in common\n", scalar keys %common;
    Premature optimization is the root of all job security
Re: Extracting common words
by Anonymous Monk on Oct 22, 2015 at 05:15 UTC
    #!/usr/bin/perl # http://perlmonks.org/?node_id=1145607 use File::Slurp; use strict; use warnings; my $f1 = shift; my $f2 = shift; my (%hash1, %hash2); $hash1{$_} = 1 for lc(read_file $f1) =~ /[A-Za-z0-9]+/g; $hash2{$_} = 1 for lc(read_file $f2) =~ /[A-Za-z0-9]+/g; my $count = my @common = sort grep $hash2{$_}, keys %hash1; print "count: $count\n@common\n";

      $script =~ s/A-Z//g;

Re: Extracting common words
by Anonymous Monk on Oct 22, 2015 at 02:05 UTC
Re: Extracting common words
by Old_Gray_Bear (Bishop) on Oct 22, 2015 at 22:41 UTC
    In Pseudo-Code:
    open file1 loop over file 1 extract each $word $results{$word} = 1 end-loop close file1 open file2 loop over file2 extract each $word $results{$word" += 2 #### Note: '+=', not '=' end-loop close file2 ## ### %results contains all of the words in both files. ### Words that appear in both files will have a value of '3'. ### Iterate over the %results keys and print only the ### when the value is '3'. ##

    ----
    I Go Back to Sleep, Now.

    OGB

      ## ### %results contains all of the words in both files. ### Words that appear in both files will have a value of '3'. ### Iterate over the %results keys and print only the ### when the value is '3'. ##

      If a word appears once in the first file and twice in the second, the value will 5; if it appears 3 times in the second file ...


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
      In the absence of evidence, opinion is indistinguishable from prejudice.
      $results{$word" |= 2 #### Note: '|=', not '+=' and not '='

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1145607]
Approved by Athanasius
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others surveying the Monastery: (8)
As of 2024-04-19 12:27 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found