Re: Counting word frequency after StopWords removal

in reply to Counting word frequency after StopWords removal

You don't say what input you provided, or what you expected the output to be, but I can make an educated guess.

There are two main problems in the code. First, you count every word in %found, not just the non-stop words. Second, %found accumulates data for the whole file, but you print its entire contents after processing each line, so earlier words are printed again and again.

There are also two minor problems. You find words by splitting on /\s+?/, which matches only a single whitespace character at a time, so it will "find" empty strings whenever the text has multiple consecutive whitespace characters. And you do not control the case of words, so for example "the" and "The" will be treated as distinct words (and presumably at most one will be seen as a stop word).
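To make the splitting issue concrete, here is a small sketch comparing three ways of splitting the same line. The sample string and variable names are made up for illustration; only split and lc are core Perl.

```perl
use strict;
use warnings;

# Made-up sample line with leading and doubled whitespace.
my $line = "  The  quick the\n";

# Non-greedy: each separator is a single whitespace character,
# so every extra space produces an empty field.
my @lazy = split /\s+?/, $line;

# Greedy: runs of whitespace are one separator, but leading
# whitespace still produces one empty leading field.
my @greedy = split /\s+/, $line;

# The special ' ' pattern skips leading whitespace as well.
my @plain = split ' ', $line;

printf "%-8s %d fields\n", $_->[0], scalar @{ $_->[1] }
    for [ lazy => \@lazy ], [ greedy => \@greedy ], [ plain => \@plain ];

# Lower-casing before counting folds "The" and "the" together.
my %seen;
++$seen{$_} for split ' ', lc $line;
print "the: $seen{the}\n";
```

Which form to use is a judgment call; I would reach for split ' ' here because it needs no extra filtering of empty strings.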

Guessing that Lingua::StopWords provides lower-case words, I think the core loop should look something like this (untested):

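while (my $line = <$fh2>) {
    # Lower-case the line, split it into words, and count only non-stop words.
    # split ' ' also skips leading whitespace, so no empty strings are counted.
    ++$found{$_} for grep { !$stopwords->{$_} }
            split ' ', lc $line;
}
# Print once, after the whole file has been read.
print $fh $_, "\t\t", $found{$_}, "\n" for sort keys %found;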
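If you want something you can run as-is to check the idea, here is a self-contained sketch. It is not your program: the stop-word list is a hand-written stand-in (I am assuming the real one is a hash reference keyed by lower-case words), the input comes from an in-memory string rather than your file, and the report is built in a string before printing so it is easy to inspect.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real stop-word hash reference.
my $stopwords = { map { $_ => 1 } qw(the a of and) };

# Made-up input, read through an in-memory filehandle.
my $text = "The cat and the hat\n  A cat of note\n";
open my $fh2, '<', \$text or die "Cannot open in-memory input: $!";

my %found;
while (my $line = <$fh2>) {
    # Count each lower-cased word that is not a stop word.
    ++$found{$_} for grep { !$stopwords->{$_} } split ' ', lc $line;
}
close $fh2;

# Build the report once, after all input has been read.
my $report = '';
$report .= "$_\t\t$found{$_}\n" for sort keys %found;
print $report;
```

Note that punctuation is not stripped here, so "hat" and "hat," would still be counted separately; whether that matters depends on your input.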
