PerlMonks  

Re: Counting word frequency after StopWords removal

by hv (Prior)
on Dec 03, 2022 at 18:43 UTC


in reply to Counting word frequency after StopWords removal

You don't say what input you provided, or what you expected the output to be, but I can make an educated guess.

There are two main problems in the code. First, %found counts every word, not just the non-stop words. Second, %found accumulates counts for the whole file, but you print its entire contents after processing each line.

Additional minor problems: you find words by splitting on /\s+?/, which will "find" empty strings wherever the text has multiple consecutive whitespace characters; and you do not normalize the case of words, so for example "the" and "The" will be treated as distinct words (and presumably at most one will be seen as a stop word).
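To see the two minor problems in isolation, here is a small sketch. The sample strings and variable names are made up for illustration and are not taken from the original script.

```perl
use strict;
use warnings;

my $text = "The  quick   fox";   # runs of two and three spaces

# Non-greedy /\s+?/ matches one whitespace character at a time,
# so every extra space in a run produces an empty field.
my @lazy = split /\s+?/, $text;

# Greedy /\s+/ consumes each whole run of whitespace in one match.
my @greedy = split /\s+/, $text;

printf "lazy:   %d fields (%s)\n", scalar @lazy,   join '|', @lazy;
printf "greedy: %d fields (%s)\n", scalar @greedy, join '|', @greedy;

# Folding case before counting makes "The", "the" and "THE" one key.
my %count;
++$count{$_} for split /\s+/, lc "The the THE";
print "the => $count{the}\n";
```

The empty fields matter because an empty string is not a stop word, so it would be counted like any other "word".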

Guessing that Lingua::StopWords provides lower-case words, I think the core loop should look something like this (untested):

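while (my $line = <$fh2>) {
    # Lower-case the line, split on runs of whitespace (split ' ' also
    # skips leading whitespace), and count only the non-stop words.
    ++$found{$_} for grep { !$stopwords->{$_} } split ' ', lc $line;
}
# Print the totals once, after the whole file has been read.
print $fh $_, "\t\t", $found{$_}, $/ for sort keys %found;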
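For anyone who wants to try the idea without the rest of the script, here is a self-contained sketch of the same loop. It rests on some assumptions: the stop-word hash is hard-coded as a hypothetical stand-in for whatever Lingua::StopWords returns, the input is an in-memory string rather than a real file, and the output goes to STDOUT. It splits with split ' ', which splits on runs of whitespace and also skips leading whitespace, so an indented line does not produce an empty "word".

```perl
use strict;
use warnings;

# Hypothetical stand-in for the stop-word hash reference; in the real
# script this would come from Lingua::StopWords.
my $stopwords = { map { $_ => 1 } qw(the a of and) };

# In-memory sample input, read through a filehandle like a real file.
my $text = "The cat and the hat\n  A cat of  the house\n";
open my $in, '<', \$text or die "Cannot open in-memory input: $!";

my %found;
while (my $line = <$in>) {
    # Lower-case, split into words, and count only the non-stop words.
    ++$found{$_} for grep { !$stopwords->{$_} } split ' ', lc $line;
}
close $in;

# Report once, after all of the input has been read.
print "$_\t\t$found{$_}\n" for sort keys %found;
```

With this sample input only "cat", "hat" and "house" remain as keys, and each is printed a single time because the report runs after the loop rather than inside it.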
