
Counting word frequency after StopWords removal

by cinnamond (Initiate)
on Dec 03, 2022 at 18:21 UTC (#11148535)

cinnamond has asked for the wisdom of the Perl Monks concerning the following question:

I have this code which reads words from an input file; it should then remove all stop words and count the words that aren't stop words. When I look at the output, it shows strange results like this:

    hello 1
    welcome 1
    world 1
    hello 1
    our 1
    page 1
    to 1
    welcome 2
    world 1

What am I doing wrong here, and how am I supposed to change this code so that it works properly? My code is below.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Lingua::StopWords qw(getStopWords);

    print "Enter the name of your input file: ";
    chomp( my $file = <STDIN> );

    my %found;
    open my $fh, '>', 'output2.csv' or die "Can't open this file: $!";
    open my $fh2, '<', $file or die "Can't open this file: $!";
    my $stopwords = getStopWords('en');

    while (my $line = <$fh2>) {
        my @words_all = split /\s+/, $line;
        $found{$_}++ foreach split /\s+?/, $line;
        my @words_nostop = grep { !$stopwords->{$_} } @words_all;
        #print {$fh} join( ' ', @words_nostop ), "\n";
        print $fh $_, "\t\t", $found{$_}, $/ foreach sort keys %found;
    }

    close $fh2 or die "Can't close file: $!";
    close $fh or die "Can't close file: $!";

Replies are listed 'Best First'.
Re: Counting word frequency after StopWords removal
by hv (Prior) on Dec 03, 2022 at 18:43 UTC

    You don't say what input you provided, or what you expected the output to be, but I can make an educated guess.

    There are two main problems in the code. First, you count all words found in %found, not the non-stop words. Second, %found accumulates data for the whole of the file, but you print its entire contents after processing each line.

    Additional minor problems are that you count words found by splitting on /\s+?/, which will "find" empty strings if the text has multiple consecutive whitespace characters; and you do not control the case of words, so for example "the" and "The" will be treated as distinct words (and presumably at most one will be seen as a stop word).
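The point about empty strings can be checked directly. The following is a minimal sketch using only core Perl; the helper name split_three_ways and the sample line are made up for illustration. It compares the non-greedy pattern from the question with a greedy pattern and with the special-cased split ' '.

```perl
use strict;
use warnings;

# Return the fields produced by three different ways of splitting a line.
# (split_three_ways is a hypothetical helper, not part of the posted code.)
sub split_three_ways {
    my ($line) = @_;
    return {
        lazy   => [ split /\s+?/, $line ],   # one whitespace character per separator
        greedy => [ split /\s+/,  $line ],   # whole runs, but keeps a leading empty field
        awk    => [ split ' ',    $line ],   # whole runs, leading whitespace skipped
    };
}

# Sample line with two leading spaces and two spaces between the words.
my $fields = split_three_ways("  the  cat");
for my $mode (qw(lazy greedy awk)) {
    printf "%-6s %d fields: %s\n", $mode, scalar @{ $fields->{$mode} },
        join ', ', map { "'$_'" } @{ $fields->{$mode} };
}
```

For this sample line the non-greedy pattern yields five fields (three of them empty), the greedy pattern yields three (one leading empty field), and split ' ' yields just the two words.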

    Guessing that Lingua::StopWords provides lower-case words, I think the core loop should look something like this (untested):

        while (my $line = <$fh2>) {
            ++$found{$_} for grep { !$stopwords->{$_} } split /\s+/, lc $line;
        }
        print $fh $_, "\t\t", $found{$_}, $/ for sort keys %found;

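A self-contained variant of that loop can be tried without any CPAN modules. This is only a sketch under stated assumptions: the three-word %stopwords hash stands in for getStopWords('en'), the input text is invented, and count_words is a hypothetical helper name. It also swaps /\s+/ for split ' ' so that a line starting with whitespace does not contribute an empty "word".

```perl
use strict;
use warnings;

# ASSUMPTION: a tiny stand-in stop list so the sketch runs without
# Lingua::StopWords; the real code would use getStopWords('en').
my %stopwords = map { $_ => 1 } qw(to our the);

# Count non-stop words over the whole input, then return the totals.
sub count_words {
    my ($fh) = @_;
    my %found;
    while (my $line = <$fh>) {
        # split ' ' skips leading whitespace, so no empty "word" is counted
        ++$found{$_} for grep { !$stopwords{$_} } split ' ', lc $line;
    }
    return \%found;
}

# Hypothetical two-line input, read through an in-memory filehandle.
my $text = "Hello world welcome\nWelcome to our page\n";
open my $in, '<', \$text or die "Can't open in-memory file: $!";
my $found = count_words($in);
close $in;

# Print once, after the loop, so each word appears a single time.
print "$_\t$found->{$_}\n" for sort keys %$found;
```

With this input each word is listed once, and "welcome" is counted twice because both lines are lower-cased before counting.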
Re: Counting word frequency after StopWords removal
by kcott (Archbishop) on Dec 04, 2022 at 01:17 UTC

    G'day cinnamond,

    Welcome to the Monastery.

    [Aside: I see that it's already been pointed out that you've missed providing us with some details; the guidelines in "How do I post a question effectively?" have some more information about that. Otherwise, a well-presented first post: thank you. Also, although it's not often directly related to the problem at hand, telling us your O/S and Perl version can result in a better answer from us (e.g. we might suggest a better, more recent Perl feature if we know you have a version that supports it).]

    Your code for getting the input file is fine. You might consider adding a second question to get an output filename. I put together what I considered a fairly challenging input file (lots of edge/corner cases) and hard-coded the filename.

        $ cat test_input.txt
        Hello, world!
        I said, "Hello, world!".
        Did he say "Hello, world!"?
        We're not sure.
        1tab: 2tabs: 3tabs: END-TABS
        # multi-spacing here - blank line next

        The cat sat on the mat.
        Old pronouns: thou; thee; thy; thine.
        New pronouns: "u", 'ur'.
        Forecastle = "fo'c'sle" or "fo'c's'le"
        Forecastle = 'fo'c'sle' or 'fo'c's'le'
        Don't hide the Very pistol; it could be very important.
        Why exclude different but include same?

    I hadn't used Lingua::StopWords previously so I read the documentation. To be honest, I found it lacking in a number of respects: you can't add new stopwords; you can't remove existing stopwords that you don't want; you either specify a UTF-8 encoding or take whatever they give you. Take a look at the various language plugins in the Lingua-StopWords distribution if you haven't already done so. I added a _mod_stops() routine (in the code below) to address some of those issues; you can modify/extend that if you have other requirements.

    Working with CSV files has many gotchas: how to handle a field containing the separator character; how to quote a field containing a quote character; and so on. Writing your own code for this, unless as an academic exercise, is ill-advised. The Text::CSV module is robust, thoroughly tested, and addresses these issues: I strongly recommend you use it. It runs faster if you also have Text::CSV_XS installed, but that's optional. If you make a mistake like trying to use a string, instead of a single character, as a separator (as you did in your posted code) it will tell you about it. In the script below, I've included code to use Text::CSV: as you can see, it's very straightforward.
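The first of those gotchas is easy to reproduce in core Perl. The sketch below is illustrative only: naive_row, naive_parse and the sample record are invented names and data, and the code deliberately does what the paragraph above warns against, namely joining fields with the separator and no quoting.

```perl
use strict;
use warnings;

# Naive approach: glue the fields together with the separator.
sub naive_row {
    my ($sep, @fields) = @_;
    return join $sep, @fields;
}

# Read the row back the way a simple consumer would.
sub naive_parse {
    my ($sep, $row) = @_;
    return split /\Q$sep\E/, $row, -1;
}

# Hypothetical record: the first field itself contains a tab.
my @record = ("rock\tand roll", 3);
my @back   = naive_parse("\t", naive_row("\t", @record));

printf "wrote %d fields, read back %d\n", scalar @record, scalar @back;
```

Two fields go in and three come out, because nothing marks the embedded tab as data rather than a separator. Quoting such fields is the kind of detail a dedicated CSV module takes care of.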

    Your example code shows using 'en' (English); I don't know if you have requirements for other languages. I hard-coded $lang but created a lookup table for language regexes, %word_re_for. That shows you some options; adapt according to your needs.
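To see what the English entry of that lookup table does to individual tokens, here is a small sketch. The regex is copied from the script in this reply; the wrapper extract_word and the choice of sample tokens are assumptions made for illustration.

```perl
use strict;
use warnings;

# The English pattern from the lookup table: skip leading punctuation,
# then capture a run of alphanumerics and apostrophes that ends in an
# alphanumeric character.
my %word_re_for = (
    en => qr{^.*?\b([\p{Alnum}']*[\p{Alnum}]+).*$},
);

# Return the lower-cased word inside a whitespace-delimited token,
# or undef when the token contains no word at all.
# (extract_word is a hypothetical helper, not part of the posted script.)
sub extract_word {
    my ($lang, $token) = @_;
    return $token =~ $word_re_for{$lang} ? lc $1 : undef;
}

for my $token ('"Hello,', 'END-TABS', q{'fo'c'sle'}, '-') {
    my $word = extract_word(en => $token);
    printf "%-12s => %s\n", $token, defined $word ? $word : '(no word)';
}
```

Surrounding quotes and punctuation are dropped, internal apostrophes survive, a hyphenated token contributes only its first part, and a bare "-" produces no word.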

    I split the I/O parts of the code into two anonymous blocks. This means that filehandles are only open for the time they're needed. Perl automatically closes them at the end of those blocks: no need for close() statements. Perl also does the I/O exception handling for you via the autodie pragma: no need for '... or die "Can't whatever: $!";' all over the place.
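The scoping and autodie behaviour described above can be sketched with core modules only. The file names, the temporary directory and the helpers write_lines and read_lines are hypothetical; the bare blocks mirror the structure of the script in this reply.

```perl
use strict;
use warnings;
use autodie;                    # open/close now throw exceptions on failure
use File::Temp qw(tempdir);

# Write lines inside a bare block; $fh is closed (and flushed) when the
# block ends because the lexical handle goes out of scope.
sub write_lines {
    my ($path, @lines) = @_;
    {
        open my $fh, '>', $path;
        print {$fh} "$_\n" for @lines;
    }
    return;
}

# Read the file back in a separate block with its own short-lived handle.
sub read_lines {
    my ($path) = @_;
    my @lines;
    {
        open my $fh, '<', $path;
        chomp(@lines = <$fh>);
    }
    return @lines;
}

# Hypothetical scratch file in a temporary directory.
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/demo.txt";
write_lines($path, qw(alpha beta));
print join(',', read_lines($path)), "\n";

# A failed open raises an exception instead of returning false.
my $ok = eval { open my $fh, '<', "$dir/missing.txt"; 1 };
print "open failed: ", ($ok ? 'no' : 'yes'), "\n";
```

The data written in the first block is readable in the second without an explicit close, and the attempt to open a missing file is reported through the exception that eval catches.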

    I'll also just mention that fc() is preferred over uc() and lc() when canonicalising strings for comparison. It requires Perl v5.16 — not knowing your Perl version, I didn't use it. (Refer back to the "Aside" at the top.)
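For readers who do have Perl v5.16 or later, the difference between lc() and fc() shows up with characters whose case fold is longer than one character. This is a minimal sketch; the helper names and the sample strings are chosen for illustration.

```perl
use v5.16;          # enables the fc feature and Unicode string semantics
use warnings;

# Compare two strings case-insensitively using lc and using fc.
# (same_by_lc and same_by_fc are hypothetical helper names.)
sub same_by_lc { my ($x, $y) = @_; return lc($x) eq lc($y) ? 1 : 0 }
sub same_by_fc { my ($x, $y) = @_; return fc($x) eq fc($y) ? 1 : 0 }

# "\N{U+DF}" is the German sharp s, whose full case fold is "ss".
my ($lower, $upper) = ("stra\N{U+DF}e", "STRASSE");

printf "lc: %d  fc: %d\n", same_by_lc($lower, $upper), same_by_fc($lower, $upper);
```

The lc comparison treats the two spellings as different, while the fc comparison treats them as the same word. For plain ASCII text such as "The" and "the", both approaches agree.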

    Here's the code.

        #!/usr/bin/env perl

        use strict;
        use warnings;
        use autodie;

        use Lingua::StopWords 'getStopWords';
        use Text::CSV;

        my ($lang, $encoding) = qw{en UTF-8};

        my %word_re_for = (
            en => qr{^.*?\b([\p{Alnum}']*[\p{Alnum}]+).*$},
        );

        my ($in_file, $out_file) = qw{test_input.txt test_output.csv};

        my $is_stop = _mod_stops(getStopWords($lang, $encoding));
        my %count_for;

        {
            open my $fh, '<:encoding(UTF-8)', $in_file;

            while (<$fh>) {
                TOKEN:
                for my $token (split) {
                    next TOKEN unless $token =~ $word_re_for{$lang};
                    my $word = lc $1;
                    next TOKEN if $is_stop->{$word};
                    ++$count_for{$word};
                }
            }
        }

        {
            open my $fh, '>:encoding(UTF-8)', $out_file;
            my $csv = Text::CSV::->new({sep_char => "\t", binary => 1});
            $csv->say($fh, [$_, $count_for{$_}]) for sort keys %count_for;
        }

        sub _mod_stops {
            my ($stops) = @_;

            my @adds = qw{thou thee thy thine u ur};
            my @dels = qw{very same};

            $stops->{$_} = 1 for @adds;
            delete @$stops{@dels};

            return $stops;
        }

    Here's the output.

        $ cat test_output.csv
        1tab	1
        2tabs	1
        3tabs	1
        blank	1
        cat	1
        different	1
        end	1
        exclude	1
        fo'c's'le	2
        fo'c'sle	2
        forecastle	2
        hello	3
        hide	1
        important	1
        include	1
        line	1
        mat	1
        multi	1
        new	1
        next	1
        old	1
        pistol	1
        pronouns	2
        said	1
        same	1
        sat	1
        say	1
        sure	1
        very	2
        world	3

    — Ken
