
Re: Counting word frequency after StopWords removal

by kcott (Archbishop)
on Dec 04, 2022 at 01:17 UTC (#11148542)

in reply to Counting word frequency after StopWords removal

G'day cinnamond,

Welcome to the Monastery.

[Aside: I see that it's already been pointed out that you've missed providing us with some details; the guidelines in "How do I post a question effectively?" have some more information about that. Otherwise, a well-presented first post: thank you. Also, although it's not often directly related to the problem at hand, telling us your O/S and Perl version can result in a better answer from us (e.g. we might suggest a better, more recent Perl feature if we know you have a version that supports it).]

Your code for getting the input file is fine. You might consider adding a second question to get an output filename. I put together what I considered a fairly challenging input file (lots of edge/corner cases) and hard-coded the filename.

    $ cat test_input.txt
    Hello, world!
    I said, "Hello, world!".
    Did he say "Hello, world!"?
    We're not sure.
    1tab: 2tabs: 3tabs: END-TABS
    # multi-spacing here - blank line next

    The cat sat on the mat.
    Old pronouns: thou; thee; thy; thine.
    New pronouns: "u", 'ur'.
    Forecastle = "fo'c'sle" or "fo'c's'le"
    Forecastle = 'fo'c'sle' or 'fo'c's'le'
    Don't hide the Very pistol; it could be very important.
    Why exclude different but include same?

I hadn't used Lingua::StopWords previously so I read the documentation. To be honest, I found it lacking in a number of respects: you can't add new stopwords; you can't remove existing stopwords that you don't want; you either specify a UTF-8 encoding or take whatever they give you. Take a look at the various language plugins in the Lingua-StopWords distribution if you haven't already done so. I added a _mod_stops() routine (in the code below) to address some of those issues; you can modify/extend that if you have other requirements.
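If you'd rather not hard-code the additions and removals, something along these lines might serve as a starting point. It's only a sketch: mod_stops() and its add/remove options are names I've made up for illustration, and the small hashref stands in for whatever getStopWords() returns (a reference to a hash keyed on stopword). Unlike _mod_stops(), it works on a copy and leaves the original list untouched.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::StopWords): returns a modified
# copy of a stopword hashref, leaving the original untouched.
sub mod_stops {
    my ($stops, %opt) = @_;

    my %copy = %$stops;
    $copy{$_} = 1 for @{ $opt{add} // [] };        # words to treat as stopwords
    delete @copy{ @{ $opt{remove} // [] } };       # words to stop excluding

    return \%copy;
}

# Stand-in for the hashref returned by getStopWords(); an assumption
# made so the example runs without the module installed.
my $base = { map { $_ => 1 } qw{the very same on} };

my $is_stop = mod_stops(
    $base,
    add    => [qw{thou thee}],
    remove => [qw{very same}],
);

print join(' ', sort keys %$is_stop), "\n";
```

The // operator needs Perl v5.10 or later.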

Working with CSV files has many gotchas: how to handle a field containing the separator character; how to quote a field containing a quote character; and so on. Writing your own code for this, unless as an academic exercise, is ill-advised. The Text::CSV module is robust, thoroughly tested, and addresses these issues: I strongly recommend you use it. It runs faster if you also have Text::CSV_XS installed, but that's optional. If you make a mistake like trying to use a string, instead of a single character, as a separator (as you did in your posted code) it will tell you about it. In the script below, I've included code to use Text::CSV: as you can see, it's very straightforward.
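To give a feel for what the module takes care of, here's a minimal sketch (separate from the script below) that builds one CSV line from fields containing a comma and a double quote, then parses it back. It uses the module's default comma separator; the field values are just examples I made up.

```perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1 });

# Fields that would break a naive join(',', ...) approach.
my @fields = ('plain', 'has,comma', 'has "quotes"');

# combine() builds the line; string() retrieves it.
$csv->combine(@fields) or die "combine failed";
my $line = $csv->string;
print "$line\n";

# parse() and fields() reverse the process.
$csv->parse($line) or die "parse failed";
my @round_trip = $csv->fields;
print "round trip OK\n" if "@round_trip" eq "@fields";
```

The field with the comma comes out quoted, and the embedded quotes are doubled, so the line can be read back without ambiguity.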

Your example code shows using 'en' (English); I don't know if you have requirements for other languages. I hard-coded $lang but created a lookup table for language regexes, %word_re_for. That shows you some options; adapt according to your needs.
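If it helps when adapting the regex, here's a small sketch showing what the 'en' pattern captures from a few of the awkward tokens in my test input. The token list is hand-picked for illustration; in the script the tokens come from split.

```perl
use strict;
use warnings;

# Same pattern as the 'en' entry in %word_re_for.
my $word_re = qr{^.*?\b([\p{Alnum}']*[\p{Alnum}]+).*$};

# Tokens as split would produce them from the test input.
my @tokens = ('"Hello,', 'world!".', q{'fo'c'sle'}, q{Don't}, 'END-TABS', '-', '#');

my @results;
for my $token (@tokens) {
    # undef means the token contains no word and would be skipped.
    my $word = $token =~ $word_re ? lc $1 : undef;
    push @results, $word;
    printf "%-12s => %s\n", $token, defined $word ? $word : '(skipped)';
}
```

Surrounding punctuation and quotes are stripped, internal apostrophes are kept, a hyphenated token yields only its first part, and tokens with no alphanumerics don't match at all.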

I split the I/O parts of the code into two anonymous blocks. This means that filehandles are only open for the time they're needed. Perl automatically closes them at the end of those blocks: no need for close() statements. Perl also does the I/O exception handling for you via the autodie pragma: no need for '... or die "Can't whatever: $!";' all over the place.
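Here's a stripped-down sketch of that pattern on its own, using a temporary directory so it can be run anywhere. The filenames are made up for the demonstration, and the eval at the end is only there to show that autodie throws an exception when open fails.

```perl
use strict;
use warnings;
use autodie;
use File::Temp 'tempdir';

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/demo.txt";

{
    open my $out_fh, '>', $file;          # autodie throws if this fails
    print {$out_fh} "line one\nline two\n";
}   # $out_fh goes out of scope here, so Perl closes the file

my @lines;
{
    open my $in_fh, '<', $file;
    chomp(@lines = <$in_fh>);
}   # likewise for $in_fh

# Opening a file that doesn't exist: no 'or die' needed.
my $opened = eval { open my $fh, '<', "$dir/no_such_file.txt"; 1 };
my $error  = $@;

print "read ", scalar(@lines), " lines\n";
print "missing file raised an exception\n" unless $opened;
```

One caveat: an implicit close at the end of a block doesn't report errors. For output files where that matters, an explicit close (which autodie also checks) is still an option.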

I'll also just mention that fc() is preferred over uc() and lc() when canonicalising strings for comparison. It requires Perl v5.16 or later and has to be enabled (use feature 'fc'; or use v5.16;); not knowing your Perl version, I didn't use it. (Refer back to the "Aside" at the top.)
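For the curious, here's a minimal sketch of the difference, assuming Perl v5.16 or later. The German word is just a convenient example: the sharp s (U+00DF) case-folds to "ss", which lc() doesn't do.

```perl
use v5.16;      # enables fc, say, strict and unicode_strings
use warnings;

my $lower = "stra\N{U+DF}e";    # "strasse" spelt with a sharp s
my $upper = "STRASSE";

# lc() leaves the sharp s alone, so the two spellings don't match.
my $same_with_lc = lc($lower) eq lc($upper) ? 1 : 0;

# fc() folds the sharp s to "ss", so they do.
my $same_with_fc = fc($lower) eq fc($upper) ? 1 : 0;

say "lc: ", $same_with_lc ? 'equal' : 'different';
say "fc: ", $same_with_fc ? 'equal' : 'different';
```

In the script below, the equivalent change would be to replace lc $1 with fc $1. Bear in mind that the folded form then becomes the hash key, so it's also what appears in the output file.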

Here's the code.

    #!/usr/bin/env perl

    use strict;
    use warnings;
    use autodie;

    use Lingua::StopWords 'getStopWords';
    use Text::CSV;

    my ($lang, $encoding) = qw{en UTF-8};

    my %word_re_for = (
        en => qr{^.*?\b([\p{Alnum}']*[\p{Alnum}]+).*$},
    );

    my ($in_file, $out_file) = qw{test_input.txt test_output.csv};

    my $is_stop = _mod_stops(getStopWords($lang, $encoding));
    my %count_for;

    {
        open my $fh, '<:encoding(UTF-8)', $in_file;

        while (<$fh>) {
            TOKEN: for my $token (split) {
                next TOKEN unless $token =~ $word_re_for{$lang};
                my $word = lc $1;
                next TOKEN if $is_stop->{$word};
                ++$count_for{$word};
            }
        }
    }

    {
        open my $fh, '>:encoding(UTF-8)', $out_file;
        my $csv = Text::CSV::->new({sep_char => "\t", binary => 1});
        $csv->say($fh, [$_, $count_for{$_}]) for sort keys %count_for;
    }

    sub _mod_stops {
        my ($stops) = @_;

        my @adds = qw{thou thee thy thine u ur};
        my @dels = qw{very same};

        $stops->{$_} = 1 for @adds;
        delete @$stops{@dels};

        return $stops;
    }

Here's the output.

    $ cat test_output.csv
    1tab	1
    2tabs	1
    3tabs	1
    blank	1
    cat	1
    different	1
    end	1
    exclude	1
    fo'c's'le	2
    fo'c'sle	2
    forecastle	2
    hello	3
    hide	1
    important	1
    include	1
    line	1
    mat	1
    multi	1
    new	1
    next	1
    old	1
    pistol	1
    pronouns	2
    said	1
    same	1
    sat	1
    say	1
    sure	1
    very	2
    world	3

— Ken
