IB2017 has asked for the wisdom of the Perl Monks concerning the following question:
Hello
I used to check whether a word needs to be excluded from processing by checking if it is contained in a stop words list. I used this method:
my $CkDiscardCommonwords = 1;    # check whether to use stop words or not
my $term = "word";
my $commonwordsRX = loadCommonWords();

if ($CkDiscardCommonwords eq 1) {
    if ($term =~ /^(?:$commonwordsRX)$/) {
        return (0);    # this snippet lives inside a larger sub
    }
}

sub loadCommonWords {
    my @commonwords;
    my $filename = "commonWords.txt";
    # lexical filehandle, so the code also compiles under "use strict"
    if (open my $FH, "<:encoding(UTF-8)", $filename) {
        while (my $line = <$FH>) {
            chomp $line;
            push @commonwords, $line;
        }
        close $FH;
    }
    # escape each word and join into one alternation
    my $commonwordsRX = join "|", map quotemeta, @commonwords;
    return $commonwordsRX;
}
Now my software has changed and the list of common words saved in commonWords.txt may grow exponentially. It used to be small (~300 words); now it could reach several thousand.
I would like to hear what expert monks think about this implementation. Would a Regex constructed in this way cause problems when it grows? Should I choose another approach?
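For comparison, here is a minimal sketch of an exact-match hash lookup, a common alternative for this kind of membership test, since the average cost of a hash lookup does not depend on how many words are in the list. The helper names `load_stop_words` and `is_stop_word` are hypothetical, and the word list is passed in directly as an assumption instead of being read from commonWords.txt.

```perl
use strict;
use warnings;

# Hypothetical helper: build a lookup hash from a list of words.
# In the setup described above, the words would come from commonWords.txt.
sub load_stop_words {
    my @words = @_;
    my %stop;
    $stop{$_} = 1 for @words;
    return \%stop;
}

# Hypothetical helper: exact, case-sensitive membership test.
# Returns 1 if the term is a stop word, 0 otherwise.
sub is_stop_word {
    my ($stop, $term) = @_;
    return exists $stop->{$term} ? 1 : 0;
}

# Illustrative word list only, not the real contents of commonWords.txt.
my $stop = load_stop_words(qw(the a an of and));

for my $term (qw(the word)) {
    printf "%s: %s\n", $term, is_stop_word($stop, $term) ? "discard" : "keep";
}
```

The hash is built once and reused for every term, in the same way that `$commonwordsRX` is built once above. One difference to keep in mind: in a regex, `$` also matches before a trailing newline, so a term that has not been chomped could match the regex but not the hash key.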
Replies are listed 'Best First'.
Re: Filtering out stop words (updated)
by haukex (Archbishop) on Feb 25, 2020 at 10:01 UTC
by IB2017 (Pilgrim) on Feb 25, 2020 at 10:26 UTC
Re: Filtering out stop words
by Eily (Monsignor) on Feb 25, 2020 at 10:09 UTC
Re: Filtering out stop words
by bliako (Monsignor) on Feb 25, 2020 at 12:29 UTC
Re: Filtering out stop words
by talexb (Chancellor) on Feb 25, 2020 at 13:17 UTC
by Fletch (Bishop) on Feb 25, 2020 at 15:27 UTC
Re: Filtering out stop words
by Eily (Monsignor) on Feb 25, 2020 at 14:04 UTC
Re: Filtering out stop words
by Ea (Chaplain) on Feb 26, 2020 at 09:45 UTC