Re: Filtering out stop words

As the other monks said, with the regex method, you have to check a $term against each stopword until a match is found. If no match found, you end up checking with ALL stopwords. The benefit is that it offers you non-exact matching, useful in checking against all the variations of a stopword.

On the other hand you have a hashtable which will offer you only exact matching (therefore if you have variations they will have to be entered each as individual stopwords). But fetching is O(1). Constructing can be expensive but once you construct it you perhaps can serialise it and save to to disk for later use.

The approach I am suggesting is using a binary tree to store your exact stopwords and all their variations if any. In this way you will do at most as many string comparisons as the height of the tree. And what's that? Little if your stopwords can make a nice balanced tree or not. This data structure can also be serialised and saved to disk.

Here is some code to get you started:

# author: bliako
# date:   25/02/2020
# for:    https://perlmonks.org/?node_id=11113388

use strict;
use warnings;

use AVLTree;

# create a binary tree with custom string comparator
# if cmp < 0, word goes to the left of current node
# if cmp > 0, word goes to the right
# if cmp == 0, word is identical to the current node
my $tree = AVLTree->new(sub { $_[0] cmp $_[1] });

open(my $IN, '<', '/usr/share/dict/words');

my $tstart = time;
my $numwords = 0;
while( my $aword=<$IN> ){
    chomp $aword;
    $tree->insert($aword);
    $numwords++;
}
close($IN);
print "$0 : tree size: ".$tree->size()." (same as $numwords in file).\
+n";
print "$0 : inserted $numwords words in ".(time-$tstart)." seconds.\n"
+;

open($IN, '<', '/usr/share/dict/words'); 
$numwords = 0;
$tstart = time;
while( my $astopword=<$IN> ){
    chomp $astopword;
    next if rand>0.05;
    if( ! $tree->find($astopword) ){ print STDERR "$0 : error, did not
+ find the word '$astopword', that's not right\n"; exit(1); }
    $numwords++;
}
close($IN); # edit: forgot that!
print "search $numwords words in ".(time-$tstart)." seconds.\n";
[download]

Note that with an XS module like AVLTree it may not be possible to serialise, in which case use a pure Perl module like Tree::Binary.

Later edit: just to clarify that a binary tree is one in which each node has a maximum of 2 children. Left and Right. An AVL_tree is a binary tree which internally tries to be balanced without user interaction. What is "balanced tree"? It's one that all leaf nodes more or less have the same distance from the root node. So long branches and short branches are avoided. That gives consistent performance. etc.

AVL trees were invented in the Soviet Union in 1962. One of the two mathematicians who invented it also participated in building Kaissa, the first computer chess program to win a championship in 1974.

Another Later Edit: The benefit of using a tree over using a hashtable is that the tree's capacity can not be exhausted unless you run out of RAM. With a hashtable you are limited by the hash-key generator and other internal implementations, so practically its size may have a limit - depending on implementation.

bw, bliako

Comment on Re: Filtering out stop words Select or Download Code


good chemistry is complicated, and a little bit messy -LW
	PerlMonks