comment on

As the other monks said, with the regex method, you have to check a $term against each stopword until a match is found. If no match found, you end up checking with ALL stopwords. The benefit is that it offers you non-exact matching, useful in checking against all the variations of a stopword.

On the other hand you have a hashtable which will offer you only exact matching (therefore if you have variations they will have to be entered each as individual stopwords). But fetching is O(1). Constructing can be expensive but once you construct it you perhaps can serialise it and save to to disk for later use.

The approach I am suggesting is using a binary tree to store your exact stopwords and all their variations if any. In this way you will do at most as many string comparisons as the height of the tree. And what's that? Little if your stopwords can make a nice balanced tree or not. This data structure can also be serialised and saved to disk.

Here is some code to get you started:

# author: bliako
# date:   25/02/2020
# for:    https://perlmonks.org/?node_id=11113388

use strict;
use warnings;

use AVLTree;

# create a binary tree with custom string comparator
# if cmp < 0, word goes to the left of current node
# if cmp > 0, word goes to the right
# if cmp == 0, word is identical to the current node
my $tree = AVLTree->new(sub { $_[0] cmp $_[1] });

open(my $IN, '<', '/usr/share/dict/words');

my $tstart = time;
my $numwords = 0;
while( my $aword=<$IN> ){
    chomp $aword;
    $tree->insert($aword);
    $numwords++;
}
close($IN);
print "$0 : tree size: ".$tree->size()." (same as $numwords in file).\
+n";
print "$0 : inserted $numwords words in ".(time-$tstart)." seconds.\n"
+;

open($IN, '<', '/usr/share/dict/words'); 
$numwords = 0;
$tstart = time;
while( my $astopword=<$IN> ){
    chomp $astopword;
    next if rand>0.05;
    if( ! $tree->find($astopword) ){ print STDERR "$0 : error, did not
+ find the word '$astopword', that's not right\n"; exit(1); }
    $numwords++;
}
close($IN); # edit: forgot that!
print "search $numwords words in ".(time-$tstart)." seconds.\n";
[download]

Note that with an XS module like AVLTree it may not be possible to serialise, in which case use a pure Perl module like Tree::Binary.

Later edit: just to clarify that a binary tree is one in which each node has a maximum of 2 children. Left and Right. An AVL_tree is a binary tree which internally tries to be balanced without user interaction. What is "balanced tree"? It's one that all leaf nodes more or less have the same distance from the root node. So long branches and short branches are avoided. That gives consistent performance. etc.

AVL trees were invented in the Soviet Union in 1962. One of the two mathematicians who invented it also participated in building Kaissa, the first computer chess program to win a championship in 1974.

Another Later Edit: The benefit of using a tree over using a hashtable is that the tree's capacity can not be exhausted unless you run out of RAM. With a hashtable you are limited by the hash-key generator and other internal implementations, so practically its size may have a limit - depending on implementation.

bw, bliako

In reply to Re: Filtering out stop words by bliako
in thread Filtering out stop words by IB2017

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.