Re^2: removing stop words

Also, if those are your actual stopwords (and I suspect they likely aren't, but anyway) then you can optimize thusly:

\b(?:a(?:bout|bove|cross|fter|fterwards|))\b

...though perhaps it comes at the expense of readability/maintainability. If your regex is generated automatically from a stoplist, then some slightly slower regex generation code might produce a much faster regex. YMMV.

some sample code:

use constant PREFIXSIZE => 1;

my @stopwords = qw(
   a about above across after afterwards );
   # presumably much larger, with other first letters...

my %buildhash;   # prefix => list of word remainders

# create first => rest mapping:
for ( @stopwords )
{
   # 4-arg substr: cut the first PREFIXSIZE chars off $_ and return them
   my $first = substr( $_, 0, PREFIXSIZE, "" );
   push @{$buildhash{$first}}, $_;
}

# use letter hash to build regex
my $regex = '';
for ( sort keys %buildhash )
{
   $regex .= "(?:\\b$_";
   if ( @{$buildhash{$_}} > 1 )
   {
      $regex .= "(?:" . 
                join( '|',  @{$buildhash{$_}} ) . ")\\b)|";
   } else {
      $regex .= ${$buildhash{$_}}[0]  . "\\b)|";
   }
}

# ditch trailing pipe
substr( $regex, -1, 1, '' );
print $regex, "\n";

__END__


prints:
(?:\ba(?:|bout|bove|cross|fter|fterwards)\b)

If your stoplist is large, then you trim your alternations massively: the inner group is only entered when the first letter already matches. Since you otherwise run each alternation against each word, this can be -very- worthwhile.
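
For what it's worth, here is a minimal sketch of actually applying a regex of this shape to some text. The sample sentence and the names $stop_re, $text and $clean are made up for illustration; the regex is the hand-written grouped form for the six sample words.

```perl
use strict;
use warnings;

# Grouped regex for the six sample stopwords; the empty
# alternative lets a bare "a" match on its own.
my $stop_re = qr/\ba(?:|bout|bove|cross|fter|fterwards)\b/;

# Hypothetical sample text (an assumption, not from the thread).
my $text = "a bird flew above the lake and afterwards across a field";

# Copy the text, then delete each stopword plus any whitespace after it.
( my $clean = $text ) =~ s/$stop_re\s*//g;

print "$clean\n";   # prints: bird flew the lake and field
```

Note that "and" and "lake" survive: the trailing \b rejects "a" followed by more word characters, and the leading \b rejects an "a" in the middle of a word.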


Re^3: removing stop words by monarch (Priest) on May 29, 2005 at 13:01 UTC
Emacs has some wicked functions for turning lists of words into optimised regexp queries. In fact the list given above would condense much further, as about and above both share the abo prefix, after and afterwards both share the after prefix, etc.

If you execute the following elisp command in Emacs:

(regexp-opt '("a" "about" "above" "across" "after" "afterwards"))

You get:

"a\\(?:bo\\(?:ut\\|ve\\)\\|cross\\|fter\\(?:wards\\)?\\)?"

...which should be a much more efficient search expression. Emacs uses the double backslash to escape characters, so in Perl the same optimised regexp looks like this:

"a(?:bo(?:ut|ve)|cross|fter(?:wards)?)?"

Has anyone written an equivalent module in Perl to optimise list searches in regexps the way Emacs has in Lisp?
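
A rough sketch of how the shared-prefix folding could be done by hand in Perl, assuming a plain character trie. This is not the algorithm Emacs uses, and it only merges common prefixes (no suffix handling); build_trie and trie_to_regex are hypothetical helper names invented for this example.

```perl
use strict;
use warnings;

# Build a nested hash keyed by character; the key '' marks "a word ends here".
sub build_trie {
    my %trie;
    for my $word (@_) {
        my $node = \%trie;
        for my $char ( split //, $word ) {
            $node = $node->{$char} ||= {};
        }
        $node->{''} = 1;
    }
    return \%trie;
}

# Turn a trie node into a regex fragment matching every remainder below it.
sub trie_to_regex {
    my ($node) = @_;
    my $is_end = exists $node->{''};
    my @alts;
    for my $char ( sort grep { $_ ne '' } keys %$node ) {
        push @alts, quotemeta($char) . trie_to_regex( $node->{$char} );
    }
    return '' unless @alts;                      # leaf: nothing more to match
    return $alts[0] if @alts == 1 && !$is_end;   # single path: no group needed
    my $group = '(?:' . join( '|', @alts ) . ')';
    return $is_end ? "$group?" : $group;         # optional if a word ends here
}

my $body = trie_to_regex(
    build_trie(qw( a about above across after afterwards )) );
my $re = qr/\b$body\b/;

print "$body\n";   # prints: a(?:bo(?:ut|ve)|cross|fter(?:wards)?)?
```

For this word list the output happens to match the hand-translated Perl form above. With words starting on different letters, the top level becomes a plain (?:...|...) group of per-letter branches.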
Re^4: removing stop words by fishbot_v2 (Chaplain) on May 29, 2005 at 15:34 UTC
Yes - Jarkko Hietaniemi's Regex::PreSuf does just that.

my $re = presuf( qw{ a about above across after afterwards } );
# yields: a(?:bo(?:ut|ve)|cross|fter(?:wards)?)?

If we assume you aren't incurring the cost of building the regex each time (possibly you keep a stopwords file and a stopreg file and rebuild the latter from the former when the former changes, or simply stat and rebuild from the main program...) then you get significant savings:

         Rate    reg   pre1 presuf
reg    33.1/s     --   -34%   -59%
pre1   50.4/s    53%     --   -37%
presuf 80.6/s   144%    60%     --

pre1 is my simple algorithm from upthread, reg is a straight alternation, and presuf is presuf(). I used the English stoplist from Lingua::EN::StopWords (about 200 words) and a 4000-word text.
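
If you want to try a comparison like this yourself, here is a minimal sketch using only the core Benchmark module. It is not the script behind the numbers above: the six-word list, the sample text and the labels reg and grouped are assumptions for illustration, and the rates will vary with your perl version, stoplist and text.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @stopwords = qw( a about above across after afterwards );

# Straight alternation built from the list.
my $alt = join '|', @stopwords;
my $reg = qr/\b(?:$alt)\b/;

# Hand-grouped form of the same six words.
my $grouped = qr/\ba(?:bo(?:ut|ve)|cross|fter(?:wards)?)?\b/;

# Hypothetical sample text, repeated to give the engine some work.
my $text = join ' ',
    ('the cat sat above a mat and ran across after all') x 500;

# Sanity check: both forms should find the same number of stopwords.
my $n_reg     = () = $text =~ /$reg/g;
my $n_grouped = () = $text =~ /$grouped/g;
print "matches: reg=$n_reg grouped=$n_grouped\n";

# Run each candidate for at least 1 CPU second and print a rate table.
cmpthese( -1, {
    reg     => sub { my $n = () = $text =~ /$reg/g },
    grouped => sub { my $n = () = $text =~ /$grouped/g },
} );
```

Checking the match counts first is cheap insurance: a faster regex that matches a different set of words is not an optimisation.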
Re^3: removing stop words by salva (Canon) on May 29, 2005 at 13:59 UTC
taking it further...

/\ba(?:bo(?:ut|ve)|cross|fter(?:wards)?)?\b/