Also, if those are your actual stopwords (and I suspect they aren't, but anyway), then you can optimize thusly:
\b(?:a(?:bout|bove|cross|fter|fterwards|))\b
...though perhaps it comes at the expense of readability/maintainability. If your regex is generated automatically from a stoplist, then some slightly slower regex generation code might produce a much faster regex. YMMV.
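A quick sanity check that the factored pattern still matches the same words as a plain alternation might look like the following. It is only a sketch: the flat pattern and the sample text are assumptions made up for illustration, with near-misses such as "abo" and "afternoon" thrown in on purpose.

```perl
use strict;
use warnings;

# Flat alternation (assumed baseline) and the hand-factored version from above.
my $flat     = qr/\b(?:a|about|above|across|after|afterwards)\b/;
my $factored = qr/\b(?:a(?:bout|bove|cross|fter|fterwards|))\b/;

# Hypothetical sample text, including words that only share a prefix.
my $text = "a cat ran across the road after an abo afternoon and afterwards slept above";

# Collect every match of each pattern, in order of appearance.
my @flat_hits     = $text =~ /($flat)/g;
my @factored_hits = $text =~ /($factored)/g;

print "flat:     @flat_hits\n";
print "factored: @factored_hits\n";
print "same matches\n" if "@flat_hits" eq "@factored_hits";
```

The trailing \b is what keeps the short alternatives honest: "fter" is tried before "fterwards", but the engine backtracks into the longer alternative when the word boundary fails.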
Some sample code:

    use strict;
    use warnings;
    use constant PREFIXSIZE => 1;

    my @stopwords = qw( a about above across after afterwards );
    # presumably much larger, with other first letters...

    # create first => rest mapping:
    my %buildhash;
    for my $word ( @stopwords ) {
        # 4-arg substr returns the prefix and strips it from $word
        my $first = substr( $word, 0, PREFIXSIZE, "" );
        push @{ $buildhash{$first} }, $word;
    }

    # use letter hash to build one alternative per prefix
    my @alts;
    for my $first ( sort keys %buildhash ) {
        my @rest = @{ $buildhash{$first} };
        push @alts, @rest > 1
            ? "$first(?:" . join( '|', @rest ) . ")"
            : $first . $rest[0];
    }

    # joining avoids having to ditch a trailing pipe
    my $regex = '\\b(?:' . join( '|', @alts ) . ')\\b';
    print $regex, "\n";

    __END__
    prints:
    \b(?:a(?:|bout|bove|cross|fter|fterwards))\b
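If your stoplist is large, then you trim your alternations massively: instead of trying every stopword, the engine tries one alternative per first letter and only descends into the group that matches. Since you run each alternation against each word, this can be -very- worthwhile.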
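To show the generated regex actually doing the stopword removal, here is a minimal sketch. The helper name build_stopword_regex, the extra stopword "the", and the sample sentence are all hypothetical; quotemeta is added as a precaution in case a stoplist ever contains regex metacharacters.

```perl
use strict;
use warnings;

# Hypothetical helper: build a first-letter-factored regex from a stoplist.
sub build_stopword_regex {
    my @words = @_;

    # Map each first letter to the (escaped) remainders of its words.
    my %rest_by_first;
    for my $word (@words) {
        my $first = substr( $word, 0, 1 );
        my $rest  = substr( $word, 1 );
        push @{ $rest_by_first{$first} }, quotemeta $rest;
    }

    # One alternative per first letter, each with its own inner alternation.
    my @alts;
    for my $first ( sort keys %rest_by_first ) {
        push @alts,
            quotemeta($first) . '(?:' . join( '|', @{ $rest_by_first{$first} } ) . ')';
    }

    my $pattern = '\b(?:' . join( '|', @alts ) . ')\b';
    return qr/$pattern/;
}

my $stop_re = build_stopword_regex(qw( a about above across after afterwards the ));

# Hypothetical input; each stopword is removed along with its trailing whitespace.
my $text = "the bird flew across a field and afterwards sat above the barn";
( my $cleaned = $text ) =~ s/$stop_re\s*//g;

print "$cleaned\n";    # prints: bird flew field and sat barn
```

Whether this beats a flat alternation depends on your perl and your stoplist, so it is worth timing both with the core Benchmark module before committing to either.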
In reply to Re^2: removing stop words
by fishbot_v2
in thread removing stop words
by zulqernain