in reply to Re: removing stop words
in thread removing stop words
Also, if those are your actual stopwords (and I suspect they likely aren't, but anyway), then you can optimize thusly:
\b(?:a(?:bout|bove|cross|fter|fterwards|))\b
...though perhaps it comes at the expense of readability/maintainability. If your regex is generated automatically from a stoplist, then some slightly slower regex generation code might produce a much faster regex. YMMV.
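A quick sanity check that the factored pattern accepts exactly the same words as a plain alternation. This is only a sketch using the six sample words; the variable names and the extra candidate words are made up for illustration:

```perl
use strict;
use warnings;

# The six sample stopwords from the pattern above.
my @stopwords = qw( a about above across after afterwards );

# Plain alternation: every stopword is its own alternative.
my $alt  = join '|', @stopwords;
my $flat = qr/\b(?:$alt)\b/;

# Hand-factored form: the shared leading "a" is matched once.
my $factored = qr/\b(?:a(?:bout|bove|cross|fter|fterwards|))\b/;

# Both patterns should agree on every candidate, stopword or not.
my @candidates = ( @stopwords, qw( an abouts afterward cross ) );
my $mismatches = 0;
for my $word (@candidates) {
    my $in_flat     = $word =~ /\A$flat\z/     ? 1 : 0;
    my $in_factored = $word =~ /\A$factored\z/ ? 1 : 0;
    $mismatches++ if $in_flat != $in_factored;
    printf "%-10s flat=%d factored=%d\n", $word, $in_flat, $in_factored;
}
print "mismatches: $mismatches\n";
```

The empty alternative at the end of the inner group is what lets a bare "a" match.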
some sample code:
    use strict;
    use warnings;
    use constant PREFIXSIZE => 1;

    my @stopwords = qw( a about above across after afterwards );
    # presumably much larger, with other first letters...

    my %buildhash;

    # create first => rest mapping:
    for my $word ( @stopwords ) {
        my $rest  = $word;    # work on a copy so @stopwords is left intact
        # 4-arg substr: cut PREFIXSIZE chars from offset 0 and return them
        my $first = substr( $rest, 0, PREFIXSIZE, "" );
        push @{ $buildhash{$first} }, $rest;
    }

    # use letter hash to build regex
    my $regex = '';
    for ( sort keys %buildhash ) {
        $regex .= "(?:\\b$_";
        if ( @{ $buildhash{$_} } > 1 ) {
            # several words share this prefix: alternate over the remainders
            $regex .= "(?:" . join( '|', @{ $buildhash{$_} } ) . ")\\b)|";
        }
        else {
            # only one word has this prefix: no inner group needed
            $regex .= ${ $buildhash{$_} }[0] . "\\b)|";
        }
    }

    # ditch trailing pipe
    substr( $regex, -1, 1, '' );
    print $regex, "\n";

    __END__
    prints:
    (?:\ba(?:|bout|bove|cross|fter|fterwards)\b)
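To actually remove stop words with a pattern of this shape, something like the following should do. It is a minimal sketch: the pattern string is hard-coded so the example runs on its own, `strip_stopwords` is a hypothetical helper name, and the case-insensitive flag is my assumption about what you want:

```perl
use strict;
use warnings;

# Pattern string for the six sample stopwords, hard-coded here;
# in practice substitute whatever the generator prints.
my $pattern = '(?:\ba(?:|bout|bove|cross|fter|fterwards)\b)';

# Compile once, case-insensitively, and reuse for every line of input.
my $stop_re = qr/$pattern/i;

# Hypothetical helper: drop stopwords, then tidy the leftover whitespace.
sub strip_stopwords {
    my ($text) = @_;
    $text =~ s/$stop_re//g;    # remove each stopword
    $text =~ s/\s+/ /g;        # collapse runs of whitespace
    $text =~ s/^ | $//g;       # trim a leading or trailing space
    return $text;
}

my $cleaned = strip_stopwords('A bird flew above the house after dark');
print "$cleaned\n";    # prints: bird flew the house dark
```

Note that the words in the stoplist are interpolated as-is, so if yours contain regex metacharacters you would want to run them through `quotemeta` when building the pattern.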
If your stoplist is large, then factoring out the shared prefixes trims your alternations massively. Since each alternative is tried against each word until one matches, this can be -very- worthwhile.
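Whether it is worthwhile for you is easy enough to measure with the core Benchmark module. A rough sketch, with a made-up sample text and only the six sample words, so treat the numbers as indicative at best. Also be aware that perls from 5.10 onward apply their own trie optimization to alternations of literal strings, so the hand-factored form may show little or no gain there:

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

my @stopwords = qw( a about above across after afterwards );

# Flat form: every stopword is its own alternative.
my $alt  = join '|', @stopwords;
my $flat = qr/\b(?:$alt)\b/;

# Factored form: the shared first letter is tested once.
my $factored = qr/\b(?:a(?:|bout|bove|cross|fter|fterwards))\b/;

# Made-up sample text, repeated to give the engine some work.
my $text = 'an apple fell across the road after a storm passed above us ' x 200;

# Count the matches of a pattern in the sample text.
sub count_hits {
    my ($re) = @_;
    my $hits = () = $text =~ /$re/g;
    return $hits;
}

# The two patterns must find the same number of stopwords.
my ( $flat_hits, $factored_hits ) = ( count_hits($flat), count_hits($factored) );
print "flat: $flat_hits, factored: $factored_hits\n";

# Run each for at least one CPU second and print a comparison table.
cmpthese( -1, {
    flat     => sub { count_hits($flat) },
    factored => sub { count_hits($factored) },
} );
```

With a realistic stoplist of a few hundred words and your own text in place of the sample, the comparison table tells you whether the extra generation code pays for itself.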