Also, if those are your actual stopwords (and I suspect they likely aren't, but anyway), then you can optimize thusly:

\b(?:a(?:bout|bove|cross|fter|fterwards|))\b

...though perhaps it comes at the expense of readability/maintainability. If your regex is generated automatically from a stoplist, then some slightly slower regex generation code might produce a much faster regex. YMMV.

Some sample code:

use strict;
use warnings;
use constant PREFIXSIZE => 1;

my @stopwords = qw( a about above across after afterwards );
# presumably much larger, with other first letters...

# create first => rest mapping:
my %buildhash = ();
for ( @stopwords ) {
    my $rest  = $_;                                   # copy, so @stopwords stays intact
    my $first = substr( $rest, 0, PREFIXSIZE, "" );   # strip the prefix off and keep it
    push @{ $buildhash{$first} }, $rest;
}

# use letter hash to build regex
my $regex = '';
for ( sort keys %buildhash ) {
    $regex .= "(?:\\b$_";
    if ( @{ $buildhash{$_} } > 1 ) {
        $regex .= "(?:" . join( '|', @{ $buildhash{$_} } ) . ")\\b)|";
    }
    else {
        $regex .= $buildhash{$_}[0] . "\\b)|";
    }
}

# ditch trailing pipe
substr( $regex, -1, 1, '' );
print $regex, "\n";

__END__
prints:
(?:\ba(?:|bout|bove|cross|fter|fterwards)\b)

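If your stoplist is large, then you trim your alternations massively: a word only gets tried against the alternatives that share its first letter, not against every stopword. Since you run the alternation against each word of the text, this can be -very- worthwhile.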
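
For illustration, here is a minimal sketch of how a regex of this shape might be applied to strip stopwords from a string. The sample text and the whitespace cleanup are assumptions of mine, not part of the code above:

```perl
use strict;
use warnings;

# Regex of the shape discussed above; compile it once with qr// and reuse it.
my $stop_re = qr/\b(?:a(?:|bout|bove|cross|fter|fterwards))\b/;

# Hypothetical sample text, for illustration only.
my $text = "a bird flew across the lake and above the trees afterwards";

( my $clean = $text ) =~ s/$stop_re//g;   # drop every stopword
$clean =~ s/\s+/ /g;                      # collapse the gaps left behind
$clean =~ s/^\s+|\s+$//g;                 # trim both ends

print "$clean\n";   # prints: bird flew the lake and the trees
```

Note that "lake" and "and" survive: the \b anchors only let the pattern match whole words.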

Re^3: removing stop words
by monarch (Priest) on May 29, 2005 at 13:01 UTC
    Emacs has some wicked functions for turning lists of words into optimised regexp queries.

    In fact the list given above would condense much further, as about and above both share the abo prefix, after and afterwards both share the after prefix, etc.

    If you execute the following elisp command in Emacs:
    (regexp-opt '("a" "about" "above" "across" "after" "afterwards"))

    You get:
    "a\\(?:bo\\(?:ut\\|ve\\)\\|cross\\|fter\\(?:wards\\)?\\)?"

    ...which should be a much more efficient search expression.

    Emacs regexps write grouping and alternation as \( and \|, and each backslash has to be doubled inside a Lisp string. In Perl the same optimised regexp looks like this:
    "a(?:bo(?:ut|ve)|cross|fter(?:wards)?)?"

    Has anyone written an equivalent module in Perl to optimise list searches in regexps the way Emacs has in Lisp?

      Yes - Jarkko Hietaniemi's Regex::PreSuf does just that.

      use Regex::PreSuf;
      my $re = presuf( qw{ a about above across after afterwards } );
      # yields: a(?:bo(?:ut|ve)|cross|fter(?:wards)?)?
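
For anyone curious how this kind of condensation can work, here is a rough sketch that folds shared prefixes using a trie of nested hashes. It is only an illustration of the idea, not how Regex::PreSuf or Emacs implement it (it ignores shared suffixes entirely), and the names build_trie and trie_to_regex are made up for this example:

```perl
use strict;
use warnings;

# Build a trie of nested hashes; the key '' marks "a word ends here".
sub build_trie {
    my @words = @_;
    my %trie;
    for my $word (@words) {
        my $node = \%trie;
        for my $ch ( split //, $word ) {
            $node->{$ch} ||= {};
            $node = $node->{$ch};
        }
        $node->{''} = 1;
    }
    return \%trie;
}

# Turn a trie node into a regex fragment matching every suffix below it.
sub trie_to_regex {
    my ($node) = @_;
    my $ends_here = exists $node->{''};
    my @alts;
    for my $ch ( sort grep { $_ ne '' } keys %$node ) {
        push @alts, quotemeta($ch) . trie_to_regex( $node->{$ch} );
    }
    return '' unless @alts;    # leaf: nothing left to match

    # A lone branch needs no group unless it has to be made optional.
    my $body = ( @alts == 1 && !$ends_here )
        ? $alts[0]
        : '(?:' . join( '|', @alts ) . ')';
    return $ends_here ? "$body?" : $body;
}

my @stopwords = qw( a about above across after afterwards );
my $regex     = trie_to_regex( build_trie(@stopwords) );
print "$regex\n";   # prints: a(?:bo(?:ut|ve)|cross|fter(?:wards)?)?
```

As with the other patterns in this thread, the result still has to be wrapped in \b...\b before it is used on running text.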

      If we assume you aren't incurring the cost of building the regex each time (possibly you keep a stopwords file and stopreg file and rebuild the latter from the former when the former changes, or simply stat and rebuild from the main program...) then you get a significant savings:

               Rate    reg   pre1 presuf
      reg    33.1/s     --   -34%   -59%
      pre1   50.4/s    53%     --   -37%
      presuf 80.6/s   144%    60%     --

      pre1 is my simple algorithm from upthread, reg is a straight alternation, and presuf is presuf(). I used the English stoplist from Lingua::EN::StopWords (about 200 words) and a 4000 word text.
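
A comparison along these lines could be set up with the core Benchmark module, roughly as sketched below. This is not the script that produced the figures above: the six-word stoplist and the repeated sample sentence are stand-ins I made up, and only the reg and pre1 style patterns are compared. Timings will vary with the perl version and the data, and newer perls optimise plain literal alternations internally, so the gap may well be smaller than in the 2005 numbers.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical stand-ins: a tiny stoplist and a repeated sample sentence.
my @stopwords = qw( a about above across after afterwards );
my $text = join ' ',
    ('the cat ran across a field and afterwards sat above it') x 500;

# reg:  straight alternation of every stopword
# pre1: alternatives grouped behind their shared first letter
my $alt  = join '|', map { quotemeta } @stopwords;
my $reg  = qr/\b(?:$alt)\b/;
my $pre1 = qr/\b(?:a(?:|bout|bove|cross|fter|fterwards))\b/;

# Count matches in list context so each run scans the whole text.
sub count_matches {
    my ($re) = @_;
    my $n = () = $text =~ /$re/g;
    return $n;
}

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese( -1, {
    reg  => sub { count_matches($reg) },
    pre1 => sub { count_matches($pre1) },
} );
```

Both patterns should find the same matches; only the time taken to find them differs.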

Re^3: removing stop words
by salva (Canon) on May 29, 2005 at 13:59 UTC
    taking it further...
    /\ba(?:bo(?:ut|ve)|cross|fter(?:wards)?)?\b/
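
A quick sketch of why the \b anchors matter; the test words are arbitrary examples picked for this illustration:

```perl
use strict;
use warnings;

my $bare     = qr/a(?:bo(?:ut|ve)|cross|fter(?:wards)?)?/;
my $anchored = qr/\ba(?:bo(?:ut|ve)|cross|fter(?:wards)?)?\b/;

# Without anchors the pattern happily matches inside longer words.
for my $word (qw( after afternoon banana )) {
    printf "%-10s bare=%-5s anchored=%s\n", $word,
        ( $word =~ $bare     ? 'match' : 'no' ),
        ( $word =~ $anchored ? 'match' : 'no' );
}
```

Only "after" matches the anchored version, while the bare pattern matches all three words.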