in reply to Re^2: removing stop words
in thread removing stop words

Emacs has some wicked functions for turning lists of words into optimised regexp queries.

In fact, the list given above would condense much further, since "about" and "above" share the prefix "abo", "after" and "afterwards" share the prefix "after", and so on.

If you execute the following elisp command in Emacs:
(regexp-opt '("a" "about" "above" "across" "after" "afterwards"))

You get:
"a\\(?:bo\\(?:ut\\|ve\\)\\|cross\\|fter\\(?:wards\\)?\\)?"

...which should be a much more efficient search expression, since each shared prefix is tested once rather than once per alternative.

Emacs regexps write grouping and alternation as \( \) and \|, and each backslash has to be doubled again inside a Lisp string. In Perl, the same optimised regexp looks like this:
"a(?:bo(?:ut|ve)|cross|fter(?:wards)?)?"

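As a rough sketch of how that pattern might be used for the original stop-word problem (the \b anchors, the /i flag and the strip_stop_words helper are my own assumptions, not something from upthread):

```perl
use strict;
use warnings;

# The condensed pattern from above, anchored on word boundaries so that
# "a" does not match inside "and" and "after" does not match inside "aftermath".
my $stop_re = qr/\b(?:a(?:bo(?:ut|ve)|cross|fter(?:wards)?)?)\b/i;

# Hypothetical helper: delete the stop words, then tidy the leftover whitespace.
sub strip_stop_words {
    my ($text) = @_;
    $text =~ s/$stop_re//g;    # remove every stop word
    $text =~ s/\s+/ /g;        # collapse runs of whitespace
    $text =~ s/^ | $//g;       # trim one leading/trailing space
    return $text;
}

print strip_stop_words("A walk across the bridge and afterwards above the aftermath"), "\n";
# prints: walk the bridge and the aftermath
```
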
Has anyone written an equivalent module in Perl to optimise list searches in regexps the way Emacs does in Lisp?
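
For what it's worth, the prefix-sharing part of the idea seems small enough to sketch by hand. The following is only an illustration, not how regexp-opt is actually implemented: it builds a character trie and serialises it into nested groups. The names words_to_regex and _serialise are made up for the example, and it factors common prefixes only, not suffixes.

```perl
use strict;
use warnings;

# Hypothetical helper: build a prefix trie from the words, then turn the
# trie into a nested alternation. The empty-string key marks "a word ends here".
sub words_to_regex {
    my @words = @_;
    my %trie;
    for my $word (@words) {
        my $node = \%trie;
        $node = ($node->{$_} ||= {}) for split //, $word;
        $node->{''} = 1;
    }
    return _serialise(\%trie);
}

sub _serialise {
    my ($node) = @_;
    my $ends_here = exists $node->{''};
    my @alts;
    for my $char (sort grep { $_ ne '' } keys %$node) {
        push @alts, quotemeta($char) . _serialise($node->{$char});
    }
    return '' unless @alts;    # leaf: nothing more to match

    # A single branch needs a group only when it has to become optional.
    my $body = (@alts == 1 && !$ends_here)
             ? $alts[0]
             : '(?:' . join('|', @alts) . ')';
    return $ends_here ? "$body?" : $body;
}

print words_to_regex(qw(a about above across after afterwards)), "\n";
# prints: a(?:bo(?:ut|ve)|cross|fter(?:wards)?)?
```

For this word list the output happens to be the same string as the Emacs result above; for other lists the grouping may well differ from what regexp-opt produces.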

Re^4: removing stop words
by fishbot_v2 (Chaplain) on May 29, 2005 at 15:34 UTC

    Yes - Jarkko Hietaniemi's Regex::PreSuf does just that.

    use Regex::PreSuf;
    my $re = presuf( qw{ a about above across after afterwards } ); # yields: a(?:bo(?:ut|ve)|cross|fter(?:wards)?)?

    If we assume you aren't paying the cost of building the regex each time (for example, you keep a stopwords file and a stopreg file and rebuild the latter whenever the former changes, or simply stat both and rebuild from the main program), then you get a significant saving:

             Rate    reg   pre1 presuf
    reg    33.1/s     --   -34%   -59%
    pre1   50.4/s    53%     --   -37%
    presuf 80.6/s   144%    60%     --

    pre1 is my simple algorithm from upthread, reg is a straight alternation, and presuf is presuf(). I used the English stoplist from Lingua::EN::StopWords (about 200 words) and a 4000-word text.
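
    The stat-and-rebuild idea mentioned above could look roughly like the sketch below. The file layout (one stop word per line, the pattern cached as a single line), the load_stop_regex name and the builder callback are all assumptions made for illustration; the benchmark above did not use this code.

```perl
use strict;
use warnings;

# Hypothetical layout: one stop word per line in $words_file, and the
# generated pattern cached as a single line in $regex_file.
# $build is any code ref that turns a word list into a pattern string.
sub load_stop_regex {
    my ($words_file, $regex_file, $build) = @_;

    # Rebuild only when the cache is missing or older than the word list.
    # mtime has one-second resolution, so an edit within the same second
    # as the last rebuild would be missed.
    my $stale = !(-e $regex_file)
             || (stat $words_file)[9] > (stat $regex_file)[9];

    if ($stale) {
        open my $in, '<', $words_file or die "Cannot read $words_file: $!";
        chomp(my @words = grep { /\S/ } <$in>);
        close $in;

        open my $out, '>', $regex_file or die "Cannot write $regex_file: $!";
        print {$out} $build->(@words), "\n";
        close $out or die "Cannot close $regex_file: $!";
    }

    open my $fh, '<', $regex_file or die "Cannot read $regex_file: $!";
    chomp(my $pattern = <$fh>);
    close $fh;
    return qr/\b(?:$pattern)\b/;
}

# A plain-alternation builder; presuf() could presumably be passed instead.
my $plain = sub { join '|', map { quotemeta } @_ };
# my $stop_re = load_stop_regex('stopwords.txt', 'stopwords.re', $plain);
```

    The builder is passed in as a code ref so the caching logic does not care which regex generator is used.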