in reply to removing stop words

Post your text, or a sample of it. Also, your regexp can be written more succinctly as:
\b(?:a|about|above|across|after|afterwards)\b

Re^2: removing stop words
by fishbot_v2 (Chaplain) on May 29, 2005 at 00:40 UTC

    Also, if those are your actual stopwords (and I suspect they likely aren't, but anyway) then you can optimize thusly:

    \b(?:a(?:bout|bove|cross|fter|fterwards|))\b

    ...though perhaps it comes at the expense of readability/maintainability. If your regex is generated automatically from a stoplist, then some slightly slower regex generation code might produce a much faster regex. YMMV.
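As a baseline for generating the regex from a stoplist, here is a minimal sketch that builds the plain (unfactored) alternation. The stoplist, the sample text and the name stoplist_regex are made up for illustration; a real stoplist would normally be read from a file.

```perl
use strict;
use warnings;

# Hypothetical stoplist; in practice this would come from a file.
my @stopwords = qw(a about above across after afterwards);

# Build a plain alternation anchored on word boundaries.
# quotemeta protects against regex metacharacters in the list.
sub stoplist_regex {
    my @words = sort { length($b) <=> length($a) || $a cmp $b } @_;
    my $alt   = join '|', map { quotemeta } @words;
    return qr/\b(?:$alt)\b/;
}

my $re   = stoplist_regex(@stopwords);
my $text = 'a story about life after the flood';

# Remove each stop word together with any whitespace that follows it.
(my $clean = $text) =~ s/$re\s*//g;
print "$clean\n";
```

Sorting longest-first is not strictly needed here, because the trailing \b forces backtracking into the longer alternatives, but it keeps the generated pattern predictable.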

      Emacs has some wicked functions for turning lists of words into optimised regexp queries.

      In fact the list given above would condense much further, since about and above share the abo prefix, and after and afterwards share the after prefix.

      If you execute the following elisp command in Emacs:
      (regexp-opt '("a" "about" "above" "across" "after" "afterwards"))

      You get:
      "a\\(?:bo\\(?:ut\\|ve\\)\\|cross\\|fter\\(?:wards\\)?\\)?"

      ..which should be a much more efficient search expression.

      Inside an Emacs Lisp string each backslash has to be doubled, and Emacs regexps also backslash-escape the grouping and alternation operators, so in Perl the same optimised regexp looks like this:
      "a(?:bo(?:ut|ve)|cross|fter(?:wards)?)?"

      Has anyone written an equivalent module in Perl to optimise list searches in regexps the way Emacs has in Lisp?
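For illustration, the prefix sharing described above can be sketched in a few lines of Perl using a trie built from nested hashes. This is only a sketch of the general idea, not how regexp-opt or any CPAN module is actually implemented, and the names build_trie and trie_to_regex are hypothetical. It factors common prefixes only, not suffixes.

```perl
use strict;
use warnings;

# Insert each word into a nested-hash trie.
# The empty-string key marks "a word ends here".
sub build_trie {
    my %trie;
    for my $word (@_) {
        my $node = \%trie;
        $node = $node->{$_} //= {} for split //, $word;
        $node->{''} = 1;
    }
    return \%trie;
}

# Turn a trie node into a regex fragment for everything below it.
sub trie_to_regex {
    my ($node) = @_;
    my $optional = exists $node->{''};    # a word may stop at this node
    my @alts;
    for my $ch (sort grep { $_ ne '' } keys %$node) {
        push @alts, quotemeta($ch) . trie_to_regex($node->{$ch});
    }
    return '' unless @alts;

    my $body = @alts == 1 ? $alts[0] : '(?:' . join('|', @alts) . ')';
    if ($optional) {
        # A multi-character single branch needs a group before the '?'.
        $body = "(?:$body)" if @alts == 1 && length($alts[0]) > 1;
        $body .= '?';
    }
    return $body;
}

my @words   = qw(a about above across after afterwards);
my $pattern = trie_to_regex(build_trie(@words));
my $re      = qr/\b(?:$pattern)\b/;
print "$pattern\n";
```

For the six-word list in this thread the sketch should produce the same factored pattern as shown above, which is then wrapped in \b anchors before use.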

        Yes - Jarkko Hietaniemi's Regex::PreSuf does just that.

        use Regex::PreSuf;
        my $re = presuf( qw{ a about above across after afterwards } ); # yields: a(?:bo(?:ut|ve)|cross|fter(?:wards)?)?

        If we assume you aren't incurring the cost of building the regex each time (possibly you keep a stopwords file and stopreg file and rebuild the latter from the former when the former changes, or simply stat and rebuild from the main program...) then you get a significant savings:

                  Rate    reg   pre1 presuf
        reg     33.1/s     --   -34%   -59%
        pre1    50.4/s    53%     --   -37%
        presuf  80.6/s   144%    60%     --

        pre1 is my simple algorithm from upthread, reg is a straight alternation, and presuf is presuf(). I used the English stoplist from Lingua::EN::StopWords (about 200 words) and a 4000-word text.
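A comparison of this kind can be sketched with the core Benchmark module. The following is not the script that produced the numbers above: the six-word stoplist, the synthetic 4000-word text and the name count_hits are all stand-ins, and it compares only the plain alternation against the hand-factored pattern. Timings will vary by machine and Perl version; newer perls optimise alternations of literal strings internally, so the gap may be much smaller than in the table, or absent.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical inputs standing in for the real stoplist and text.
my @stop  = qw(a about above across after afterwards);
my @vocab = (@stop, qw(regex perl monk stream filter token));
my $text  = join ' ', map { $vocab[ $_ % @vocab ] } 1 .. 4000;

# Both regexes are compiled once, outside the timed code.
my $plain    = qr/\b(?:a|about|above|across|after|afterwards)\b/;
my $factored = qr/\ba(?:bo(?:ut|ve)|cross|fter(?:wards)?)?\b/;

# Count all matches of a regex in the text.
sub count_hits {
    my ($re) = @_;
    my $n = () = $text =~ /$re/g;
    return $n;
}

# Sanity check: both patterns must find the same stop words.
die "regexes disagree" unless count_hits($plain) == count_hits($factored);

# Run each variant a fixed number of times and print a comparison table.
cmpthese(200, {
    plain    => sub { count_hits($plain) },
    factored => sub { count_hits($factored) },
});
```

Compiling the patterns with qr// up front mirrors the assumption that the cost of building the regex is not paid on every run.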

      Taking it further, with the word boundaries added:
      /\ba(?:bo(?:ut|ve)|cross|fter(?:wards)?)?\b/