zulqernain has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am trying to remove the stopwords from a text, and I am using
\ba\b|\babout\b|\babove\b|\bacross\b|\bafter\b|\bafterwards\b
etc., but some of the words are matched and some are not. I don't know why.

Replies are listed 'Best First'.
Re: removing stop words
by moot (Chaplain) on May 29, 2005 at 00:33 UTC
    Post your text, or a sample of it. Also, your regexp can be written more succinctly as:
    \b(?:a|about|above|across|after|afterwards)\b

      Also, if those are your actual stop words (and I suspect they likely aren't, but anyway) then you can optimize thusly:

      \b(?:a(?:bout|bove|cross|fter|fterwards|))\b

      ...though perhaps it comes at the expense of readability/maintainability. If your regex is generated automatically from a stoplist, then some slightly slower regex generation code might produce a much faster regex. YMMV.

        Emacs has some wicked functions for turning lists of words into optimised regexp queries.

        In fact, the list given above would condense much further, as about and above both share the abo prefix, and after and afterwards both share the after prefix, etc.

        If you execute the following elisp command in Emacs:
        (regexp-opt '("a" "about" "above" "across" "after" "afterwards"))

        You get:
        "a\\(?:bo\\(?:ut\\|ve\\)\\|cross\\|fter\\(?:wards\\)?\\)?"

        ...which should be a much more efficient search expression.

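        The doubled backslashes appear because Emacs prints the regexp as a Lisp string, and its regexp syntax backslash-escapes the grouping and alternation operators. In Perl the same optimised regexp looks like this: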
        "a(?:bo(?:ut|ve)|cross|fter(?:wards)?)?"

        Has anyone written an equivalent module in Perl to optimise list searches in regexps the way Emacs has in Lisp?

        taking it further...
        /\ba(?:bo(?:ut|ve)|cross|fter(?:wards)?)?\b/
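As a rough illustration of what such prefix-sharing code might look like in Perl, here is a minimal sketch that builds the pattern from a nested-hash trie. The names build_trie_regex and _node_to_regex are hypothetical helpers invented for this example, not part of any module mentioned in the thread, and the `//=` operator assumes perl 5.10 or later. It treats every word as a literal string.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): build a prefix-sharing
# pattern from a list of literal words by way of a nested-hash trie.
sub build_trie_regex {
    my @words = @_;
    my %trie;
    for my $word (@words) {
        my $node = \%trie;
        # Walk (and create) one trie level per character.
        $node = ( $node->{$_} //= {} ) for split //, $word;
        $node->{''} = 1;    # the empty key marks "a word ends here"
    }
    return _node_to_regex( \%trie );
}

# Turn one trie node into a regex fragment, recursing into children.
sub _node_to_regex {
    my ($node) = @_;
    my $ends_here = exists $node->{''};
    my @alts;
    for my $char ( sort grep { $_ ne '' } keys %$node ) {
        push @alts, quotemeta($char) . _node_to_regex( $node->{$char} );
    }
    return '' unless @alts;

    # A single branch needs no group unless it must be made optional.
    my $body = ( @alts == 1 && !$ends_here )
        ? $alts[0]
        : '(?:' . join( '|', @alts ) . ')';
    return $ends_here ? "$body?" : $body;
}

my $pattern = build_trie_regex(qw(a about above across after afterwards));
print "$pattern\n";

my $sample = 'a foo about foo afterwards';
( my $clean = $sample ) =~ s/\b(?:$pattern)\b//g;
print "$clean\n";
```

For the six words in this thread the sketch should print the same pattern that regexp-opt produced above. Whether the hand-built form is actually faster than the plain alternation is worth benchmarking on your own data, since the regex engine may already optimise alternations of literal strings on its own.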
Re: removing stop words
by graff (Chancellor) on May 29, 2005 at 00:58 UTC
    Let's suppose you have your list of stop words in a plain text file -- this would be handy, in case you decide you want to lengthen or shorten the list now and then, because you won't need to modify your script if it's done something like this (updated to add a bit more commentary):
    open( LIST, "mystopwords.txt" ) or die "$!";
    my @stopwords = <LIST>;    # assuming one stop word per line
    close LIST;
    chomp @stopwords;
    my $stopregex = join '|', @stopwords;

    # ... now, when you go to delete stopwords from $_,
    # it goes like this:

    s/\b(?:$stopregex)\b//g;
    I presume you are involved in some process that removes punctuation as well. If you're not, then removal of just the stopwords will leave behind some odd patterns (e.g. if input includes things like "about-face", "morning-after pill", "man-about-town", and so on).

    (You didn't mention whether you were using the "g" modifier when removing the stop words. Could that have been your problem?)
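To make the effect of the modifiers concrete, here is a small sketch comparing the substitution without /g, with /g, and with /gi. The sample sentence and word list are made up for illustration. The quotemeta call is an extra precaution not in the code above, in case a stop list ever contains regex metacharacters. A capitalised word at the start of a sentence is one possible reason for "some words match and some don't", though that is only a guess about the original problem.

```perl
use strict;
use warnings;

# Sample data invented for this example.
my @stopwords = qw(a about above across after afterwards);
my $text      = 'About a week after the storm, a crew went above and across.';

# quotemeta guards against regex metacharacters in the word list.
my $alternation = join '|', map { quotemeta($_) } @stopwords;

( my $once = $text ) =~ s/\b(?:$alternation)\b//;      # no /g: first match only
( my $all  = $text ) =~ s/\b(?:$alternation)\b//g;     # /g: every case-sensitive match
( my $any  = $text ) =~ s/\b(?:$alternation)\b//gi;    # /gi: also ignores case

print "[$_]\n" for $once, $all, $any;
```

Without /g only the first "a" is removed. With /g the capitalised "About" still survives, and only /gi removes it as well.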

    Another update: the regex approach works fine and might even be optimal, but there's another way, of course:

    my %stopwd;
    open( LIST, "my_stopwords.txt" ) or die "$!";
    while (<LIST>) {
        chomp;
        $stopwd{$_} = undef;    # assume one word per line
    }
    close LIST;

    # now, to remove stopwords, split the input data ($_) on \b
    # and check each token:

    my $filtered = join '', map { exists($stopwd{$_}) ? '' : $_ } split /\b/;
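Here is a self-contained variation on the hash approach, as a sketch only: the word list is inlined instead of read from a file, grep replaces the map/ternary, and lc() is added so that capitalised words are caught too. The hash name %is_stopword and the sample sentence are invented for the example.

```perl
use strict;
use warnings;

# Word list and input are invented for this example.
my %is_stopword = map { $_ => 1 } qw(a about above across after afterwards);
my $text = 'A man-about-town walked across the square after lunch.';

# split /\b/ yields alternating word and non-word tokens; lc() lets
# "A" match the lowercase entry "a". Non-word tokens pass through.
my $filtered = join '', grep { !$is_stopword{ lc($_) } } split /\b/, $text;

print "[$filtered]\n";
```

The result also shows the leftover punctuation mentioned earlier: "man-about-town" ends up as "man--town", so some cleanup of hyphens and doubled spaces is likely needed afterwards.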
Re: removing stop words
by tlm (Prior) on May 29, 2005 at 00:35 UTC

    It works fine for me:

    use strict;
    use warnings;

    my $s = 'a foo about foo above foo across foo after foo afterwards';
    $s =~ s/\ba\b|\babout\b|\babove\b|\bacross\b|\bafter\b|\bafterwards\b//g;
    print "$s\n";
    __END__
     foo  foo  foo  foo  foo

    BTW, you can tighten that regexp without loss of generality:

    $s =~ s/\b(?:a|about|above|across|after|afterwards)\b//g;

    Update: Thanks to graff for reminding me that the capture was not necessary. Added the ?: bit.

    the lowliest monk