removing stop words

zulqernain has asked for the wisdom of the Perl Monks concerning the following question:

Re: removing stop words by moot (Chaplain) on May 29, 2005 at 00:33 UTC
Post your text, or a sample of it. Also, your regexp can be written more succinctly as: `\b(?:a|about|above|across|after|afterwards)\b`
Re^2: removing stop words by fishbot_v2 (Chaplain) on May 29, 2005 at 00:40 UTC
Also, if those are your actual stopwords (and I suspect they likely aren't, but anyway) then you can optimize thusly: `\b(?:a(?:bout|bove|cross|fter|fterwards|))\b` ...though perhaps it comes at the expense of readability/maintainability. If your regex is generated automatically from a stoplist, then some slightly slower regex generation code might produce a much faster regex. YMMV.
Re^3: removing stop words by monarch (Priest) on May 29, 2005 at 13:01 UTC
Emacs has some wicked functions for turning lists of words into optimised regexp queries. In fact, the list given above would condense much further, as about and above both share the abo prefix, and after and afterwards both share the after prefix. If you evaluate the following elisp expression in Emacs: `(regexp-opt '("a" "about" "above" "across" "after" "afterwards"))` you get: `"a\\(?:bo\\(?:ut\\|ve\\)\\|cross\\|fter\\(?:wards\\)?\\)?"` which should be a much more efficient search expression. Emacs uses the double backslash to escape characters, so in Perl the same optimised regexp looks like this: `"a(?:bo(?:ut|ve)|cross|fter(?:wards)?)?"` Has anyone written an equivalent module in Perl to optimise list searches in regexps the way Emacs has in Lisp?
Re^4: removing stop words by fishbot_v2 (Chaplain) on May 29, 2005 at 15:34 UTC
Re^3: removing stop words by salva (Canon) on May 29, 2005 at 13:59 UTC
taking it further... `/\ba(?:bo(?:ut|ve)|cross|fter(?:wards)?)?\b/`
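The prefix factoring discussed in this sub-thread can be automated. What follows is only a minimal sketch in core Perl, not the Emacs `regexp-opt` algorithm and not any particular CPAN module: it builds a character trie from the word list and serialises it into a nested alternation. The names `build_trie` and `trie_to_regex` are made up for illustration, and the word list is the one from the thread.

```perl
use strict;
use warnings;

# Build a nested-hash trie. The empty-string key marks "a word ends here".
sub build_trie {
    my @words = @_;
    my %trie;
    for my $word (@words) {
        my $node = \%trie;
        for my $char (split //, $word) {
            $node->{$char} = {} unless exists $node->{$char};
            $node = $node->{$char};
        }
        $node->{''} = 1;
    }
    return \%trie;
}

# Serialise a trie node into a regex fragment matching every suffix below it.
sub trie_to_regex {
    my ($node) = @_;
    my $ends_here = exists $node->{''};
    my @alts;
    for my $char (sort grep { $_ ne '' } keys %$node) {
        push @alts, quotemeta($char) . trie_to_regex($node->{$char});
    }
    return '' unless @alts;    # leaf node: nothing further to match

    # A single branch needs no group unless it has to be made optional.
    my $body = (@alts == 1 && !$ends_here)
        ? $alts[0]
        : '(?:' . join('|', @alts) . ')';
    return $ends_here ? "$body?" : $body;
}

my @stopwords = qw(a about above across after afterwards);
my $pattern   = trie_to_regex(build_trie(@stopwords));
my $stop_re   = qr/\b(?:$pattern)\b/;

print "$pattern\n";    # a(?:bo(?:ut|ve)|cross|fter(?:wards)?)?

(my $text = 'a foo about foo afterwards') =~ s/$stop_re//g;
print "[$text]\n";     # [ foo  foo ]
```

For the six words in the thread this reproduces the hand-optimised pattern posted above. Whether it is actually faster than the plain alternation depends on the Perl version and the data, so benchmark before relying on it. As far as I know, CPAN's Regexp::Assemble addresses the same problem far more thoroughly.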
Re: removing stop words by graff (Chancellor) on May 29, 2005 at 00:58 UTC
Let's suppose you have your list of stop words in a plain text file -- this would be handy, in case you decide you want to lengthen or shorten the list now and then, because you won't need to modify your script if it's done something like this (updated to add a bit more commentary), assuming one stop word per line: `open( LIST, "mystopwords.txt" ) or die "$!"; my @stopwords = <LIST>; close LIST; chomp @stopwords; my $stopregex = join '|', @stopwords;` Now, when you go to delete stopwords from `$_`, it goes like this: `s/\b(?:$stopregex)\b//g;` I presume you are involved in some process that removes punctuation as well. If you're not, then removal of just the stopwords will leave behind some odd patterns (e.g. if input includes things like "about-face", "morning-after pill", "man-about-town", and so on). (You didn't mention whether you were using the "g" modifier when removing the stop words. Could that have been your problem?) Another update: the regex approach works fine and might even be optimal, but there's another way, of course. Again assuming one word per line: `my %stopwd; open( LIST, "my_stopwords.txt" ) or die "$!"; while (<LIST>) { chomp; $stopwd{$_} = undef; } close LIST;` Now, to remove stopwords, split the input data (`$_`) on `\b` and check each token: `my $filtered = join '', map { exists($stopwd{$_}) ? '' : $_ } split /\b/;`
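Both approaches from the post above can be put side by side in one runnable sketch. It is a variation under stated assumptions, not graff's exact code: it uses a lexical filehandle, escapes each word with `quotemeta` in case the list ever contains regex metacharacters, and reads the stop list from an in-memory string so the example is self-contained. The sub names and the sample input are hypothetical.

```perl
use strict;
use warnings;

# Read one stop word per line from an open filehandle, skipping blank lines.
sub load_stopwords {
    my ($fh) = @_;
    chomp(my @words = <$fh>);
    return grep { length($_) } @words;
}

# Approach 1: a single alternation anchored with \b on both sides.
sub strip_with_regex {
    my ($text, @stopwords) = @_;
    my $alternation = join '|', map { quotemeta($_) } @stopwords;
    $text =~ s/\b(?:$alternation)\b//g;
    return $text;
}

# Approach 2: split at word boundaries and drop tokens found in a hash.
sub strip_with_hash {
    my ($text, @stopwords) = @_;
    my %is_stop;
    $is_stop{$_} = 1 for @stopwords;
    return join '', grep { !$is_stop{$_} } split /\b/, $text;
}

# Assumption: the stop list is supplied as an in-memory "file" for the demo;
# in practice this would be a real file such as mystopwords.txt.
my $list = "a\nabout\nabove\nacross\nafter\nafterwards\n";
open my $fh, '<', \$list or die "Cannot open stop list: $!";
my @stopwords = load_stopwords($fh);
close $fh;

my $input = 'a foo about-face, the morning-after pill';
print '[', strip_with_regex($input, @stopwords), "]\n";
print '[', strip_with_hash($input, @stopwords),  "]\n";
# Both lines print: [ foo -face, the morning- pill]
```

The output shows the leftovers the post warns about: "about-face" becomes "-face" and "morning-after" becomes "morning-", so punctuation and whitespace need a separate clean-up pass. Both versions are case-sensitive as written.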
Re: removing stop words by tlm (Prior) on May 29, 2005 at 00:35 UTC
It works fine for me: `use strict; use warnings; my $s = 'a foo about foo above foo across foo after foo afterwards'; $s =~ s/\ba\b|\babout\b|\babove\b|\bacross\b|\bafter\b|\bafterwards\b//g; print "$s\n";` Output (extra whitespace aside): `foo foo foo foo foo` BTW, you can tighten that regexp without loss of generality: `$s =~ s/\b(?:a|about|above|across|after|afterwards)\b//g;` Update: Thanks to graff for reminding me that the capture was not necessary. Added the `?:` bit. the lowliest monk