A short set of subroutines and some data for removing all the stopwords (a, for, and, not, while, etc.) from a string. Very useful for finding the juicy bits inside a whole English phrase.

Thanx salvadors for the hints!

my @stopwords=qw( a about above according across actually adj after afterwards again against all almost alone along already also although always among amongst an and another any anyhow anyone anything anywhere are aren't around as at be became because become becomes becoming been before beforehand begin beginning behind being below beside besides between beyond billion both but by can can't cannot caption co company corp corporation could couldn't did didn't do does doesn't don't down during each eg eight eighty either else elsewhere end ending enough etc even ever every everyone everything everywhere except few fifty first five for former formerly forty found four from further had has hasn't have haven't he he'd he'll he's hence her here here's hereafter hereby herein hereupon hers herself him himself his how however hundred i i'd i'll i'm i've ie if in inc indeed instead into is isn't it it's its itself last later latter latterly least less let let's like likely ltd made make makes many maybe me meantime meanwhile might million miss more moreover most mostly mr mrs much must my myself namely neither never nevertheless next nine ninety no nobody none nonetheless noone nor not nothing now nowhere of off often on once one one's only onto or other others otherwise our ours ourselves out over overall own per perhaps rather recent recently same seem seemed seeming seems seven seventy several she she'd she'll she's should shouldn't since six sixty so some somehow someone something sometime sometimes somewhere still stop such taking ten than that that'll that's that've the their them themselves then thence there there'd there'll there're there's there've thereafter thereby therefore therein thereupon these they they'd they'll they're they've thirty this those though thousand three through throughout thru thus to together too toward towards trillion twenty two under unless unlike unlikely until up upon us used using very via ve was wasn't we we'd we'll we're we've well 
were weren't what what'll what's what've whatever when whence whenever where where's whereafter whereas whereby wherein whereupon wherever whether which while whither who who'd who'll who's whoever whole whom whomever whose why will with within without won't would wouldn't yeah yes yet you you'd you'll you're you've your yours yourself yourselves 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 );

# Lookup hash: lowercased stopword => 1
my %stop = map { lc $_ => 1 } @stopwords;

# Returns the non-stopwords of a string, in order, first occurrence only
sub findwords {
    my $string = shift;
    my (@ok, %seen);
    while ($string =~ /((\w|')+)/g) {
        push @ok, $1 unless $stop{lc $1} or $seen{lc $1}++;
    }
    return @ok;
}

1;

Re: Removing Stopwords from a String
by salvadors (Pilgrim) on Jan 06, 2001 at 23:02 UTC

    Wow! That's a lot of regular expressions going on there...

    Personally I'd do something more akin to:

    my @stopwords = qw/ i'd add all my stop words in here /;
    my %stop = map { lc $_ => 1 } @stopwords;

    sub findwords {
        my $string = shift;
        my (@ok, %seen);
        while ($string =~ /((\w|')+)/g) {
            push @ok, $1 unless $stop{lc $1} or $seen{lc $1}++;
        }
        return @ok;
    }

    My tests show this as coming out about 2 orders of magnitude faster, and it also copes better with apostrophized words that aren't in the stop list.

    Tony

      Hey, how do I use it? I mean, if I have an array containing whole strings and I want to remove these stopwords from them, how would I use this subroutine? Sorry, I am new to Perl, please reply.
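One possible way to apply the subroutine to an array of strings is sketched below. It assumes findwords is called once per string and the surviving words are rejoined with spaces. The stop list here is shortened for illustration, and the names @lines and @cleaned are made up for the example, not part of the original post.

```perl
use strict;
use warnings;

# Abbreviated stop list for illustration only; the full list is in the post above.
my @stopwords = qw( a an and for i in is not of the to while );
my %stop = map { lc $_ => 1 } @stopwords;

# Same logic as the findwords sub in the thread: returns the non-stopwords
# of a string, in order, keeping only the first occurrence of each word.
sub findwords {
    my $string = shift;
    my (@ok, %seen);
    while ($string =~ /((\w|')+)/g) {
        push @ok, $1 unless $stop{lc $1} or $seen{lc $1}++;
    }
    return @ok;
}

# Hypothetical input: an array of whole strings.
my @lines = (
    "The quick brown fox",
    "A walk in the park is not a race",
);

# Call findwords once per string and rejoin the surviving words.
my @cleaned = map { join ' ', findwords($_) } @lines;

print "$_\n" for @cleaned;
```

Because %seen is local to each call, duplicates are dropped within a single string but not across different strings in the array. If that is not wanted, the strings could be joined into one before calling findwords.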
Re: Removing Stopwords from a String
by bobf (Monsignor) on Mar 05, 2009 at 01:46 UTC

    CPAN to the rescue...

    From Lingua::StopWords:

    use Lingua::StopWords qw( getStopWords );
    my $stopwords = getStopWords('en');
    my @words = qw( i am the walrus goo goo g'joob );
    # prints "walrus goo goo g'joob"
    print join ' ', grep { !$stopwords->{$_} } @words;

    From Lingua::EN::StopWords:

    use Lingua::EN::StopWords qw(%StopWords);
    my @words = ...;
    # Print non-stopwords in @words
    print join " ", grep { !$StopWords{$_} } @words;

Re: Removing Stopwords from a String
by Anonymous Monk on Nov 12, 2009 at 07:44 UTC
    This code didn't work :( Is there any other way to remove stop words? I am trying to remove stop words from some French data.
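The stop lists in this thread are English-only, which may be why French text passes through unchanged. A rough sketch of the same approach with a French list follows. The word list is a tiny hand-written sample and the name findwords_fr is invented for the example. Lingua::StopWords, mentioned above, documents 'fr' among its language codes, so getStopWords('fr') is probably a better source for a real list.

```perl
use strict;
use warnings;

# Tiny illustrative French stop list (an assumption, far from complete).
my %stop_fr = map { $_ => 1 } qw( le la les un une de du des et dans sont );

# Same matching loop as findwords in the thread, with the French list swapped in.
sub findwords_fr {
    my $string = shift;
    my (@ok, %seen);
    while ($string =~ /((\w|')+)/g) {
        push @ok, $1 unless $stop_fr{lc $1} or $seen{lc $1}++;
    }
    return @ok;
}

my @kept = findwords_fr("Le chat et le chien sont dans la maison");
print join(' ', @kept), "\n";
```

For text with accented letters, the input generally has to be decoded into Perl characters first (for example with Encode::decode on data read from a file), otherwise \w and lc may not treat letters such as "é" as word characters. Elided forms such as "l'" also stay attached to the following word with this regex, so they would need their own handling.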