demisheep has asked for the wisdom of the Perl Monks concerning the following question:

I am using the following example from Lingua::StopWords:

use Lingua::StopWords qw( getStopWords );
my $stopwords = getStopWords('en');

my @words = qw( i am the walrus goo goo g'joob );

# prints "walrus goo goo g'joob"
print join ' ', grep { !$stopwords->{$_} } @words;

How do I get it to use my $document, remove stopwords and print the results to a file? See my code here:

open(FILESOURCE, "sample.txt") or die("Unable to open requested file.");
my $document = <FILESOURCE>;
close (FILESOURCE);

open(TEST, "results_stopwords.txt") or die("Unable to open requested file.");

use Lingua::StopWords qw( getStopWords );
my $stopwords = getStopWords('en');

print join ' ', grep { !$stopwords->{$_} } $document;

I tried these variations:

print join ' ', grep { !$stopwords->{$_} } TEST;

print TEST join ' ', grep { !$stopwords->{$_} } @words;

Basically, how do I read in a document, remove the stop words and then write the result to a new file?


Replies are listed 'Best First'.
Re: How do I read in a document, remove the stop words and then write the result to a new file?
by toolic (Bishop) on May 14, 2012 at 17:50 UTC
    To write to a file, change:
    open(TEST, "results_stopwords.txt") or die("Unable to open requested file.");

    to:

    open(TEST, ">results_stopwords.txt") or die("Unable to open requested file.");
    See the documentation for open (perldoc -f open).
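
    The same fix can also be written with the three-argument form of open and a lexical filehandle. This is a minimal sketch of a common idiom, not something the original post used; the file name is taken from the question and the line written is only sample data.

```perl
use strict;
use warnings;

# Output path taken from the question; the content written is sample data.
my $path = 'results_stopwords.txt';

# '>' opens for writing: the file is created, or truncated if it exists.
# Putting the mode in its own argument keeps it separate from the file name.
open(my $out, '>', $path) or die "Unable to open $path for writing: $!";

print {$out} "walrus goo goo g'joob\n";

# Checking close catches write errors that were still buffered.
close($out) or die "Unable to close $path: $!";
```

    Using '>>' instead of '>' would append to the file rather than truncate it.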
Re: How do I read in a document, remove the stop words and then write the result to a new file?
by grg (Initiate) on May 14, 2012 at 18:53 UTC
    You need to be sure to split each line from the input file into words before you check for stopwords. Here's a simplistic example.
    ...
    use Lingua::StopWords qw|getStopWords|;
    my $stopwords = getStopWords( 'en' );

    open my $infile, '<', 'fulltext.txt' or die "$!\n";
    open my $outfile, '>', 'nostopwords.txt' or die "$!\n";

    while (my $line = <$infile>) {
        my @words_all = split /\s+/, $line;
        my @words_nostop = grep { !$stopwords->{$_} } @words_all;
        print {$outfile} join( ' ', @words_nostop ), "\n";
    }

    close $infile;
    close $outfile;
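
    To see why the split matters, here is a small sketch comparing grep on a whole line with grep on its words. The $stopwords hash below is a hand-made stand-in for getStopWords('en') (an assumption, so the snippet runs without the module installed), and the sample line is made up.

```perl
use strict;
use warnings;

# Stand-in for getStopWords('en'): a tiny hand-made hash reference.
# This is an assumption for the sketch, not the module's real word list.
my $stopwords = { map { $_ => 1 } qw( i am the ) };

my $line = "i am the walrus\n";

# Unsplit: grep sees a single element (the whole line), which is not a
# key in the hash, so nothing is filtered out.
my @unsplit = grep { !$stopwords->{$_} } $line;

# Split first: each word is looked up in the hash on its own.
my @kept = grep { !$stopwords->{$_} } split ' ', $line;

print scalar(@unsplit), " element kept when unsplit\n";  # prints "1 element kept when unsplit"
print join(' ', @kept), "\n";                            # prints "walrus"
```

    This is also why passing $document straight to grep, as in the original post, returns the text unchanged.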
Re: How do I read in a document, remove the stop words and then write the result to a new file?
by ww (Archbishop) on May 14, 2012 at 19:17 UTC

    toolic and grg answer your questions; your cross-post at Stack Overflow has an alternative.

    Since the Lingua::StopWords documentation provides an example of how to get the stopwords themselves (e.g., the function getStopWords('en')), I have to read your intent as creating a new file with the content of the original less the stopwords.

    Do you intend to use the new file as part of an index; perhaps inserting the words into a database for the purpose of cross-referencing the source of certain words?

    Or do you have some other more arcane intent? Won't the original, minus the stock set of stopwords provided by the module, be intelligible?

Re: How do I read in a document, remove the stop words and then write the result to a new file?
by Anonymous Monk on May 14, 2012 at 17:42 UTC
Re: How do I read in a document, remove the stop words and then write the result to a new file?
by Kenosis (Priest) on May 15, 2012 at 00:48 UTC

    The getStopWords function returns a hash reference, which we can dereference to get the hash keys (the stop words). Running one s///g substitution per stop word across your entire document then removes every stop word found:

    use strict;
    use warnings;
    use Lingua::StopWords qw(getStopWords);

    {
        open my $infile, '<fulltext.txt' or die $!;
        my $fulltext = do { local $/; <$infile> };

        $fulltext =~ s/ *?\b$_\b *?//gi for keys %{ getStopWords('en') };

        open my $outfile, '>nostopwords.txt' or die $!;
        print $outfile $fulltext;
    }

    Lexical variables (my) are used for file handles within the code block, so we don't need to explicitly close the opened files, since the files will automatically close when those variables fall out of scope.
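
    A short sketch of that scoping behaviour, using a hypothetical file name (scope_demo.txt) and made-up content: the handle is only visible inside the block, and the data can be read back once the block has ended.

```perl
use strict;
use warnings;

# Hypothetical file name, used only for this sketch.
my $path = 'scope_demo.txt';

{
    open my $out, '>', $path or die $!;
    print $out "walrus\n";
}   # $out goes out of scope here, so Perl flushes and closes the file

# Reading the file back shows that the buffered output reached the file.
open my $in, '<', $path or die $!;
my $content = do { local $/; <$in> };
close $in;

print $content;  # prints "walrus"
```

    One trade-off: an implicit close cannot report errors, so an explicit close ... or die is still worth having when a failed write must be noticed.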

    Hope this helps!