How do I read in a document, remove the stop words and then write the result to a new file?

demisheep has asked for the wisdom of the Perl Monks concerning the following question:

I am using the following example from Lingua::StopWords:

use Lingua::StopWords qw( getStopWords );
my $stopwords = getStopWords('en');

my @words = qw( i am the walrus goo goo g'joob );

# prints "walrus goo goo g'joob"
print join ' ', grep { !$stopwords->{$_} } @words;

How do I get it to use my $document, remove stopwords and print the results to a file? See my code here:

open(FILESOURCE, "sample.txt") or die("Unable to open requested file.");
my $document = <FILESOURCE>;
close (FILESOURCE);

open(TEST, "results_stopwords.txt") or die("Unable to open requested file.");

use Lingua::StopWords qw( getStopWords );
my $stopwords = getStopWords('en');

print join ' ', grep { !$stopwords->{$_} } $document;

I tried these variations:

print join ' ', grep { !$stopwords->{$_} } TEST;

print TEST join ' ', grep { !$stopwords->{$_} } @words;

Basically, how do I read in a document, remove the stop words and then write the result to a new file?

Re: How do I read in a document, remove the stop words and then write the result to a new file? by toolic (Bishop) on May 14, 2012 at 17:50 UTC
To write to a file, change:

    open(TEST, "results_stopwords.txt") or die("Unable to open requested file.");

to:

    open(TEST, ">results_stopwords.txt") or die("Unable to open requested file.");

See the documentation for open.
Re: How do I read in a document, remove the stop words and then write the result to a new file? by grg (Initiate) on May 14, 2012 at 18:53 UTC
You need to be sure to split each line from the input file into words before you check for stopwords. Here's a simplistic example.

    ...
    use Lingua::StopWords qw|getStopWords|;
    my $stopwords = getStopWords( 'en' );

    open my $infile,  '<', 'fulltext.txt'    or die "$!\n";
    open my $outfile, '>', 'nostopwords.txt' or die "$!\n";

    while (my $line = <$infile>) {
        my @words_all    = split /\s+/, $line;
        my @words_nostop = grep { !$stopwords->{$_} } @words_all;
        print {$outfile} join( ' ', @words_nostop ), "\n";
    }

    close $infile;
    close $outfile;
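One refinement to the loop above, offered as a sketch rather than a definitive fix: the lookup is an exact hash-key match, so tokens such as "The" or "the," slip through if the stop-word keys are lowercase and unpunctuated (which, as far as I know, is how the English list is stored). The helper name `filter_line` and the three-word stand-in hash are hypothetical and only there so the example runs without the module; in real use pass the hash reference from `getStopWords('en')` instead.

```perl
use strict;
use warnings;

# Hypothetical helper: drop stop words from one line of text.
# $stopwords is a hash reference keyed by lowercase stop words,
# the same shape that getStopWords('en') returns.
sub filter_line {
    my ($line, $stopwords) = @_;
    my @kept;
    for my $word (split ' ', $line) {
        # Normalize a copy for the lookup only: lowercase it and strip
        # leading/trailing punctuation. The original token is what we keep.
        (my $key = lc $word) =~ s/^[^\w']+|[^\w']+$//g;
        push @kept, $word unless $stopwords->{$key};
    }
    return join ' ', @kept;
}

# Assumption: a tiny stand-in list instead of getStopWords('en').
my $stopwords = { map { $_ => 1 } qw(i am the) };

# prints: Walrus, goo goo g'joob!
print filter_line("I am the Walrus, goo goo g'joob!", $stopwords), "\n";
```

Using `split ' '` rather than `split /\s+/` also avoids an empty leading field when a line starts with whitespace.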
Re: How do I read in a document, remove the stop words and then write the result to a new file? by ww (Archbishop) on May 14, 2012 at 19:17 UTC
toolic and grg answer your questions; your cross-post at Stack Overflow has an alternate. Since the Lingua::StopWords documentation provides an example of how to get the stopwords themselves (e.g., function getStopWords(en)), I have to read your intent as creating a new file with the content of the original less the stopwords. Do you intend to use the new file as part of an index, perhaps inserting the words into a database for the purpose of cross-referencing the source of certain words? Or do you have some other, more arcane intent? Won't the original, minus the stock set of stopwords provided by the module, be intelligible?
Re: How do I read in a document, remove the stop words and then write the result to a new file? by Anonymous Monk on May 14, 2012 at 17:42 UTC
http://stackoverflow.com/questions/10484620/how-can-i-get-this-to-print-to-my-file-instead-of-the-screen-in-my-perl-program
Re: How do I read in a document, remove the stop words and then write the result to a new file? by Kenosis (Priest) on May 15, 2012 at 00:48 UTC
The getStopWords function returns a hash reference, which we can dereference to get the hash keys (the stop words) for multiple s///g operations across your entire document to remove all found stop words:

    use strict;
    use warnings;
    use Lingua::StopWords qw(getStopWords);

    {
        open my $infile, '<fulltext.txt' or die $!;
        my $fulltext = do { local $/; <$infile> };

        $fulltext =~ s/ ?\b$_\b ?//gi for keys %{ getStopWords('en') };

        open my $outfile, '>nostopwords.txt' or die $!;
        print $outfile $fulltext;
    }

Lexical variables (`my`) are used for file handles within the code block, so we don't need to explicitly `close` the `open`ed files, since the files will automatically `close` when those variables fall out of scope.

Hope this helps!
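One thing to watch with the per-word substitution: the pattern consumes an optional space on both sides of a match, so a stop word sitting between two kept words can glue them together ("saw the walrus" comes out as "sawwalrus"). A possible variation, sketched here under assumptions, builds a single alternation from the keys, removes the matches in one pass, and tidies the leftover spacing afterwards. The name `strip_stopwords` and the three-word stand-in hash are hypothetical; in real use pass the result of `getStopWords('en')`.

```perl
use strict;
use warnings;

# Hypothetical helper: remove stop words from a whole document in one pass.
# $stopwords is a hash reference keyed by stop words.
sub strip_stopwords {
    my ($text, $stopwords) = @_;

    # Longest keys first so a longer entry is tried before its prefix;
    # quotemeta guards any key containing regex metacharacters.
    my $alternation = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) } keys %$stopwords;
    return $text unless length $alternation;

    # The lookarounds act like \b but also count the apostrophe as a
    # word character, so "I'm" is not cut down to "'m" by the key "i".
    $text =~ s/(?<![\w'])(?:$alternation)(?![\w'])//gi;

    # Collapse the gaps left behind without touching newlines.
    $text =~ s/[ \t]{2,}/ /g;
    $text =~ s/^[ \t]+|[ \t]+$//mg;
    return $text;
}

# Assumption: a tiny stand-in list instead of getStopWords('en').
my $stopwords = { map { $_ => 1 } qw(i am the) };

# prints "saw walrus" and "and eggman" on separate lines
print strip_stopwords("I saw the walrus\nand I am the eggman\n", $stopwords);
```

Because the text is slurped and processed as one string, line breaks in the original document are preserved in the output.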