mkurtis has asked for the wisdom of the Perl Monks concerning the following question:

I'm pulling the most commonly used words out of a text file and putting them in another, but I don't want any "a" or "an" or "the" etc. in it. Here's what I've got:
#!/usr/bin/perl
for($i=1; $i<20000; $i++) {
    open(IN,"/var/www/data/$i.txt") || die $!;
    my $url = <IN>;
    my %banned = (
        "a" => undef, "at" => undef, "be" => undef, "for" => undef,
        "and" => undef, "to" => undef, "of" => undef, "in" => undef,
        "the" => undef, "as" => undef, "i" => undef, "it" => undef,
        "are" => undef, "is" => undef, "am" => undef, "on" => undef,
        "an" => undef, "you" => undef, "me" => undef, "b" => undef,
        "c" => undef, "d" => undef, "e" => undef, "f" => undef,
        "g" => undef, "h" => undef, "j" => undef, "k" => undef,
        "l" => undef, "m" => undef, "n" => undef, "o" => undef,
        "p" => undef, "q" => undef, "r" => undef, "s" => undef,
        "t" => undef, "u" => undef, "v" => undef, "w" => undef,
        "x" => undef, "y" => undef, "z" => undef, "0" => undef,
        "1" => undef, "2" => undef, "3" => undef, "4" => undef,
        "5" => undef, "6" => undef, "7" => undef, "8" => undef,
        "9" => undef, "10" => undef,
    );
    while(<IN>){
        chomp;
        @words = split(/\W/, $_);
    }
    close(IN);
    foreach $word (@words) {
        $word = lc($word);
        OUTER: while( ($key, $value) = each %banned ) {
            if($key eq $word) {
                $word = "";
                last OUTER;
            }
            else {
                $count{$word}++ unless $word eq "";
            }
        }
    }
    @keys = reverse sort { $count{$a} <=> $count{$b} } keys %count;
    @keys = splice(@keys, 0, 50);
    print "File: $i\n";
    open(FILE,">/var/www/sorted/$i.txt") || die $!;
    print FILE "$url\n";
    print FILE join("\n", @keys);
    close(FILE);
    $string = "";
    %count = ();
}
But this doesn't remove the words. What's wrong?

Thanks

Replies are listed 'Best First'.
Re: Removing common words
by Zaxo (Archbishop) on Apr 04, 2004 at 05:54 UTC

    You don't do anything with the values of %banned, and you don't necessarily do lookups on the keys, so you might as well make it @banned. It is static, so you need only define it once, outside the loop.

    Try this on for size,

    my @banned = qw(
        at be for and to of in the as it are is am on an you me
        a b c d e f g h i j k l m n o p q r s t u v w x y z
        0 1 2 3 4 5 6 7 8 9 10
    );
    for (1 .. 20000) {
        open local(*IN), "/var/www/data/$_.txt" or next;
        my $url = <IN>;
        my %count;
        while (<IN>) {
            $count{lc $_}++ for split /\W+/;
        }
        close IN or die $!;
        delete @count{@banned};
        print # all the output stuff
    }
    You are spinning your wheels a little with some of your output code, but ditching temporary variables and using slices of anonymous arrays will fix that. You can replace reverse sort {foo($a) <=> foo($b)} with sort {foo($b) <=> foo($a)}.
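    For instance, the @keys shuffle could collapse into a single list slice (a sketch, assuming you still want the top 50; the grep guards against files with fewer than 50 distinct words):

    my @keys = ( sort { $count{$b} <=> $count{$a} } keys %count )[0 .. 49];
    print FILE join( "\n", grep defined, @keys );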

    After Compline,
    Zaxo

Re: Removing common words
by davido (Cardinal) on Apr 04, 2004 at 06:50 UTC
    One thing that immediately jumped out at me is the question of why you're iterating over the keys of the hash just to determine whether each individual hash key matches the word you're comparing from your input file. One of the express reasons for using a hash is that you don't have to iterate over every element to find one.
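    A single exists test against the hash does the whole job in one probe; something like this sketch, using the variable names from your original code:

    $count{$word}++ unless exists $banned{$word};   # one hash lookup, no loop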

    If my task description were, "Read a text file, compare it to a banned word list, and then rewrite the text file, minus the banned words," I would probably do it something like this. Note: this isn't a cut-and-paste solution for you; it's just an example from which you can hopefully derive a solution tailored to your needs...

    use strict;      # Please do this!
    use warnings;    # And this!

    # First, read from the <DATA> filehandle to grab the list of
    # banned words. map the banned words into hash-keys.
    # NB: <DATA> is read in scalar context here, so the whole
    # banned-word list must stay on a single line after __DATA__.
    my %banned = map { chomp $_; $_ => undef } split /\s+/, <DATA>;

    my $infile  = "textfile.txt";
    my $tmpfile = "outfile.tmp";

    open my $in,  "<$infile"  or die "Cannot open $infile.\n$!";
    open my $out, ">$tmpfile" or die "Cannot open temp output file.\n$!";

    while ( my $inline = <$in> ) {
        chomp $inline;
        my $printline = "";
        foreach my $word ( split /\s+/, $inline ) {
            next if exists $banned{ lc $word };
            $printline .= "$word ";   # trailing space keeps the kept words separated
        }
        print $out "$printline\n";
    }

    close $out;
    close $in;

    rename $tmpfile, $infile;

    __DATA__
    a at be for and to of in the as i it are is am on an you me b c d e f g h j k l m n o p q r s t u v w x y z 0 1 2 3 4 5 6 7 8 9 10

    I hope this helps... good luck.


    Dave

      From the list of banned words, it seems that you're banning all one-letter and two-letter words and a handful of three-letter words. Why not just discard/skip ALL one-letter and two-letter words?
        That would be a good optimization if the list of "banned" words were fixed and immutable. Then program logic could deal with all one-letter and two-letter words, as well as single-digit numbers. But I stuck with the philosophy of explicitly naming the "banned" words so that the list could be maintained without diving into the program's logic. I was also thinking of the possibility that there could be a banned-word file, rather than using the __DATA__ filehandle.

        Good point, though; if I hadn't been designing with maintainability and flexibility of the word list in mind, I would completely agree that there is a more efficient way to block all one-letter and two-letter words.
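        For what it's worth, that length-based test would be a one-line guard inside the word loop (a sketch; length <= 2 also happens to catch the single digits and "10"):

        next if length $word <= 2;   # drop all one- and two-letter words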


        Dave

Re: Removing common words
by EdwardG (Vicar) on Apr 04, 2004 at 20:28 UTC
    But this doesn't remove the words. What's wrong?

    Two things.

    First, look closely at the line @words = split(/\W/, $_); inside your while(<IN>) loop.

    You are not adding to @words but overwriting it for each line of <IN>.

    This is easy to fix, just use push like this:

    push @words, split(/\W/, $_);

    Second, look at your loop labelled OUTER.

    You loop through the list of banned words, and for each banned word you check to see if it is equal to the candidate word. If not, you accept it into %count.

    This means that even though a word might be on the banned list, you will accept it into %count many times before finally "rejecting" it.

    The fix is simple. Replace the whole OUTER loop with

    $count{$word}++ unless exists $banned{lc $word};
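    Put together, the middle of your program shrinks to something like this (a sketch that keeps your empty-string guard, since split /\W/ can produce empty fields):

    while (<IN>) {
        chomp;
        push @words, split(/\W/, $_);
    }
    close(IN);
    foreach $word (@words) {
        $word = lc($word);
        $count{$word}++ unless $word eq "" or exists $banned{$word};
    }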

Re: Removing common words
by Anonymous Monk on Apr 04, 2004 at 05:22 UTC

    Here's somewhere to start. Completely untested code; I just threw this together in the reply textarea :). It may not work at all, or it may have glitches. Give it a shot if you will, and change whatever you wish. I'd say that the definition of a "word" your regex uses could use some fixing.

    my @banned = qw( a at be for and to of in the as i it are is am on an you me );
    push( @banned, 0..10, 'a'..'z' );
    my %banned;
    @banned{@banned} = undef;

    for my $file (0 .. 20000) {
        open( my $in, '<', '/var/www/data/' . $file . '.txt' )
            or die "open failed: $!";
        open( my $out, '>', '/var/www/sorted/' . $file . '.txt' )
            or die "open failed: $!";
        my %seen;
        while ( chomp( my $line = <$in> ) ) {
            for my $word ( split( /\W/, $line ) ) {
                $word = lc( $word );
                ++$seen{$word} if ( exists( $banned{$word} ) );
            }
        }
        my @frequent = ( sort { $seen{$b} <=> $seen{$a} } keys %seen )[0..50];
        print $out join("\n", @frequent);
        close( $in ) or die "close failed: $!";
        close( $out ) or die "close failed: $!";
    }

      Also, if any file has the potential of containing fewer than 50 unique words not in @banned, you will need to adjust the 0..50 slice to avoid undefined entries. Grepping for defined values will do it.
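      Something like this, for instance (keeping the 0..50 range from the code above):

      my @frequent = grep { defined } ( sort { $seen{$b} <=> $seen{$a} } keys %seen )[0..50];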

      *Exasperated sigh*. I had to try it out. It seems I got it all right, save for the most important part. The one line should obviously read ++$seen{$word} unless ( exists( $banned{$word} ) );. Note the 'unless', not 'if'.

Re: Removing common words
by halley (Prior) on Apr 05, 2004 at 15:30 UTC
    As a side note, the Moby Lexicon Project is a good resource for most-common words. It includes lists of tens of thousands of English words, and the top thousand words from a couple of different sources.

    --
    [ e d @ h a l l e y . c c ]