in reply to Removing common words

One thing that immediately jumped out at me is the question of why you're iterating over the keys of the hash just to determine whether any of them matches the word you're comparing from your input file. One of the main reasons for using a hash is that you don't have to iterate over every element to find one.
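
For example, here is a rough sketch of the difference, assuming a %banned hash keyed on lowercased banned words and a $word taken from the current input line:

    # Walking every key just to test membership: a linear scan
    # that defeats the purpose of the hash.
    my $slow_hit = 0;
    foreach my $key ( keys %banned ) {
        $slow_hit = 1 if $key eq $word;
    }

    # Letting the hash do the work: a single exists() lookup.
    my $fast_hit = exists $banned{ lc $word };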

If my task description were, "Read a text file, compare it to a banned word list, and then rewrite the text file, minus the banned words," I would probably do it something like this. ...note, this isn't a cut-and-paste solution for you, it's just an example from which you can hopefully derive a solution tailored to your needs...

use strict;     # Please do this!
use warnings;   # And this!

# First, read from the <DATA> filehandle to grab the list of banned
# words, and map the banned words into hash keys.
my %banned = map { chomp $_; $_ => undef } split /\s+/, <DATA>;

my $infile  = "textfile.txt";
my $tmpfile = "outfile.tmp";

open my $in,  "<$infile"  or die "Cannot open $infile.\n$!";
open my $out, ">$tmpfile" or die "Cannot open temp output file.\n$!";

while ( my $inline = <$in> ) {
    chomp $inline;
    my $printline = "";
    foreach my $word ( split /\s+/, $inline ) {
        next if exists $banned{ lc $word };   # drop banned words
        $printline .= "$word ";               # keep everything else
    }
    print $out "$printline\n";
}

close $out;
close $in;

rename $tmpfile, $infile;

__DATA__
a at be for and to of in the as i it are is am on an you me b c d e f g h j k l m n o p q r s t u v w x y z 0 1 2 3 4 5 6 7 8 9 10
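
Note that the filtered text goes to a temporary file first and only replaces the original via the rename at the end, so the input file isn't clobbered until the new copy has been written in full.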

I hope this helps... good luck.


Dave

Re: Re: Removing common words
by aquarium (Curate) on Apr 04, 2004 at 08:25 UTC
    From the list of banned words it seems that you're banning all one-letter and two-letter words and a handful of three-letter words. Why not just discard/skip ALL one-letter and two-letter words?
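
    If the list really were just "everything short," a simple length test inside the word loop could replace the hash lookup. A sketch only (it wouldn't catch the three-letter words or "10" in the original list):

        next if length $word <= 2;   # skip all one- and two-letter words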
      That would be a good optimization if the list of "banned" words were fixed and immutable. Then the program logic could deal with all one-letter and two-letter words, as well as single-digit numbers. But I stuck with the philosophy of explicitly naming the "banned" words so that the list could be maintained without diving into the program's logic. I was also thinking of the possibility that there could be a banned-word file, rather than using the __DATA__ filehandle.
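
      For instance, something like this hypothetical sketch (assuming the list lives in a file called banned.txt, with one or more whitespace-separated words per line) could replace the <DATA> line without touching the rest of the program:

          my $wordfile = "banned.txt";
          open my $bw, "<$wordfile" or die "Cannot open $wordfile.\n$!";
          my %banned;
          while ( my $line = <$bw> ) {
              # every whitespace-separated word on the line becomes a key
              $banned{ lc $_ } = undef for split /\s+/, $line;
          }
          close $bw;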

      Good point, though; if I hadn't been designing with maintainability and flexibility of the word list in mind, I would completely agree that there is a more efficient way to block all one-letter and two-letter words.


      Dave