stew has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to remove certain words from a document. The document is stored in a database and the list of offending words is stored in a text file.

I have managed to open the words file and get them into an array and also get the document assigned to a variable.

What I'd like to do next (unless there is a better way of doing it!) is loop thru the list of bad words and replace them with '' (squat).. thus

    $data = "DOCUMENT EXTRACTED FROM DB"; # etc
    open(WORDS, "commonwords.txt") || die ("Could not open common words file");
    @stop_words = <WORDS>;
    close(WORDS);
    $_ = $data;
    foreach $stopword (@stop_words) {
        $data =~ ### DO SUBSTITUTION HERE
    }


I'm going to have to remove some punctuation such as ,. etc and HTML tags too before putting every word in the document into an index.

- any advice greatly appreciated.

Stew, novice

Re: Using a variable as pattern in substitution
by VSarkiss (Monsignor) on Jun 05, 2002 at 15:05 UTC

    You could do it with the loop, as you've hinted:

    foreach my $stopword (@stop_words) {
        $data =~ s/\b$stopword\b//g;
    }

    Another possibility is to combine the words into a single regex, and apply it once (note, this is untested):
    my $stop_pat = '\b(' . join('|', @stop_words) . ')\b';
    $data =~ s/$stop_pat//g;

    Whether that will be faster depends a lot on the number of words, their lengths, and the length of the input string. Give each a try. You may also want to see if using the study function improves things any.
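
A minimal sketch of both approaches side by side, for anyone who wants to try them. The sample sentence and the three stop words are made up for illustration; they stand in for the DB document and the contents of the words file. The chomp, the quotemeta call (so that a stop word containing regex metacharacters is taken literally) and the whitespace cleanup at the end are additions of mine, not part of the snippets above.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the DB document and the word file.
my $data       = "the cat and the other cat sat on another mat";
my @stop_words = ("the\n", "and\n", "on\n");   # as read from a file, newlines included

chomp @stop_words;                              # strip the trailing newlines

# Approach 1: one substitution per stop word.
my $looped = $data;
for my $stopword (@stop_words) {
    $looped =~ s/\b\Q$stopword\E\b//g;          # \Q...\E quotes metacharacters
}

# Approach 2: a single alternation, compiled once with qr//.
my $alternation = join '|', map { quotemeta($_) } @stop_words;
my $stop_pat    = qr/\b(?:$alternation)\b/;
(my $combined = $data) =~ s/$stop_pat//g;

# Collapse the runs of spaces left behind by the removals.
for ($looped, $combined) {
    s/\s+/ /g;
    s/^\s+|\s+$//g;
}

print "$looped\n";
print "$combined\n";
```

Both variables should end up holding the same string, with "other" and "another" left intact because of the \b anchors. Which approach is faster still depends on your data, so time them on the real thing.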

    HTH

      Wouldn't it be easier or quicker to do something along the lines of:

      @array2 = grep {$_ =~ s/\b$stopword\b//g } @array1;

      or does grep compare favorably to a foreach as far as efficiency?

      Some people fall from grace. I prefer a running start...

      UPDATE:

      After thinking on this a bit, I think this solution would work for the problem. Fellow monks, please point out my folly if I'm wrong with this.

      #!/usr/bin/perl
      use strict;

      my @list = qw( red blue orange yellow black brown green );
      print join( ' ', @list );
      print "\n";
      @list = grep { m/e/io } @list;
      print join( ' ', @list );
      print "\n";

      I realize my regex isn't what the original poster was trying to do, but the idea is the same I believe.

      This is what I was trying to say in response earlier but I seem to have had a brain fart this morning.

        Wouldn't it be easier or quicker to ...
        Maybe or maybe not, but it would also not do what the original is asking for. This is looping over the input data, not over the patterns. It's also generating a second array instead of doing it in-place. Regardless of speed, that's guaranteed to need more memory.
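
To make the difference concrete, here is a small sketch of what the grep version actually does. The three strings and the single stop word are hypothetical. The point is that grep keeps only the elements where s/// made a replacement, and because $_ is an alias inside the block, the source array gets edited as well.

```perl
use strict;
use warnings;

my @array1   = ("the cat", "a dog", "the end");   # hypothetical input lines
my $stopword = "the";                             # hypothetical single stop word

# grep keeps the elements for which the block returns true, i.e. those
# where s///g made at least one replacement. $_ is an alias to the
# element, so the substitution also edits @array1 itself.
my @array2 = grep { s/\b\Q$stopword\E\b//g } @array1;

print scalar(@array2), " of ", scalar(@array1), " elements kept\n";
print "[$_]\n" for @array1;
```

So "a dog" is dropped from @array2 simply because it contained no stop word, which is probably not what anyone wants when cleaning a document.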

      Thanks for that. One thing: when I first tried it, it didn't work. My stop words were in a text file, one per line, so I had to

       chop $stopwords

      to get it to work.

      One more thing: what does the \b mean?

      Stew

        You can chomp every element in the array when you slurp the file: chomp(@stopwords = <FILE>); (As a side note, in general, chomp is preferable to chop.)

        The \b says to look for a word boundary. In other words, if your target string is part of another word, the \b will keep it from matching.

        my $target = "another";
        $target =~ /other/;    # this matches
        $target =~ /\bother/;  # this doesn't

        You can read more about it in the perlre document.

        Avoid chop unless you want to remove the last character from every line of your file. Are you sure the last one ends with a newline? If not, oops.. there goes a character of data. chomp is not so indiscriminate.
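
A quick sketch of that failure mode, assuming a hypothetical word list whose last line has no trailing newline:

```perl
use strict;
use warnings;

# Hypothetical file contents: the last line has no trailing newline.
my @lines = ("the\n", "and\n", "on");

my @chopped = @lines;
chop @chopped;      # removes the last character of every element, whatever it is

my @chomped = @lines;
chomp @chomped;     # removes a trailing newline only where one is present

print "chop:  @chopped\n";
print "chomp: @chomped\n";
```

With chop the last word loses its final letter; with chomp it comes through whole.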

        Makeshifts last the longest.