Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, I'm trying to remove a set of words from a string. The problem is that it doesn't remove the words if they are mixed case, i.e. 'the' is in the stopword file, but 'The' in $title is not removed.
    read_stopwordfile();
    foreach (@words) {
        #print "$_<br>\n";
        $title =~ s/([%\d])$_([%\d])/$1$2/gsi;
    }
Any ideas?

Replies are listed 'Best First'.
Re: removing words
by GrandFather (Saint) on Jan 03, 2007 at 04:13 UTC

    Assuming that your "words" are fairly conventional it may be that

    $title =~ s/\b$_\b//gsi;

    is what you are looking for. \b is a zero-width assertion that matches at word boundaries. See perlre.
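A minimal sketch of the whole loop with the \b fix applied. The sample @words and $title are made up for illustration, since the original stop-word file and title were not posted. The quotemeta call and the whitespace tidy-up are additions, not part of the suggestion above.

```perl
use strict;
use warnings;

# Hypothetical sample data; the real stop-word file is not shown in the post.
my @words = qw(the of a);
my $title = 'The Return of the King: A Tale';

foreach my $word (@words) {
    my $quoted = quotemeta $word;    # escape any regex metacharacters
    $title =~ s/\b$quoted\b//gi;     # /i lets 'the' match 'The' as well
}

# Deleting words leaves stray spaces behind, so collapse and trim them.
$title =~ s/\s+/ /g;
$title =~ s/^\s+|\s+$//g;

print "$title\n";    # prints "Return King: Tale"
```

Note that \b only helps when the words are delimited by non-word characters such as spaces or punctuation; digits count as word characters, so "4the5" would not be touched.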


    DWIM is Perl's answer to Gödel

      If this solution fits your problem but you run into performance issues (which may happen because you make one pass over your data for every word in your list), you may want to try Regexp::Assemble. It automatically builds a single regexp that matches all of your words, and the result may well be better optimized than one you would write by hand.

      use Regexp::Assemble;

      my $ra = Regexp::Assemble->new;
      $ra->add(@words);
      my $re = $ra->re;

      s/$re//gsi;   # this will replace
                    #     foreach (@words) { ... s/// ... }
                    # and do the word deletion in one go

      The above assumes your words don't need quoting to become regexes.

      The add_file method may also be handy: it reads your stopwords file and assembles the regex in a single step.
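If installing a CPAN module is not an option, the same one-pass idea can be sketched with core Perl only. This swaps Regexp::Assemble for a plain join-built alternation, so it gets none of that module's optimizations. The word list and title are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical stop words; in practice these would come from the stop-word file.
my @words = qw(the on at);
my $title = 'The cat sat on the mat at noon';

# Quote each word so that it is matched literally, then build one alternation.
my $alternation = join '|', map { quotemeta($_) } @words;
my $re = qr/\b(?:$alternation)\b/i;

# One pass over the title removes every stop word.
(my $cleaned = $title) =~ s/$re//g;

# Collapse the gaps left behind by the removed words.
$cleaned =~ s/\s+/ /g;
$cleaned =~ s/^\s+|\s+$//g;

print "$cleaned\n";    # prints "cat sat mat noon"
```

The \b anchors keep "at" from being cut out of "cat" and "sat", and "on" out of "noon".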

Re: removing words
by quester (Vicar) on Jan 03, 2007 at 04:17 UTC
    As a guess, is the word "The" at the beginning of a line?

    Your pattern only removes words that have a digit or a percent sign on both sides.

    You didn't include a sample of your data, but for anything resembling normal English text the "word boundary" would work better:

    $title =~ s/\b$_\b//gsi;
    If the data really has words separated by digits or percent signs or the beginning/end of a line you might try this:
    $title =~ s/(^|[%\d])$_([%\d]|$)/$1$2/gsi;
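To see the difference between the pattern from the question and the anchored variant, here is a small sketch run against a made-up title in which '%' and digits act as separators (the real data was not posted):

```perl
use strict;
use warnings;

my $word = 'the';

# Hypothetical title: the stop word appears at the start, in the middle
# and at the end of the string.
my $original = 'The%cat%the9hat%THE';

# Pattern from the question: needs a '%' or digit on BOTH sides.
(my $unanchored = $original) =~ s/([%\d])$word([%\d])/$1$2/gi;

# Anchored variant: string start/end also count as delimiters.
(my $anchored = $original) =~ s/(^|[%\d])$word([%\d]|$)/$1$2/gi;

print "unanchored: $unanchored\n";    # unanchored: The%cat%9hat%THE
print "anchored  : $anchored\n";      # anchored  : %cat%9hat%
```

Only the anchored form removes the occurrences at the very start and very end of the string.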
Re: removing words
by muba (Priest) on Jan 03, 2007 at 04:07 UTC

    Any ideas?

    Certainly. Look. s/([%\d])$_([%\d])/. Let's break that down.

    s/            # begin substitution
      ([%\d])     # A captured character class for
                  # a literal "%" and then \d
                  # -- does that even work?
                  # I would write it as [%0-9], but ok,
                  # that's so beside the point.
      $_          # Ok, so we match $_ *AFTER* we matched
                  # that character class
      ([%\d])     # And there's that cutie again

    Hey wait... that's funny! Is there even any [%\d] before your "the" word? Or after it?

    To be more precise, what is the value of $title and does $title even match with the first part of your s/// expression?
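For what it's worth, \d is valid inside a bracketed class: [%\d] matches a single character that is either a percent sign or a digit. A quick sketch, using made-up titles, of which strings the pattern from the question can match at all:

```perl
use strict;
use warnings;

my $word = 'the';

# Hypothetical titles: only those with '%' or a digit on BOTH sides
# of the word can match the pattern from the question.
my @titles = ('The cat', '%The% cat', '4the5 cat', '%the cat');

my %matched;
for my $title (@titles) {
    $matched{$title} = $title =~ /([%\d])$word([%\d])/i ? 1 : 0;
    printf "%-12s %s\n", "'$title'", $matched{$title} ? 'matches' : 'no match';
}
```

With the /i flag in place, case is not what stops 'The cat' from matching; the missing delimiters on either side are.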

Re: removing words
by johngg (Canon) on Jan 03, 2007 at 10:57 UTC
    Rather than using captures and substituting with $1$2 I would probably use look-behind and look-ahead assertions so that I can substitute just my stop word with nothing.

    use strict;
    use warnings;

    my @stopWords = qw{the on};
    my $title = q{The% cat sat 4on5 the %tHe% hat %On};
    print qq{Title : $title\n};

    foreach my $stopWord (@stopWords) {
        print qq{Weeding : $stopWord\n};

        # Do substitution using extended regular
        # expression syntax to allow comments in
        # the pattern.
        #
        $title =~
            s{(?x)                 # Substitute ...
                (?:                # Alternation for look-behind
                    (?<=\A)        # If preceded by string start
                    |              # or
                    (?<=[%0-9])    # percent or digit
                )                  # Close alternation
                (?i:$stopWord)     # ... ignoring case, stop word ...
                (?=[%0-9]|\z)      # If followed by percent or
                                   # digit, or string end
             }{}g;                 # ... with nothing globally

        print qq{Title : $title\n};
    }

    The output is

    Title : The% cat sat 4on5 the %tHe% hat %On
    Weeding : the
    Title : % cat sat 4on5 the %% hat %On
    Weeding : on
    Title : % cat sat 45 the %% hat %
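One practical difference between the two approaches shows up when two stop words share a delimiter. The sketch below uses a made-up input to compare a capturing pattern, which consumes the delimiters it matches, with the look-around form, which only inspects them:

```perl
use strict;
use warnings;

my $word = 'the';

# Hypothetical input: two stop words share the '%' between them.
my $input = '%the%The%';

# Capturing version: the shared '%' is consumed by the first match,
# so the second word has no leading delimiter left to match against.
(my $captured = $input) =~ s/(^|[%\d])$word([%\d]|$)/$1$2/gi;

# Look-around version: delimiters are checked but never consumed.
(my $lookaround = $input)
    =~ s/(?:(?<=\A)|(?<=[%0-9]))$word(?=[%0-9]|\z)//gi;

print "captured  : $captured\n";      # captured  : %%The%
print "lookaround: $lookaround\n";    # lookaround: %%%
```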

    I hope this is of use.

    Cheers,

    JohnGG