removing words

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: removing words by GrandFather (Saint) on Jan 03, 2007 at 04:13 UTC
Assuming that your "words" are fairly conventional, it may be that `$title =~ s/\b$_\b//gsi;` is what you are looking for. `\b` is a zero-width assertion that matches at word boundaries. See perlre.

DWIM is Perl's answer to Gödel
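A minimal sketch of the word-boundary approach, for illustration only. The helper name `remove_stop_words` and the sample title are assumptions, not taken from the original question, and `\Q...\E` is added here to quote any regex metacharacters in a word.

```perl
use strict;
use warnings;

# Hypothetical helper: strip each stop word from a title,
# matching whole words only and ignoring case.
sub remove_stop_words {
    my ($title, @stop_words) = @_;
    for my $word (@stop_words) {
        # \b on each side keeps "the" from matching inside "other";
        # \Q...\E quotes any regex metacharacters in the word.
        $title =~ s/\b\Q$word\E\b//gi;
    }
    return $title;
}

print remove_stop_words('The other cat sat on the mat', qw(the on)), "\n";
```

Note that only the words are removed; the whitespace around them stays, so you may want a follow-up pass to squeeze repeated spaces.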
Re^2: removing words by ferreira (Chaplain) on Jan 03, 2007 at 10:27 UTC
And if this solution is appropriate for your problem but you're suffering from performance issues (which may show up because you make one pass over your data per word), you may want to give Regexp::Assemble a try. `Regexp::Assemble` can automatically build a single regexp that matches all your words, without hassle and possibly better optimized than one you wrote by hand.

    use Regexp::Assemble;

    my $ra = Regexp::Assemble->new;
    $ra->add(@words);
    my $re = $ra->re;

    s/$re//gsi; # this will replace
                #     foreach (@words) { ... s/// ... }
                # and do the word deletion in one go

The above assumes your words don't need quoting to become regexes. The `add_file` method may also be handy: it reads your stopwords file and assembles the regex in a single action.
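If installing a CPAN module is not an option, the same one-pass idea can be sketched with core Perl only. This is a plain alternation built with `join` and `quotemeta`, not what `Regexp::Assemble` produces, and it is not optimized the way that module's output may be. The helper name `build_stop_word_regex` and the sample data are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: build one case-insensitive regex that matches
# any of the given stop words as a whole word.
sub build_stop_word_regex {
    my @stop_words = @_;
    # quotemeta() escapes regex metacharacters in each word.
    # Longest words go first so a longer word is tried before its prefix.
    my $alternation = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) } @stop_words;
    return qr/\b(?:$alternation)\b/i;
}

my $re = build_stop_word_regex(qw(the on a));

# One substitution pass instead of one pass per stop word.
(my $title = 'The cat sat on a mat') =~ s/$re//g;
print "$title\n";
```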
Re: removing words by quester (Vicar) on Jan 03, 2007 at 04:17 UTC
As a guess, is the word "The" at the beginning of a line? Your pattern only removes words that have a digit or a percent sign on both sides. You didn't include a sample of your data, but for anything resembling normal English text the "word boundary" assertion would work better: `$title =~ s/\b$_\b//gsi;` If the data really has words separated by digits or percent signs or the beginning/end of a line, you might try this: `$title =~ s/(^|[%\d])$_([%\d]|$)/$1$2/gsi;`
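To make the difference concrete, here is a small sketch comparing the two patterns on a title that starts with a stop word. The helper names and the sample string are assumptions for illustration; `\Q...\E` is added to quote the word.

```perl
use strict;
use warnings;

# Requires a percent sign or digit on BOTH sides of the word.
sub strip_both_sides {
    my ($title, $word) = @_;
    $title =~ s/([%\d])\Q$word\E([%\d])/$1$2/gi;
    return $title;
}

# Also accepts start of string on the left and end of string on the right.
sub strip_with_anchors {
    my ($title, $word) = @_;
    $title =~ s/(^|[%\d])\Q$word\E([%\d]|$)/$1$2/gi;
    return $title;
}

print strip_both_sides('The%cat', 'the'), "\n";    # prints "The%cat" (nothing precedes "The")
print strip_with_anchors('The%cat', 'the'), "\n";  # prints "%cat"
```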
Re: removing words by muba (Priest) on Jan 03, 2007 at 04:07 UTC
Any ideas? Certainly. Look. `s/([%\d])$_([%\d])/`. Let's break that down.

    s/          # begin substitution
      ([%\d])   # A captured character class for
                # a literal "%" and then \d
                # -- does that even work?
                # I would write it as [%0-9], but ok,
                # that's so beside the point.
      $_        # Ok, so we match $_ AFTER we matched
                # that character class
      ([%\d])   # And there's that cutie again

Hey, wait... that's funny! Is there even any `[%\d]` before your "the" word? Or after it? To be more precise, what is the value of `$title`, and does `$title` even match the first part of your `s///` expression?
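On the side question of whether `\d` works inside a character class: it does. A quick check, with sample characters chosen here purely for illustration, shows `[%\d]` and `[%0-9]` agreeing on ASCII input (on Unicode strings `\d` can also match non-ASCII digits unless the `/a` modifier is used).

```perl
use strict;
use warnings;

# Compare the shorthand class with the explicit range, one character at a time.
for my $char ('%', '7', 'x') {
    my $shorthand = ($char =~ /^[%\d]$/)  ? 'match' : 'no match';
    my $explicit  = ($char =~ /^[%0-9]$/) ? 'match' : 'no match';
    printf "%-3s shorthand: %-8s explicit: %s\n", $char, $shorthand, $explicit;
}
```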
Re: removing words by johngg (Canon) on Jan 03, 2007 at 10:57 UTC
Rather than using captures and substituting with `$1$2`, I would probably use look-behind and look-ahead assertions so that I can substitute just my stop word with nothing.

    use strict;
    use warnings;

    my @stopWords = qw{the on};
    my $title = q{The% cat sat 4on5 the %tHe% hat %On};
    print qq{Title : $title\n};

    foreach my $stopWord (@stopWords)
    {
        print qq{Weeding : $stopWord\n};

        # Do substitution using extended regular
        # expression syntax to allow comments in
        # the pattern.
        #
        $title =~
            s{(?x)               # Substitute ...
                (?:              # Alternation for look-behind
                    (?<=\A)      # If preceded by string start
                    |            # or
                    (?<=[%0-9])  # percent or digit
                )                # Close alternation
                (?i:$stopWord)   # ... ignoring case, stop word ...
                (?=[%0-9]|\z)    # If followed by percent or
                                 # digit, or string end
             }{}g;               # ... with nothing globally
        print qq{Title : $title\n};
    }

The output is

    Title : The% cat sat 4on5 the %tHe% hat %On
    Weeding : the
    Title : % cat sat 4on5 the %% hat %On
    Weeding : on
    Title : % cat sat 45 the %% hat %

I hope this is of use.

Cheers,

JohnGG
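One practical difference between the two styles, sketched below with hypothetical helper names and a made-up input: a capture consumes the delimiter, so when two stop words share a delimiter the second one is skipped in the same pass, whereas a look-around leaves the delimiter in place for the next match.

```perl
use strict;
use warnings;

# Capture style: the delimiters are part of the match and are put back.
sub strip_with_captures {
    my ($title, $word) = @_;
    $title =~ s/([%\d])\Q$word\E([%\d])/$1$2/gi;
    return $title;
}

# Look-around style: the delimiters are only asserted, never consumed.
sub strip_with_lookaround {
    my ($title, $word) = @_;
    $title =~ s/(?<=[%\d])\Q$word\E(?=[%\d])//gi;
    return $title;
}

print strip_with_captures('%the%the%', 'the'), "\n";    # prints "%%the%"
print strip_with_lookaround('%the%the%', 'the'), "\n";  # prints "%%%"
```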