IB2017 has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks

I need some help in adapting the following Regex:

my ($rx) = map qr/(?:$_)/, join "|", map qr/\b\Q$_\E\b/, @stopwords;

which I use to remove from a string all stopwords contained in @stopwords by means of $string =~ s/$rx//g; This works fine except when a word contains a hyphen and one part of the word happens to be a stopword. For example, the French word "sous-alimentation" loses "sous" (because "sous" is in my @stopwords), even though it should keep it as part of a terminological unit. Any idea how I can avoid this? (I thought of using whitespace in my $rx, but I guess that could cause problems at the beginning and end of the string.) Thank you.
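The behaviour described above follows from how \b works: a hyphen is a non-word character, so \b matches between the final "s" of "sous" and the "-". A minimal sketch that reproduces the problem, assuming a one-word stopword list; the helper name remove_stopwords is made up for this illustration:

```perl
use strict;
use warnings;

# Hypothetical stopword list, for illustration only.
my @stopwords = qw/sous/;

# The regex from the question, unchanged.
my ($rx) = map qr/(?:$_)/, join "|", map qr/\b\Q$_\E\b/, @stopwords;

# remove_stopwords is a helper name invented for this sketch.
sub remove_stopwords {
    my ($string) = @_;
    $string =~ s/$rx//g;
    return $string;
}

# \b matches between "s" and "-", so the hyphenated word is damaged.
print remove_stopwords('la sous-alimentation'), "\n";   # prints "la -alimentation"
```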

Re: Improving regular expression to remove stopwords
by hippo (Archbishop) on Jan 09, 2019 at 12:01 UTC

    A negative look-ahead would meet that particular requirement. SSCCE:

    use strict;
    use warnings;
    use Test::More;

    my @stopwords = qw/foo sous bar/;
    my ($rx) = map qr/(?:$_)/, join "|", map qr/\b\Q$_\E\b(?!-)/, @stopwords;

    my @stop = (
        'foo is good',
        'so is sous'
    );
    my @go = (
        'sous-alimentation',
    );

    plan tests => @stop + @go;

    for my $str (@stop) {
        like ($str, $rx, "$str matched");
    }
    for my $str (@go) {
        unlike ($str, $rx, "$str not matched");
    }

      Fantastic... and so is the way of testing it, which is new to me.

Re: Improving regular expression to remove stopwords
by Veltro (Hermit) on Jan 09, 2019 at 12:06 UTC

    Hi IB2017

    You may want to look into a so-called 'lookahead' or 'lookbehind'. In the following example I use a 'Negative Lookahead', (?!\-), so that a stopword only matches when it is not immediately followed by a hyphen.

    use strict ;
    use warnings ;

    my @stopwords = qw{ sous } ;
    my $string = "sous sous-alimentation" ;
    my ($rx) = map qr/(?:$_)/, join "|", map qr/\b\Q$_\E\b(?!\-)/, @stopwords;
    $string =~ s/$rx//g ;
    print $string ;

    __END__
    sous-alimentation

    Veltro

    edit: Link: Extended Patterns
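
The negative lookahead only protects a stopword that comes before a hyphen. If a stopword can also follow a hyphen (for example "vous" in "rendez-vous"), the lookbehind mentioned above can be added as well. The following is a minimal sketch, not a definitive solution; the stopword list and the helper name strip_stopwords are assumptions made for this illustration:

```perl
use strict;
use warnings;

# Hypothetical stopword list, for illustration only.
my @stopwords = qw/sous vous/;

# Build one alternation; the lookarounds wrap the whole group, so a
# stopword is skipped when a hyphen touches it on either side.
my $alternation = join '|', map { quotemeta } @stopwords;
my $rx = qr/(?<!-)\b(?:$alternation)\b(?!-)/;

# strip_stopwords is a helper name invented for this sketch.
sub strip_stopwords {
    my ($string) = @_;
    $string =~ s/$rx//g;
    return $string;
}

print strip_stopwords('sous sous-alimentation'), "\n";  # prints " sous-alimentation"
print strip_stopwords('vous avez rendez-vous'), "\n";   # prints " avez rendez-vous"
```

Note that removing a stopword leaves the surrounding whitespace in place, hence the leading space in both results.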

Re: Improving regular expression to remove stopwords
by AnomalousMonk (Archbishop) on Jan 09, 2019 at 18:39 UTC