lobs has asked for the wisdom of the Perl Monks concerning the following question:

So I am trying to remove an article (e.g. a|an|the) from the string I extract from a document. I tried two methods and neither works. I would like some help. Here are both attempts:
$currentSentence =~ s/[\ba\b|\ban\b|\bthe\b]//g;
if ($currentSentence =~ /([\S]+) ([\S]+) <head>lines?<\/head>/) {
    if ($2 =~ /[\ba\b|\ban\b|\bthe\b]/) {
        print "$2\n";
        print $DST "$1";
    }
    else {
        print $DST "$2";
    }
}

Replies are listed 'Best First'.
Re: Regular Expression, substitution
by atcroft (Abbot) on Apr 06, 2016 at 01:44 UTC

    I think the problem in your first regex is the square brackets, which create a character class, where you need parentheses for grouping. When I tried a similar regex with parentheses, I got the expected results:

    $ perl -Mstrict -Mwarnings -le '
        my $str = q{This is a test of a regex to remove the words "a", "an", and "the" from a sentence.};
        print q{Before: }, $str;
        $str =~ s/\b(an?|the)\b//gimsx;
        print q{After: }, $str;
    '
    Before: This is a test of a regex to remove the words "a", "an", and "the" from a sentence.
    After: This is  test of  regex to remove  words "", "", and "" from  sentence.

    Hope that helps.

      Yes, that seems like it should work. I'm busy smashing new bugs, but I will try it out when I'm done with this new problem. Yes, it works perfectly fine. Thanks for understanding my poorly written example.
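
The second attempt in the question can be sketched along the same lines. This is only an illustration under assumptions: `word_before_head` and the sample sentences are hypothetical, and the intent (print the word before the `<head>` tag, skipping a single article) is inferred from the posted code. Two changes are made: the article test is an anchored alternation instead of a character class, and `$1`/`$2` are copied into lexicals first, because a later successful match replaces the capture variables, so `$1` would no longer hold the earlier word inside the inner block.

```perl
use strict;
use warnings;

# Hypothetical helper: return the word in front of <head>line(s)</head>,
# stepping back one more word if that word is an article.
sub word_before_head {
    my ($sentence) = @_;
    return unless $sentence =~ m{(\S+) (\S+) <head>lines?</head>};

    # Copy the captures now; the next successful match would reset them.
    my ($before, $word) = ($1, $2);

    # Anchored alternation: the whole word must be a, an, or the.
    return $word =~ /\A(?:a|an|the)\z/i ? $before : $word;
}

# Made-up sample sentences, not taken from the original document.
my $with_article    = word_before_head('He crossed the <head>line</head> first');
my $without_article = word_before_head('They drew straight <head>lines</head> here');

print "$with_article\n";      # crossed
print "$without_article\n";   # straight
```

Writing to `$DST` as in the question would then be `print $DST word_before_head($currentSentence);`, assuming the filehandle is already open.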
Re: Regular Expression, substitution
by AnomalousMonk (Archbishop) on Apr 06, 2016 at 04:50 UTC
    $currentSentence =~ s/[\ba\b|\ban\b|\bthe\b]//g;

    The main problem with the character class [\ba\b|\ban\b|\bthe\b] in the quoted substitution is that it's a character class. The \b resolves, I think, to a backspace control character (or maybe just a plain old 'b' character) and not a word boundary assertion as seems to be the intention. Likewise, | is just a plain old '|' character and not an alternation operator. So the character class finally becomes something like [\banthe|] depending on just what \b becomes.
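
A short sketch that checks this description and shows one possible fix. Inside a bracketed character class Perl treats `\b` as a backspace (`\x08`), not as the letter 'b'. The candidate list, the sample sentence, and the trailing `\s*` (which also removes the space after the article) are additions for illustration, not part of the thread.

```perl
use strict;
use warnings;

# Inside [...] \b is a backspace (\x08) and | is a literal pipe, so the
# class from the question is just a set of single characters.
my $class = qr/[\ba\b|\ban\b|\bthe\b]/;

my @candidates = ('a', 'n', 't', 'h', 'e', '|', "\x08", 'b', 'an');
my @hits = grep { /\A$class\z/ } @candidates;
printf "%d of %d candidates match the class as a whole\n",
    scalar @hits, scalar @candidates;

# Grouped alternation with real word-boundary assertions.
# The trailing \s* (an assumption, not from the thread) also removes the
# whitespace that followed the article.
my $sentence = 'The big dog rolled in an open field.';
(my $fixed = $sentence) =~ s/\b(?:a|an|the)\b\s*//gi;
print "$fixed\n";
```

Neither 'b' nor the two-character string 'an' matches the class as a whole, which is why the original substitution deletes individual letters instead of whole words.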


    Give a man a fish:  <%-{-{-{-<

Re: Regular Expression, substitution
by Marshall (Canon) on Apr 06, 2016 at 01:42 UTC
    It would be most helpful if you could provide: (a) a few example lines and (b) the expected output for those lines. Also, when you run code and say "it doesn't work", that is very nonspecific. You don't provide any data to run your code against. I have no idea what you are trying to accomplish.
Re: Regular Expression, substitution
by marinersk (Priest) on Apr 06, 2016 at 03:31 UTC

    Using your first example:

    #!/usr/bin/perl
    use strict;
    use warnings;
    $currentSentence =~ s/[\ba\b|\ban\b|\bthe\b]//g;
    exit;

    I get the following:

    P:\>rmv1.pl
    Global symbol "$currentSentence" requires explicit package name at P:\rmv1.pl line 4.
    Execution of P:\rmv1.pl aborted due to compilation errors.

    Then I fixed the error on line 4:

    #!/usr/bin/perl
    use strict;
    use warnings;
    my $currentSentence =~ s/[\ba\b|\ban\b|\bthe\b]//g;
    exit;

    Which yields the following:

    P:\>rmv2.pl
    Use of uninitialized value $currentSentence in substitution (s///) at P:\rmv2.pl line 4.

    So, to fix that, I added a value based on your loose description:

    #!/usr/bin/perl
    use strict;
    use warnings;
    my $currentSentence = "The big dog rolled in an open field filled with a type of grass.";
    $currentSentence =~ s/[\ba\b|\ban\b|\bthe\b]//g;
    exit;

    I get the following:

    P:\>rmv3.pl
    P:\>

    Now morbidly curious, I added a line to display the result:

    #!/usr/bin/perl
    use strict;
    use warnings;
    my $currentSentence = "The big dog rolled in an open field filled with a type of grass.";
    $currentSentence =~ s/[\ba\b|\ban\b|\bthe\b]//g;
    print "[$currentSentence]\n";
    exit;

    I get the following:

    P:\>rmv4.pl
    [T big dog rolld i  op fild filld wi  yp of grss.]
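
One way to read that output: the substitution deletes single characters, not words. As a rough check, and assuming `\b` inside the class stands for a backspace (`\x08`), the same result should come from a plain character-deleting `tr///d`. The variable names `$by_class` and `$by_tr` are made up for this sketch.

```perl
use strict;
use warnings;

my $s = "The big dog rolled in an open field filled with a type of grass.";

# The substitution from the question, applied to a copy of the sentence.
(my $by_class = $s) =~ s/[\ba\b|\ban\b|\bthe\b]//g;

# Delete the same set of single characters: backspace, '|', a, n, t, h, e.
(my $by_tr = $s) =~ tr/\x08|anthe//d;

print $by_class eq $by_tr ? "same\n" : "different\n";
print "[$by_tr]\n";
```

The capital 'T' survives because the class contains only the lowercase 't'.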

    I'm now going to ask the question:

    What's with all the \b action in your regular expression?

      \b matches a word boundary. It's a zero-width anchor. It works like qr/(?<=\w)(?=\W)|(?<=\W)(?=\w)/, which is likely to blow your mind unless you are comfortable with lookaround assertions.

      Premature optimization is the root of all job security

        Actually, it's trickier than that. qr/(?<=\w)(?=\W)|(?<=\W)(?=\w)/ requires a character before and after the assertion, whereas \b can match at the start and end of a string. A kind of double-negative is needed in an equivalent look-around:

        c:\@Work\Perl\monks>perl -wMstrict -le
        "my $s = 'xx-xx';
         ;;
         my $lbw = qr/(?<=\w)(?=\W)|(?<=\W)(?=\w)/;
         printf qq{$-[0] } while $s =~ m{ $lbw }xmsg;
         print qq{\n};
         ;;
         my $wb = qr{ (?<!\w)(?!\W) | (?<!\W)(?!\w) }xms;
         printf qq{$-[0] } while $s =~ m{ $wb }xmsg;
         print qq{\n};
         ;;
         printf qq{$-[0] } while $s =~ m{ \b }xmsg;
        "
        2 3

        0 2 3 5

        0 2 3 5

        Update: Now, \B is another story...
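
For what it's worth, a possible lookaround counterpart for \B can be sketched in the same style. This is a guess at what the update alludes to, not something stated in the thread: \B holds where both neighbours are word characters or neither is, with a missing neighbour at either end of the string counting as a non-word character. The `positions` helper and `$not_boundary` are hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical helper: offsets at which a zero-width pattern matches,
# joined with spaces.
sub positions {
    my ($s, $re) = @_;
    my @at;
    push @at, $-[0] while $s =~ /$re/g;
    return "@at";
}

# Both sides are word characters, or neither side is.
my $not_boundary = qr/(?<=\w)(?=\w)|(?<!\w)(?!\w)/;

my $s = 'xx-xx';
print positions($s, qr/\B/), "\n";          # 1 4
print positions($s, $not_boundary), "\n";   # 1 4
```

The negative lookarounds in the second branch are what let it match at the edges of a string such as '-', where there is no character on one side.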


        Give a man a fish:  <%-{-{-{-<

        Thanks! I was just reading up on it so I could answer my own question.

        I can't imagine how I've gone decades not having bumped into it.