gatito has asked for the wisdom of the Perl Monks concerning the following question:

I'm making some silly regex mistake. I would like to tag words that end with '.' (abbreviations and initials, for example "S."), but my substitution regex is wiping out the whitespace before and after the word. I could force the whitespace back in, but that would mess up cases where "S." is the last word in the sentence.

I've tried using word boundaries, but beyond that I'm not sure what to do. Thanks for advice in advance.

code

#!/usr/local/bin/perl
$text = "Paulson and Federal Reserve Chairman Ben S. Bernanke proposed the plan after the collapse";
$text =~ s/\b[^<]S\.[^=]\b/\<S\.=initial\>/g;
print "$text\n";

actual result ( no space before < and after > ):

Paulson and Federal Reserve Chairman Ben<S.=initial>Bernanke proposed the plan after the collapse

desired result ( preserves spaces ):

Paulson and Federal Reserve Chairman Ben <S.=initial> Bernanke proposed the plan after the collapse

Replies are listed 'Best First'.
Re: Regex is eating up whitespace
by GrandFather (Saint) on Sep 29, 2008 at 22:12 UTC

    Aside from this probably being an inadequate approach to solving the actual problem, you are using character class matches where you need zero-width assertions (anchors). Changing [^<] to (?<!<) avoids matching the preceding character, but still ensures that it isn't <. Try:

    $text =~ s/\b(?<!<)S\.(?!=)/<S.=initial>/g;

    Remember that the substitution replaces all the characters matched, so you must either capture and re-insert any "extra" matched characters or not match them in the first place (anchors don't "match" characters in this sense).
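A minimal sketch of the difference described above, run on a shortened version of the sample sentence. The variable names ($original, $consumed, $asserted) are made up for illustration; the two patterns are the ones quoted in this thread.

```perl
use strict;
use warnings;

my $original = "Chairman Ben S. Bernanke proposed the plan";

# Character classes consume one character each, so the spaces around
# "S." become part of the match and are replaced along with it.
(my $consumed = $original) =~ s/\b[^<]S\.[^=]\b/<S.=initial>/g;

# Lookarounds only assert what is (or is not) next to the match;
# they consume nothing, so the surrounding spaces survive.
(my $asserted = $original) =~ s/\b(?<!<)S\.(?!=)/<S.=initial>/g;

print "$consumed\n";   # Chairman Ben<S.=initial>Bernanke proposed the plan
print "$asserted\n";   # Chairman Ben <S.=initial> Bernanke proposed the plan
```

The first substitution reproduces the missing spaces from the question; the second keeps them because (?<!<) and (?!=) are zero-width.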


    Perl reduces RSI - it saves typing
      Interesting, I wasn't aware of anchors at all. That definitely makes it easier.

      I ended up coming up with this, which seems to work fine and avoids previously tagged cases, but the anchored version is better.

      $text =~ s/(\bS\.(?=\s))/\<\1=initial\>/g;

        You should use $1 instead of \1 in the substitution. < and > are not magical in regexen and, apart from $ and @ (which interpolate) and the backslash, there are no magical characters in the substitution string in any case, so you do not need to escape < and >. Perl allows you to use different delimiters for regexes, which can often make the regex much easier to read. Consider:

        $text =~ s!(\bS\.(?=\s))!<$1=initial>!g;
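As a rough illustration of the delimiter point, the sketch below runs the same substitution with three delimiter styles on a shortened sample string. The variable names are invented for the example; the pattern is the one from the post above.

```perl
use strict;
use warnings;

my $text = "Ben S. Bernanke";

# Same substitution with three delimiter styles; none needs < or > escaped,
# and $1 (not \1) refers to the captured text in the replacement.
(my $slashes = $text) =~ s/(\bS\.(?=\s))/<$1=initial>/g;
(my $bangs   = $text) =~ s!(\bS\.(?=\s))!<$1=initial>!g;
(my $braces  = $text) =~ s{(\bS\.(?=\s))}{<$1=initial>}g;

# Each line prints: Ben <S.=initial> Bernanke
print "$_\n" for $slashes, $bangs, $braces;
```

One caveat on the choice of delimiter: ! works here because the pattern contains no !, but a pattern using a negative lookahead such as (?!=) would need a different delimiter or an escape.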

        You should definitely read the Regex chapter in the Camel book. It's a tough slog - took me about 10 rereads to finally get it. But, it's extremely worth it. I'm no regex master by any measure, but understanding that chapter went a long way to getting me fluent in Perl.

        My criteria for good software:
        1. Does it work?
        2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: Regex is eating up whitespace
by moritz (Cardinal) on Sep 29, 2008 at 21:51 UTC

    Update: it seems that the missing <c>... tags were the problem, which ate up all character classes. In which case I recommend reading Writeup Formatting Tips.

    actual result ( no space before < and after > ):

    Actual result when I run your code: the regex doesn't match at all.

    If you give us code that produces what you say it does then we might help you to change it so that it produces what you want.

Re: Regex is eating up whitespace
by JavaFan (Canon) on Sep 29, 2008 at 21:48 UTC
    It seems your [^<] and [^=] are the "problem". They match the whitespace. Since you just want to match "S.", why not just match that?
    s/S\./<S.=initial>/g;
    will do.
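A small sketch comparing this bare pattern with the lookaround version from elsewhere in the thread when the substitution is run twice over the same text, which matters if previously tagged cases should be left alone. The helper names tag_bare and tag_guarded are hypothetical, introduced only for this comparison.

```perl
use strict;
use warnings;

# Hypothetical helpers, named here only for illustration.
sub tag_bare {
    my ($s) = @_;
    $s =~ s/S\./<S.=initial>/g;                # matches "S." anywhere
    return $s;
}

sub tag_guarded {
    my ($s) = @_;
    $s =~ s/\b(?<!<)S\.(?!=)/<S.=initial>/g;   # skips "S." already inside a tag
    return $s;
}

my $text = "Ben S. Bernanke";

my $bare_twice    = tag_bare(tag_bare($text));
my $guarded_twice = tag_guarded(tag_guarded($text));

print "$bare_twice\n";      # Ben <<S.=initial>=initial> Bernanke
print "$guarded_twice\n";   # Ben <S.=initial> Bernanke
```

Both versions preserve the spaces on the first pass; they differ only when the text already contains tags.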