the_0ne has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, I have a question about a two-pass regex substitution I'm trying to perform. I want to replace multi-word strings and single-word strings inside a larger string with the match wrapped in <B> tags. This code...

@array = ( "foo bar", "bar foo" );
@array2 = ( "foo", "bar" );
$string = "there is a foo bar and a bar foo and also foo and bar.";
print "string before - $string\n";
$swapString = join ("|", @array);
print "swapString 1 - $swapString\n";
$string =~ s/($swapString)/<B>$1<\/B>/gi;
print "string after first - $string\n";

...works correctly by giving me this...

string after first - there is a <B>foo bar</B> and a <B>bar foo</B> and also foo and bar.

The 'foo bar' and 'bar foo' are tagged correctly. Then I perform the swap on the single words, but I don't want to put bold tags on things that already have them, like the <B>foo bar</B>, so I run this regex...

$swapString = join ("|", @array2);
print "swapString 2 - [$swapString]\n";
$string =~ s/[^<B>]($swapString)[^<\/B>]/<B>$1<\/B>/gi;
print "string after first - $string\n";

This gives me this...

string after first - there is a <B>foo bar</B> and a <B>bar foo</B> and also<B>foo</B>and<B>bar</B>

...now this looks like it worked correctly, but notice the spaces between the words where the newly inserted bold tags are. They're gone and so is the period on the end. I'm not understanding how this...

$string =~ s/[^<B>]($swapString)[^<\/B>]/<B>$1<\/B>/gi;

...removed the first character before the match and the last character after the match. Can somebody explain this to me? Obviously I'm doing something wrong, but it looks like I have it right and it actually does work correctly except for the character before and after the match. Not sure why they disappear.

Thanks again monks for any knowledge you can share.

Replies are listed 'Best First'.
Re: Double regex match not working correctly.
by dailylemma (Scribe) on May 24, 2001 at 08:56 UTC
    The problem seems to be the location of your ()'s. To get the substitution to work correctly, use
    $string =~ s/([^<B>]($swapString)[^<\/B>])/<B>$1<\/B>/gi;
    Update: if you don't want the period and spaces to be bold, use
    $string =~ s/([^<B>])($swapString)([^<\/B>])/$1<B>$2<\/B>$3/gi;
      Actually, the second one handled it exactly like I wanted. Thanks, I tried a variation of that but noticed I had placed the parentheses the wrong way that time too.

      However, the first one didn't work how I thought it would. I think I copied it exactly as you had it, but it didn't work correctly for me. It gave me this...

      string after first - bar there is a <B>foo bar</B> and a <B>bar foo</B> and also foo <B>foo</B>and bar

      ...but the second substitution you gave worked perfectly. Thanks for showing me the error of my ways.
Re: Double regex match not working correctly.
by stephen (Priest) on May 24, 2001 at 10:40 UTC
    If all you're trying to do is replace a series of words, some of which are substrings of other words in the series, there is a simpler way--

    You could simply have your replacing regexp do all of the replacing at once. Alternation ('or') in a regexp tries the alternatives from left to right at each position, so you could simply combine your two lists, longer strings first, as such:

    @array = ( "foo bar", "bar foo" );
    @array2 = ( "foo", "bar" );
    $string = "there is a foo bar and a bar foo and also foo and bar.";
    print "string before - $string\n";
    $swapString = join ("|", @array, @array2);
    print "swapString 1 - $swapString\n";
    $string =~ s/($swapString)/<B>$1<\/B>/gi;
    print "string after first - $string\n";
    Which prints out:
    string before - there is a foo bar and a bar foo and also foo and bar.
    swapString 1 - foo bar|bar foo|foo|bar
    string after first - there is a <B>foo bar</B> and a <B>bar foo</B> and also <B>foo</B> and <B>bar</B>.
    To generalize this, you might want to use:
    $swapString = join ("|", sort { length($b) <=> length($a) } @array, @array2);
    which ensures that the longest strings are tried before the shorter ones.

    If there's some other reason for needing to do this in two passes, disregard this solution.
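
A minimal sketch of the single-pass approach described above, written with strict and warnings enabled. The variable names @phrases, @words, $swap_string and $tagged are made up for this illustration, and the quotemeta call is an added precaution that is not part of the original suggestion; it only matters if a search string contains regex metacharacters.

```perl
use strict;
use warnings;

my @phrases = ("foo bar", "bar foo");   # multi-word strings from the question
my @words   = ("foo", "bar");           # single-word strings from the question
my $string  = "there is a foo bar and a bar foo and also foo and bar.";

# Longest alternatives first, so "foo bar" is tried before "foo".
# quotemeta is an assumption added here: it escapes characters such as
# "." or "+" so that each search string is matched literally.
my $swap_string = join "|",
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @phrases, @words;

# Copy the string, then tag every match in one pass.
(my $tagged = $string) =~ s/($swap_string)/<B>$1<\/B>/gi;
print "$tagged\n";
```

Sorting by length is a simple way to get the ordering right without having to keep the two lists separate by hand.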

    stephen

Re: Double regex match not working correctly.
by larryk (Friar) on May 24, 2001 at 12:27 UTC
    Just a note since the other guys have answered your Q.
    $string =~ s/[^<B>]($swapString)[^<\/B>]/<B>$1<\/B>/gi;
    Square brackets define a character class: a set of characters, any one of which may match (or none of which, with a leading ^) at a single character position. So [^<B>] and [^<\/B>] each match any one character that is not listed inside the brackets. They do not mean "not the string <B>" or "not the string </B>", which I believe is what you were after.
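
To make the point concrete, here is a small sketch that tests a handful of single characters against [^<B>]. The hash name %class_matches and the sample characters are chosen only for this illustration.

```perl
use strict;
use warnings;

# [^<B>] is a negated set of three characters: "<", "B" and ">".
# It matches exactly one character that is none of those.
my %class_matches;
for my $char (" ", "x", "<", "B", ">") {
    $class_matches{$char} = ($char =~ /^[^<B>]$/) ? 1 : 0;
    print "'$char' ", ($class_matches{$char} ? "matches" : "does not match"), "\n";
}
```

A space or an "x" satisfies the class, while each of the three listed characters is rejected on its own, regardless of what surrounds it.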

    I expect I could explain this more clearly if it wasn't first thing in the morning! Basically with your $string the regex above would be as well written...

    $string =~ s/[^>]($swapString)[^<]/<B>$1<\/B>/gi;
    and if you want to not match something that's more than one character then try...
    #!/usr/bin/perl -w
    use strict;
    $_ = "<b>foo</b> foo <b>bar</b> bar <b>foo</b> foo <b>bar</b> bar\n";
    print;
    s/(?<!<b>)(foo|bar)(?!<\/b>)/XXX/gi;   # (?<!...) is look-behind; a (?!<b>) placed before the word would always succeed
    print;
    larryk
Re: Double regex match not working correctly.
by Masem (Monsignor) on May 24, 2001 at 15:07 UTC
    Note that you can avoid doing the second regex altogether by putting the two arrays together, with the longer phrases in front of the shorter ones, because alternation tries the alternatives from left to right.

    That is, if you change the join to join "|", (@array, @array2), then I get the following without modifying your regex at all:

    string after first - there is a <B>foo bar</B> and a <B>bar foo</B> and also <B>foo</B> and <B>bar</B>.


    Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
      Thanks Masem and all.

      I thought of doing it that way first, but in my mind I didn't think it would work because I thought something like this would happen...

      string after first - <B>bar</B> there is a <B><B>foo</B> <B>bar</B></B> and a <B><B>bar</B> <B>foo</B></B> and also <B>foo</B> and <B>bar</B>.
      See how the multi-word matches are also matched as single words, so that we end up with nested <B> tags. That's what I thought would have happened. Since I was using the g modifier, I thought the regex would go through and transform the multi-word matches and then go through again and match the single words inside them, which would not be good. But now I see that it would have worked. Next time I'll try my first instinct.
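
A short sketch of why the nesting does not happen: with the g modifier, each search resumes where the previous match ended, and text that has already been replaced is not scanned again. The names $swap_string, $tagged and $open_tags are illustrative only.

```perl
use strict;
use warnings;

my $string      = "there is a foo bar and a bar foo and also foo and bar.";
my $swap_string = "foo bar|bar foo|foo|bar";   # the joined list from this thread

# With /g, each search resumes where the previous match ended in the
# original text; the inserted replacement is never re-examined.
(my $tagged = $string) =~ s/($swap_string)/<B>$1<\/B>/gi;

# Count the opening tags: one per phrase or word, none nested.
my $open_tags = () = $tagged =~ /<B>/g;

print "$tagged\n";
print "opening tags: $open_tags\n";
```

Running the two-pass version instead is what makes nesting possible, because the second s///g starts over on text that already contains tags.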

      Thanks to all that posted.
Re: Double regex match not working correctly.
by tachyon (Chancellor) on May 24, 2001 at 10:15 UTC
    $string =~ s/[^<B>]($swapString)[^<\/B>]/<B>$1<\/B>/gi;

    ...removed the first character before the match and the last character after the match. Can somebody explain this to me?

    Yes, this is quite simple. On the left you match:

    [^<B>]        <- this matches any *single* char that is not a < B >
    ($swapString) <- this puts the match for $swapString into $1
    [^<\/B>]      <- this matches any *single* char that is not a < / B >

    So the match includes the characters before and after $swapString, and since you neither capture them nor put them back in the replacement, they disappear.

    The suggestion that you move the capture parentheses kind of works, but gives this output:

    $string =~ s/([^<B>]($swapString)[^<\/B>])/<B>$1<\/B>/gi;
    there is a <B>foo bar</B> and a <B>bar foo</B> and also<B> foo </B>and<B> bar.</B>

    This is a better way to do things, using what are known as look-behind and look-ahead assertions.

    $string =~ s/(?<!<B>)($swapString)(?!<\/B>)/<B>$1<\/B>/gi;
    # this gives:
    there is a <B>foo bar</B> and a <B>bar foo</B> and also <B>foo</B> and <B>bar</B>.

    which I expect is what you had in mind.

    These assertions give you a sneak peek of what is or is not immediately around a match. They do not consume any of the string, so you do not need to put back the bits they match.
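
As a rough illustration of the difference, the sketch below captures what each pattern actually matches in a fragment of the original string. The variable names $text, $consumed and $zero_width are invented for this example.

```perl
use strict;
use warnings;

my $text = "also foo and";

# Character classes consume one character on each side of the word.
my ($consumed)   = $text =~ /([^<B>]foo[^<\/B>])/;

# The assertions only inspect the neighbours; the match is the word alone.
my ($zero_width) = $text =~ /((?<!<B>)foo(?!<\/B>))/;

print "class match:      '$consumed'\n";
print "assertion match:  '$zero_width'\n";
```

The first capture includes the surrounding spaces, which is exactly the text that s/// then replaces; the second capture is just the word.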

    The assertions are:

    (?=pattern)
    A zero-width positive look-ahead assertion. For example, /\w+(?=\t)/ matches a word followed by a tab, without including the tab in $&.

    (?!pattern)
    A zero-width negative look-ahead assertion. For example /foo(?!bar)/ matches any occurrence of ``foo'' that isn't followed by ``bar''. Note however that look-ahead and look-behind are NOT the same thing. You cannot use this for look-behind.

    If you are looking for a ``bar'' that isn't preceded by a ``foo'', /(?!foo)bar/ will not do what you want. That's because the (?!foo) is just saying that the next thing cannot be ``foo''--and it's not, it's a ``bar'', so ``foobar'' will match. You would have to do something like /(?!foo)...bar/ for that. We say ``like'' because there's the case of your ``bar'' not having three characters before it. You could cover that this way: /(?:(?!foo)...|^.{0,2})bar/. Sometimes it's still easier just to say: if (/bar/ && $` !~ /foo$/)

    For look-behind see below.

    (?<=pattern)
    A zero-width positive look-behind assertion. For example, /(?<=\t)\w+/ matches a word that follows a tab, without including the tab in $&. Works only for fixed-width look-behind.

    (?<!pattern)
    A zero-width negative look-behind assertion. For example /(?<!bar)foo/ matches any occurrence of ``foo'' that does not follow ``bar''. Works only for fixed-width look-behind.
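
The look-ahead pitfall quoted above can be checked with a short sketch. The variables $text, $lookahead_hit and $lookbehind_hit are illustrative names, not part of the documentation.

```perl
use strict;
use warnings;

my $text = "foobar";

# Look-ahead placed before "bar" checks the text starting at "bar" itself,
# so it cannot exclude a preceding "foo".
my $lookahead_hit  = ($text =~ /(?!foo)bar/)  ? 1 : 0;

# Look-behind checks the three characters just before "bar".
my $lookbehind_hit = ($text =~ /(?<!foo)bar/) ? 1 : 0;

print "(?!foo)bar  matches '$text': $lookahead_hit\n";
print "(?<!foo)bar matches '$text': $lookbehind_hit\n";
```

Only the look-behind form rejects "foobar", which is why the substitution above uses (?<!<B>) in front of the word and (?!<\/B>) after it.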

    hope this helps

    tachyon