Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

I want to use a back reference in s///g, but it seems I failed with the code below:

$in =~ s/\s+/\n/g;
$in =~ s/[a-zA-Z]\n[a-zA-Z]/$1\ $2/g;
First I split the (non-English) characters on whitespace, then I join them back where I split them wrongly (e.g. a space between two English letters). The regular expression in line 2 fails: the space is placed correctly, but the characters before and after it are missing, replaced by nothing (so the characters disappear). So I know the back reference is not working here. Did I fail to call the back reference just because of a stupid syntax error?

Replies are listed 'Best First'.
Re: Back reference in s///g ?
by kyle (Abbot) on Mar 24, 2008 at 17:48 UTC

    The references $1, etc. refer to expressions in (capturing) parentheses. I think you want this:

    $in =~ s/([a-zA-Z])\n([a-zA-Z])/$1\ $2/g;
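A minimal sketch of the difference, assuming a made-up sample string (`$text` and the variable names are illustrative, not from the original post). Without capturing parentheses, `$1` and `$2` are undefined in the replacement, so the letters around the newline vanish:

```perl
use strict;
use warnings;

my $text = "ab\ncd";    # hypothetical sample input

# No capturing parentheses: $1 and $2 are undefined, so the matched
# letters are replaced by empty strings.
my $broken = $text;
{
    no warnings 'uninitialized';    # silence the "uninitialized value $1" warning for this demo
    $broken =~ s/[a-zA-Z]\n[a-zA-Z]/$1 $2/g;
}

# Capturing parentheses: $1 and $2 hold the letters on either side.
my $fixed = $text;
$fixed =~ s/([a-zA-Z])\n([a-zA-Z])/$1 $2/g;

print "without parentheses: [$broken]\n";    # [a d]
print "with parentheses:    [$fixed]\n";     # [ab cd]
```

Running the first substitution with warnings enabled reports the use of an uninitialized `$1`, which is a quick way to spot this mistake.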
Re: Back reference in s///g ?
by ww (Archbishop) on Mar 24, 2008 at 18:02 UTC
    The second regex doesn't work because you're not capturing anything to $1 or $2. To capture something to $1, etc, use parens.

    Further there's no need to escape the space in the replacement.

    But, ignoring the issue of "non-English" (do you mean a different character set, like Greek or Japanese by any chance?), you're close. Try this:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $in = "foo bar blivitz done";
    $in =~ s/\s+/\n/g;    # replace spaces with \n
    print "after first regex: $in \n";
    # $in =~ s/[a-zA-Z]\n[a-zA-Z]/$1\ $2/g;
    $in =~ s/([a-zA-Z])\n([a-zA-Z])/$1 $2/g;    # remove newlines, restore spaces
    print "after second regex: $in \n";

    =head execution:
    ww@GIG:~/pl_test$ perl 675941.pl
    after first regex: foo
    bar
    blivitz
    done
    after second regex: foo bar blivitz done
    ww@GIG:~/pl_test$
    =cut
Re: Back reference in s///g ?
by olus (Curate) on Mar 24, 2008 at 17:50 UTC

    You are not capturing anything. You must use the ().

    $in =~ s/([a-zA-Z])\n([a-zA-Z])/$1\ $2/g;
Re: Back reference in s///g ?
by jwkrahn (Abbot) on Mar 24, 2008 at 18:11 UTC

    You probably need to use zero-width look-behind/look-ahead assertions:

    $in =~ s/(?<=[a-zA-Z])\n(?=[a-zA-Z])/ /g;
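One practical difference between the look-around version and the capturing version shows up with single letters on consecutive lines. The sketch below uses a made-up input (`$text` is an assumption for illustration): the capturing pattern consumes the letter after the newline, so that letter is no longer available as the "letter before" for the next newline, while the zero-width assertions consume only the newline itself.

```perl
use strict;
use warnings;

my $text = "a\nb\nc";    # hypothetical input: single letters on separate lines

# Capturing version: the match "a\nb" consumes "b", so the second
# newline is never preceded by an unconsumed letter and survives.
(my $captured = $text) =~ s/([a-zA-Z])\n([a-zA-Z])/$1 $2/g;

# Look-around version: only the newline is consumed, so both are replaced.
(my $lookaround = $text) =~ s/(?<=[a-zA-Z])\n(?=[a-zA-Z])/ /g;

# Show newlines visibly when printing.
(my $shown = $captured) =~ s/\n/\\n/g;
print "capturing:   $shown\n";         # a b\nc
print "look-around: $lookaround\n";    # a b c
```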
      Look-ahead and look-behind regex sub-expressions can be very useful; you should definitely look into them. See perlre and perlretut.

      Technically speaking, you are not using back references (which look like \1 or \2), but capture variables (i.e., $1 and $2). A back reference is used within a regular expression to refer to a sub-string that has been captured by the corresponding preceding pair of capturing parentheses. A capture variable is used after a regular expression has done its work, but also allows access to text captured by the corresponding capturing parentheses. (The replacement string in a s/pattern/replacement/ substitution operation is formed after the regular expression has executed.) And yes, capturing parentheses, like most other good things in life, have a (performance) price, although usually a trivial one.
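To make the distinction concrete, here is a small sketch with an invented example word (`$word` is not from the thread). `\1` is used inside the pattern to demand a repeat of what group 1 matched; `$1` is used in the replacement, after the match has succeeded.

```perl
use strict;
use warnings;

my $word = "bookkeeper";    # hypothetical example word

# Back reference: \1 inside the pattern must match the same text
# that the first capturing group just matched (a doubled letter).
my $has_double = ($word =~ /([a-z])\1/) ? 1 : 0;

# Capture variable: $1 in the replacement refers to the captured letter,
# so each doubled letter collapses to a single one.
(my $collapsed = $word) =~ s/([a-z])\1/$1/g;

print "has a doubled letter: $has_double\n";    # 1
print "collapsed: $collapsed\n";                # bokeper
```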

      By the way, it would be possible to do what you want to do (as far as I understand it) without look-ahead/behind regex expressions, as follows:

      $in =~ s{ ([^a-zA-Z]) \s+ ([^a-zA-Z]) }{$1\n$2}xmsg;

      (that is, any whitespace substring with non-alphabetic characters both before and after it is replaced with a single newline). (However, there is a potential edge-case 'failure' at the very beginning or the very end of the string with this approach. You should consider what you want to have happen if the string either begins or ends with a string of whitespace characters.)
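A rough sketch of how that substitution behaves, using digits as stand-ins for non-alphabetic characters (the helper name `split_non_alpha` and the inputs are assumptions for illustration). The second call shows the kind of edge case mentioned above: a leading whitespace character is itself non-alphabetic, so it can be matched as the "character before".

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution shown above.
sub split_non_alpha {
    my ($in) = @_;
    $in =~ s{ ([^a-zA-Z]) \s+ ([^a-zA-Z]) }{$1\n$2}xmsg;
    return $in;
}

# Whitespace between two non-alphabetic characters becomes a newline;
# whitespace next to a letter is left alone.
my $middle = split_non_alpha("12 34 ab cd");    # "12\n34 ab cd"

# Two leading spaces: the first space is captured as the "character before",
# so the result starts with a space followed by a newline.
my $leading = split_non_alpha("  12 34");       # " \n12\n34"

for my $result ($middle, $leading) {
    (my $shown = $result) =~ s/\n/\\n/g;        # make newlines visible
    print "[$shown]\n";
}
```

Because each match consumes the character after the whitespace, single characters separated by single spaces (for example "1 2 3") are only split at every other gap; the look-around form avoids that.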

      Hi jwkrahn,

      Your solution looks interesting. However, even after googling it I don't really understand zero-width assertions (I have never touched the extended regex features). Would you mind explaining them a little more? Thank you.

      To all,

      Thank you for your replies. The magic capturing parentheses work! However, as far as I know, parentheses are usually used to include a string stored in a variable in a pattern. If parentheses are the only way to do that inclusion, does it mean I always get the capturing effect? Also, does capturing cost more in regular expression matching?

        The regex /(?<=[a-zA-Z])\n(?=[a-zA-Z])/ means "A newline, preceded by /[a-zA-Z]/ and followed by /[a-zA-Z]/." The length of the match is one, starting at the newline.

        Contrast with /[a-zA-Z]\n[a-zA-Z]/ which means "A /[a-zA-Z]/, then a newline, then a /[a-zA-Z]/." The length of the match is three, starting at the char before the newline.
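The two match lengths can be checked with Perl's built-in `@-` and `@+` arrays, which hold the start and end offsets of the last successful match. This is a sketch on an invented sample string (`$text` and the variable names are illustrative):

```perl
use strict;
use warnings;

my $text = "ab\ncd";    # hypothetical sample: offsets a=0, b=1, newline=2, c=3, d=4

my ($la_start, $la_len, $plain_start, $plain_len);

# Look-around version: only the newline is part of the match.
if ($text =~ /(?<=[a-zA-Z])\n(?=[a-zA-Z])/) {
    ($la_start, $la_len) = ($-[0], $+[0] - $-[0]);
}

# Plain version: both letters are part of the match.
if ($text =~ /[a-zA-Z]\n[a-zA-Z]/) {
    ($plain_start, $plain_len) = ($-[0], $+[0] - $-[0]);
}

print "look-around: start $la_start, length $la_len\n";          # start 2, length 1
print "plain:       start $plain_start, length $plain_len\n";    # start 1, length 3
```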

        Regarding your capture question, there are optimizations in place in some circumstances; I don't know whether this is one of them. But honestly, if you need to micro-optimize that much, get familiar with the Benchmark module.
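For anyone who does want to measure, a minimal sketch using the core Benchmark module's `cmpthese` might look like this. The sample data, the sub names, and the iteration count are all assumptions chosen for illustration; results depend on the Perl version and the input, so no timings are claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test input: 1000 two-letter chunks separated by newlines.
my $sample = join "\n", ('ab') x 1000;

sub with_captures {
    my ($s) = @_;
    $s =~ s/([a-zA-Z])\n([a-zA-Z])/$1 $2/g;
    return $s;
}

sub with_lookaround {
    my ($s) = @_;
    $s =~ s/(?<=[a-zA-Z])\n(?=[a-zA-Z])/ /g;
    return $s;
}

# Run each variant 200 times and print a comparison table.
# Increase the count (or pass a negative number of CPU seconds)
# for more reliable figures.
cmpthese(200, {
    captures   => sub { with_captures($sample) },
    lookaround => sub { with_lookaround($sample) },
});
```

On this particular input both variants produce the same string, so the comparison is like for like.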

        Does capturing cost more in a regular expression matching?

        According to perlre:

        WARNING: Once Perl sees that you need one of $&, $`, or $' anywhere in the program, it has to provide them for every pattern match. This may substantially slow your program. Perl uses the same mechanism to produce $1, $2, etc, so you also pay a price for each pattern that contains capturing parentheses. (To avoid this cost while retaining the grouping behaviour, use the extended regular expression "(?: ... )" instead.) But if you never use $&, $` or $', then patterns without capturing parentheses will not be penalized. So avoid $&, $', and $` if you can, but if you can't (and some algorithms really appreciate them), once you've used them once, use them at will, because you've already paid the price. As of 5.005, $& is not so costly as the other two.
        So there will be some minor overhead in using capturing parentheses.
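As a small illustration of the "(?: ... )" form mentioned in the quote, the sketch below groups an alternation without capturing it, so the only capture variable set is for the part that is actually wanted. The sample string and variable names are invented for the example.

```perl
use strict;
use warnings;

my $text = "foobar";    # hypothetical sample input

# (?:foo|baz) groups the alternation but does not capture it,
# so the first (and only) capture group is (bar).
my ($captured) = $text =~ /(?:foo|baz)(bar)/;

# With ordinary parentheses the alternation becomes group 1 and "bar" group 2.
my @both = $text =~ /(foo|baz)(bar)/;

print "non-capturing group: $captured\n";    # bar
print "capturing groups:    @both\n";        # foo bar
```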