in reply to Back reference in s///g ?

You probably need to use zero-width look-behind/look-ahead assertions:

$in =~ s/(?<=[a-zA-Z])\n(?=[a-zA-Z])/ /g;
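A minimal sketch of that substitution on made-up input (the sample string and the `$out` variable are assumptions for illustration, not taken from the original question):

```perl
use strict;
use warnings;

# Hypothetical sample: a wrapped line, a break after punctuation,
# and a break before a digit.
my $in = "one\ntwo.\nthree\n4";

# Replace a newline with a space only when a letter sits on both sides.
# The look-behind and look-ahead test the neighbours without consuming them.
(my $out = $in) =~ s/(?<=[a-zA-Z])\n(?=[a-zA-Z])/ /g;

print $out, "\n";
```

Only the newline between "one" and "two" is replaced; the other two have a non-letter on one side and are left alone.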

Re^2: Back reference in s///g ?
by Anonymous Monk on Mar 24, 2008 at 19:27 UTC
    Look-ahead and look-behind regex sub-expressions can be very useful; you should definitely look into them. See perlre and perlretut.

    Technically speaking, you are not using back references (which look like \1 or \2), but capture variables (i.e., $1 and $2). A back reference is used within a regular expression to refer to a sub-string that has been captured by the corresponding preceding pair of capturing parentheses. A capture variable is used after a regular expression has done its work, and likewise gives access to the text captured by the corresponding capturing parentheses. (The replacement string in an s/pattern/replacement/ substitution is formed after the regular expression has executed.) And yes, capturing parentheses, like most other good things in life, have a (performance) price, although usually a trivial one.
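To make the distinction concrete, here is a small sketch on an assumed sample string (the doubled-word example and the variable names are illustrative only):

```perl
use strict;
use warnings;

my $text = "this is is a test";   # hypothetical sample

# \1 is a back reference: it is part of the pattern and must match
# the same text that the first group has already captured.
my $doubled;
if ($text =~ /\b(\w+) \1\b/) {
    # $1 is a capture variable: it is read after the match has succeeded.
    $doubled = $1;
}

# In s///, the replacement is built after the match, so it uses $1.
(my $fixed = $text) =~ s/\b(\w+) \1\b/$1/g;

print "doubled word: $doubled\n";
print "$fixed\n";
```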

    By the way, it would be possible to do what you want to do (as far as I understand it) without look-ahead/look-behind assertions, as follows:

    $in =~ s{ ([^a-zA-Z]) \s+ ([^a-zA-Z]) }{$1\n$2}xmsg;

    (That is, any run of whitespace with a non-alphabetic character both before and after it is replaced with a single newline.) However, there is a potential edge-case 'failure' at the very beginning or the very end of the string with this approach: the pattern needs an actual character on each side of the whitespace. You should consider what you want to happen if the string begins or ends with whitespace.
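The edge case can be seen in a short sketch. The helper names and inputs below are hypothetical, and the negative look-around variant is only one assumed way to treat the string edges, not something the post proposes:

```perl
use strict;
use warnings;

# Consuming version from the post: needs a real character on each side.
sub consuming {
    my ($s) = @_;
    $s =~ s{ ([^a-zA-Z]) \s+ ([^a-zA-Z]) }{$1\n$2}xmsg;
    return $s;
}

# Assumed alternative: negative look-arounds also succeed at the string
# edges, because "not preceded by a letter" holds when nothing precedes.
sub lookaround {
    my ($s) = @_;
    $s =~ s{ (?<![a-zA-Z]) \s+ (?![a-zA-Z]) }{\n}xmsg;
    return $s;
}

my $mid  = consuming("1.  2");   # whitespace between two non-letters
my $edge = consuming(" 2.");     # single leading space: nothing before it to capture
my $alt  = lookaround(" 2.");    # the leading space is replaced here

printf "%-10s => %s\n", $_->[0], join '\n', split /\n/, $_->[1], -1
    for [ mid => $mid ], [ edge => $edge ], [ alt => $alt ];
```

Whether the leading or trailing whitespace should be rewritten at all depends on what the data looks like, so treat the second helper as a starting point.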

Re^2: Back reference in s///g ?
by Anonymous Monk on Mar 24, 2008 at 18:26 UTC

    Hi jwkrahn,

    Your solution looks interesting. However, even after googling it, I do not really understand zero-width assertions (I have never touched the extended regex features). Would you mind explaining them a little more? Thank you.

    **For all,

    Thank you for your replies. The magic capturing parentheses work! However, as far as I know, parentheses are usually used to include a string stored in a variable in a pattern. If parentheses are the only way to do that inclusion, does it mean I will always get the capturing effect as well? Also, does capturing cost more in regular expression matching?

      The regex /(?<=[a-zA-Z])\n(?=[a-zA-Z])/ means "A newline, preceded by /[a-zA-Z]/ and followed by /[a-zA-Z]/." The length of the match is one, starting at the newline.

      Contrast with /[a-zA-Z]\n[a-zA-Z]/, which means "A /[a-zA-Z]/, then a newline, then a /[a-zA-Z]/." The length of the match is three, starting at the char before the newline.
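The difference in match position and length can be checked with the @- and @+ offset arrays. This is a sketch on assumed sample strings; the variable names are illustrative:

```perl
use strict;
use warnings;

my $in = "ab\ncd";   # hypothetical sample, newline at offset 2

# Zero-width assertions: only the newline is part of the match.
my ($la_start, $la_len);
if ($in =~ /(?<=[a-zA-Z])\n(?=[a-zA-Z])/) {
    ($la_start, $la_len) = ($-[0], $+[0] - $-[0]);
}

# Plain character classes: the surrounding letters are consumed as well.
my ($cc_start, $cc_len);
if ($in =~ /[a-zA-Z]\n[a-zA-Z]/) {
    ($cc_start, $cc_len) = ($-[0], $+[0] - $-[0]);
}

print "look-around: start $la_start, length $la_len\n";
print "consuming:   start $cc_start, length $cc_len\n";

# With /g the consumed letters matter: the "b" is used up by the first
# match, so the second newline is not seen by the consuming pattern.
(my $consumed = "a\nb\nc") =~ s/([a-zA-Z])\n([a-zA-Z])/$1 $2/g;
(my $asserted = "a\nb\nc") =~ s/(?<=[a-zA-Z])\n(?=[a-zA-Z])/ /g;
```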

      Regarding your capture question, there are optimizations in place in some circumstances. I don't know if this is one of them. But honestly, if you need to micro-optimize that much, get familiar with the Benchmark module.
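A rough sketch of such a measurement with the core Benchmark module follows. The test data and sub names are assumptions, and the numbers will vary with Perl version and input, so no result is claimed here:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data: many short alphabetic lines.
my $text = join "\n", ("alpha", "beta", "gamma") x 1000;

# Capturing version: letters are consumed and put back via $1 and $2.
sub with_captures {
    (my $s = $text) =~ s/([a-zA-Z])\n([a-zA-Z])/$1 $2/g;
    return $s;
}

# Look-around version: only the newline is matched and replaced.
sub with_lookaround {
    (my $s = $text) =~ s/(?<=[a-zA-Z])\n(?=[a-zA-Z])/ /g;
    return $s;
}

# Run each sub for at least one CPU second and print a comparison table.
cmpthese(-1, {
    captures   => \&with_captures,
    lookaround => \&with_lookaround,
});
```

On this particular input both subs produce the same string, which makes the timing comparison fair; on input with one-letter lines they would differ.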

      Does capturing cost more in a regular expression matching?

      According to perlre:

      WARNING: Once Perl sees that you need one of $&, $`, or $' anywhere in the program, it has to provide them for every pattern match. This may substantially slow your program. Perl uses the same mechanism to produce $1, $2, etc, so you also pay a price for each pattern that contains capturing parentheses. (To avoid this cost while retaining the grouping behaviour, use the extended regular expression "(?: ... )" instead.) But if you never use $&, $` or $', then patterns without capturing parentheses will not be penalized. So avoid $&, $', and $` if you can, but if you can't (and some algorithms really appreciate them), once you've used them once, use them at will, because you've already paid the price. As of 5.005, $& is not so costly as the other two.
      So there will be some minor overhead in using capturing parentheses.
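As a sketch of the "(?: ... )" grouping mentioned in the quote, applied to a pattern fragment held in a variable (the `$alt` and `$line` values are made up for illustration):

```perl
use strict;
use warnings;

my $alt  = "bar|baz";   # hypothetical alternation stored in a variable
my $line = "foo-baz";

# Capturing parentheses: group the alternation and also set $1.
# Without any grouping, /^foo-$alt$/ would mean "^foo-bar" or "baz$".
my $with_capture = $line =~ /^foo-($alt)$/ ? $1 : undef;

# (?: ... ) groups the same way but creates no capture group,
# so $1 is undefined after this successful match.
my $matched     = $line =~ /^foo-(?:$alt)$/ ? 1 : 0;
my $has_capture = defined $1 ? 1 : 0;

print "captured: $with_capture\n";
print "matched without capture: $matched, \$1 defined: $has_capture\n";
```

So parentheses are not the only way to group an interpolated fragment; the non-capturing form gives the grouping without the capture.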