Back reference in s///g ?

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: Back reference in s///g ? by kyle (Abbot) on Mar 24, 2008 at 17:48 UTC
The variables `$1`, etc. refer to the text matched by expressions in (capturing) parentheses. I think you want this: `$in =~ s/([a-zA-Z])\n([a-zA-Z])/$1\ $2/g;`
Re: Back reference in s///g ? by ww (Archbishop) on Mar 24, 2008 at 18:02 UTC
The second regex doesn't work because you're not capturing anything to `$1` or `$2`. To capture something to `$1`, etc., use parens. Further, there's no need to escape the space in the replacement. But, ignoring the issue of "non-English" (do you mean a different character set, like Greek or Japanese, by any chance?), you're close. Try this:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $in = "foo bar blivitz done";
$in =~ s/\s+/\n/g;    # replace spaces with \n
print "after first regex: $in \n";

# $in =~ s/[a-zA-Z]\n[a-zA-Z]/$1\ $2/g;
$in =~ s/([a-zA-Z])\n([a-zA-Z])/$1 $2/g;    # remove newlines, restore spaces
print "after second regex: $in \n";

=head execution:

ww@GIG:~/pl_test$ perl 675941.pl
after first regex: foo
bar
blivitz
done
after second regex: foo bar blivitz done
ww@GIG:~/pl_test$

=cut
```
Re: Back reference in s///g ? by olus (Curate) on Mar 24, 2008 at 17:50 UTC
You are not capturing anything. You must use capturing parentheses, `()`: `$in =~ s/([a-zA-Z])\n([a-zA-Z])/$1\ $2/g;`
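To make the point about missing parentheses concrete, here is a minimal sketch comparing the two substitutions. The sub names and the sample string are made up for illustration; the `uninitialized` warning is silenced only so the broken version runs quietly.

```perl
use strict;
use warnings;

# Hypothetical helper: the pattern has no capturing parentheses,
# so $1 and $2 are undefined in the replacement.
sub join_without_captures {
    my ($s) = @_;
    no warnings 'uninitialized';    # $1 and $2 are undef here
    $s =~ s/[a-zA-Z]\n[a-zA-Z]/$1 $2/g;
    return $s;
}

# Hypothetical helper: the same pattern with capturing parentheses.
sub join_with_captures {
    my ($s) = @_;
    $s =~ s/([a-zA-Z])\n([a-zA-Z])/$1 $2/g;
    return $s;
}

print join_without_captures("foo\nbar"), "\n";    # fo ar
print join_with_captures("foo\nbar"),    "\n";    # foo bar
```

Without the parentheses the letters on either side of the newline are matched, and therefore replaced, but nothing is saved to put them back.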
Re: Back reference in s///g ? by jwkrahn (Abbot) on Mar 24, 2008 at 18:11 UTC
You probably need to use zero-width look-behind/look-ahead assertions: `$in =~ s/(?<=[a-zA-Z])\n(?=[a-zA-Z])/ /g;`
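One case where the assertions behave differently from the capturing version is single-letter words. The sketch below assumes a sample string of `"a\nb\nc"` and uses made-up sub names. Because the capturing pattern consumes the letter after each newline, that letter is no longer available as the "letter before" for the next newline under `/g`.

```perl
use strict;
use warnings;

# Hypothetical helper: consumes one letter on each side of the newline.
sub join_with_captures {
    my ($s) = @_;
    $s =~ s/([a-zA-Z])\n([a-zA-Z])/$1 $2/g;
    return $s;
}

# Hypothetical helper: only the newline itself is consumed.
sub join_with_lookaround {
    my ($s) = @_;
    $s =~ s/(?<=[a-zA-Z])\n(?=[a-zA-Z])/ /g;
    return $s;
}

my $input = "a\nb\nc";
for my $result ( join_with_captures($input), join_with_lookaround($input) ) {
    ( my $shown = $result ) =~ s/\n/\\n/g;    # make remaining newlines visible
    print "$shown\n";
}
# prints:
# a b\nc
# a b c
```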
Re^2: Back reference in s///g ? by Anonymous Monk on Mar 24, 2008 at 19:27 UTC
Look-ahead and look-behind regex sub-expressions can be very useful; you should definitely look into them. See perlre and perlretut.

Technically speaking, you are not using back references (which look like `\1` or `\2`), but capture variables (i.e., `$1` and `$2`). A back reference is used within a regular expression to refer to a sub-string that has been captured by the corresponding preceding pair of capturing parentheses. A capture variable is used after a regular expression has done its work, but also allows access to text captured by the corresponding capturing parentheses. (The replacement string in a `s/pattern/replacement/` substitution operation is formed after the regular expression has executed.)

And yes, capturing parentheses, like most other good things in life, have a (performance) price, although usually a trivial one.

By the way, it would be possible to do what you want to do (as far as I understand it) without look-ahead/behind regex expressions, as follows: `$in =~ s{ ([^a-zA-Z]) \s+ ([^a-zA-Z]) }{$1\n$2}xmsg;` (that is, any whitespace substring with non-alphabetic characters both before and after it is replaced with a single newline). However, there is a potential edge-case 'failure' at the very beginning or the very end of the string with this approach. You should consider what you want to have happen if the string either begins or ends with a string of whitespace characters.
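The distinction between a back reference and a capture variable can be sketched with a doubled-word example. This is an illustration only: the sub names and sample sentence are invented, and the doubled-word task is not part of the original question.

```perl
use strict;
use warnings;

# Back reference: \1 is used INSIDE the pattern and must match the same
# text that the first pair of parentheses captured.
sub has_doubled_word {
    my ($text) = @_;
    return $text =~ /\b(\w+)\s+\1\b/ ? 1 : 0;
}

# Capture variable: $1 is used in the replacement, AFTER the pattern
# has matched, to put one copy of the word back.
sub collapse_doubled_words {
    my ($text) = @_;
    $text =~ s/\b(\w+)\s+\1\b/$1/g;
    return $text;
}

print has_doubled_word("this is is a test"),       "\n";    # 1
print collapse_doubled_words("this is is a test"), "\n";    # this is a test
```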
Re^2: Back reference in s///g ? by Anonymous Monk on Mar 24, 2008 at 18:26 UTC
Hi jwkrahn, your solution looks interesting. However, after googling it, I still don't really understand zero-width assertions (I have never touched the extended syntax). Would you explain it a little more? Thank you.

To all: thank you for your replies. The magic capturing parentheses work! However, as far as I know, parentheses are usually used to include a string stored in a variable in a pattern. If parentheses are the only way to do the inclusion, does that mean I will always get the capturing effect? Also, does capturing cost more in regular expression matching?
Re^3: Back reference in s///g ? by ikegami (Patriarch) on Mar 24, 2008 at 18:54 UTC
The regex `/(?<=[a-zA-Z])\n(?=[a-zA-Z])/` means "A newline, preceded by `/[a-zA-Z]/` and followed by `/[a-zA-Z]/`." The length of the match is one, starting at the newline. Contrast with `/[a-zA-Z]\n[a-zA-Z]/`, which means "A `/[a-zA-Z]/`, then a newline, then a `/[a-zA-Z]/`." The length of the match is three, starting at the char before the newline. Regarding your capture question, there are optimizations in place in some circumstances. I don't know if this is one of them. But honestly, if you need to micro-optimize that much, get familiar with the benchmarker.
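The two match lengths can be checked with the built-in `@-` and `@+` arrays, which hold the start and end offsets of the last successful match. A small sketch, where `match_span` and the sample string are hypothetical names chosen for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: returns (start offset, length) of the first match,
# or an empty list if the pattern does not match.
sub match_span {
    my ( $text, $re ) = @_;
    return unless $text =~ $re;
    return ( $-[0], $+[0] - $-[0] );
}

my $text = "ab\ncd";    # offsets: a=0 b=1 \n=2 c=3 d=4

my ( $start, $len ) = match_span( $text, qr/(?<=[a-zA-Z])\n(?=[a-zA-Z])/ );
print "lookaround: start=$start length=$len\n";    # start=2 length=1

( $start, $len ) = match_span( $text, qr/[a-zA-Z]\n[a-zA-Z]/ );
print "plain:      start=$start length=$len\n";    # start=1 length=3
```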
Re^3: Back reference in s///g ? by jwkrahn (Abbot) on Mar 24, 2008 at 23:01 UTC
"Does capturing cost more in a regular expression matching?"

According to perlre: "WARNING: Once Perl sees that you need one of $&, $`, or $' anywhere in the program, it has to provide them for every pattern match. This may substantially slow your program. Perl uses the same mechanism to produce $1, $2, etc, so you also pay a price for each pattern that contains capturing parentheses. (To avoid this cost while retaining the grouping behaviour, use the extended regular expression "(?: ... )" instead.) But if you never use $&, $` or $', then patterns without capturing parentheses will not be penalized. So avoid $&, $', and $` if you can, but if you can't (and some algorithms really appreciate them), once you've used them once, use them at will, because you've already paid the price. As of 5.005, $& is not so costly as the other two."

So there will be some minor overhead in using capturing parentheses.
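For anyone who wants to measure the overhead rather than guess, a rough sketch using the core Benchmark module follows. The sample data and sub names are assumptions made for illustration, and the numbers will vary with the Perl version and the input, so treat any result as indicative only. The sketch also shows `(?: ... )` grouping without capturing.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample data: 50 short words separated by newlines.
my $text = join "\n", ('foo') x 50;

sub with_captures {
    my $s = $text;
    $s =~ s/([a-zA-Z])\n([a-zA-Z])/$1 $2/g;
    return $s;
}

sub with_lookaround {
    my $s = $text;
    $s =~ s/(?<=[a-zA-Z])\n(?=[a-zA-Z])/ /g;
    return $s;
}

# (?: ... ) groups the alternation without capturing it,
# so the only capture group here is (bar).
my $grouped = "foobar" =~ /(?:foo|baz)(bar)/ ? $1 : undef;
print "captured: $grouped\n";    # captured: bar

# Run each sub for at least one CPU second and print a comparison table.
cmpthese( -1, {
    captures   => \&with_captures,
    lookaround => \&with_lookaround,
} );
```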