Regex'ing backreferences

Uruk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Regex'ing backreferences by zejames (Hermit) on Sep 21, 2000 at 13:24 UTC
I think the problem here is to say that you do NOT want bar in $1. Because if you wanted it, it may be easier :-) Something like this should hopefully do want you want: `/foo(?!.?bar.?)(.?)foo/;` [download] You first try to find 'foo'. OK. Then you check if the substring between 'foo' and the other 'foo' contains 'bar', but with a zero-width assertion. If that does not contain 'bar', then you can save it into $1. Thank you :-) thanks to your question I have understood why zero-witdh assertion is really cool!! Hope this helps. Bye James Update: and what about doing: `/foo(?!.?bar.?foo)(.?)foo/;` [download] I think it is better, unless you show me another "worst case" :-))	[reply] [d/l] [select]
RE (tilly) 2: Regex'ing backreferences by tilly (Archbishop) on Sep 21, 2000 at 14:37 UTC
You unfortunately will disallow strings with bar after the second foo. What you want is this: `/foo((?:(?!bar).))foo/` [download] Note that (?:stuff) is a non-grouping set of parens, and (?!stuff) is a negative assertion. UPDATE* I noticed a while ago that my RE was incorrect, decided to leave it for someone else to find and mention. But nobody did so I will mention that I forgot a ?: `/foo((?:(?!bar).)*?)foo/` [download]	[reply] [d/l] [select]
RE: RE (tilly) 2: Regex'ing backreferences by japhy (Canon) on Sep 21, 2000 at 15:56 UTC
While this method works, it is a bit slow, since it checks for the existence of 'bar' EVERY character. Use the "unrolling the loop" technique, as found in "Mastering Regular Expressions": Update: tilly's right, mine fails to match on a specific type of string ("foobabfoo"). Fixed below. `m{ foo ( # save to $1 [^b]* # 0 or more non-b characters (?: b+ # 1 or more b's (?: a(?!r) # an 'a' NOT followed by an 'r' \| # or [^a] # a non-a character ) [^b]* # 0 or more non-b characters )* # that whole b... subset, 0 or more times ) foo # and foo }x;` [download] The general pattern behind unrolling the loop is to get your regex to look something like `OK* ( NOTOK OK* )*`. `$_="goto+F.print+chop;\n=yhpaj";F1:eval`	[reply] [d/l]
RE (tilly) 4: Regex'ing backreferences by tilly (Archbishop) on Sep 21, 2000 at 16:10 UTC
RE (tilly) 4 (good try): Regex'ing backreferences by tilly (Archbishop) on Sep 21, 2000 at 18:23 UTC
RE: RE (tilly) 4 (good try): Regex'ing backreferences by japhy (Canon) on Sep 21, 2000 at 18:28 UTC
RE (tilly) 2 (still wrong): Regex'ing backreferences by tilly (Archbishop) on Sep 21, 2000 at 18:27 UTC
What about "foohellofoobarfoo"? You should be matching and putting "hello" into $1, but your new version won't.	[reply]
Be obvious (Re: Regex'ing backreferences) by tye (Sage) on Sep 21, 2000 at 19:41 UTC
Here is another alternative: `/foo(.?)(foo\|bar)/ && "bar" ne $2` [download] One functional difference between this and tilly's and japhy's correct solutions is that it can clobber regex variables when the first half matches but the total expression is false. Another is that you can't embed it inside a larger regex. Which leads me to the "obvious" solution: `/foo(.?)foo/ && $1 !~ /bar/` [download] And after benchmarking the results are: `Benchmark: running japhy, obvious, tilly, and tye, each for at least 3 CPU seconds (sorted fastest to slowest)... obvious: 3.13 CPU @ 77.96/s (n=244) tilly: 3.41 CPU @ 33.43/s (n=114) tye: 3.29 CPU @ 20.06/s (n=66) japhy: 3.02 CPU @ 10.60/s (n=32)` [download] So the most "optimized" is the slowest and the most obvious is the fastest. I think I know which one I'll use next time. :) Here is the benchmarking code: #!/usr/bin/perl -w use strict; use Benchmark; my @pat= qw( fooxfoo fooxbar xfooxfoo xfooxbar fooxfoox fooxbarx xfooxfoox xfooxbarx ); my @fill= ( "ba", "fo", "bf", "bafo", "babrar" ); @pat= map { my $fill= $_ x 1000; map { ( my $pat= $_ ) =~ s/x/$fill/g; + $pat } @pat } @fill; sub obvious { /foo(.?)foo/ && $1 !~ /bar/ and $1 } sub tye { /foo(.?)(bar\|foo)/ && "bar" ne $2 and $1 } sub tilly { /foo((?:(?!bar).))foo/ and $1 } sub japhy { m{ foo ( # save to $1 [^b] # 0 or more non-b characters (?: (?>b+) # 1 or more b's (NO BACKTRACKING!) (?: a(?!r) # an 'a' NOT followed by an 'r' \| # or [^a] # a non-a character ) [^b]* # 0 or more non-b characters )* # that whole b... subset, 0 or more times ) foo # and foo }x and $1 } sub checkthese { my( $sub, $hTests )= @_; my %res; for my $meth ( keys %$hTests ) { my $res= $sub->( $hTests->{$meth} ); for( keys %res ) { if( $res ne $res{$_} ) { warn "$meth and $_ disagree!\n"; } } $res{$meth}= $res; } } checkthese( sub { join " ", map { $_[0]->() } @pat }, { tilly => \&tilly, japhy => \&japhy, tye => \&tye, obvious => \&obvious, } ); timethese( -3, { obvious => sub { map obvious(), @pat }, tye => sub { map tye(), @pat }, tilly => sub { map tilly(), @pat }, japhy => sub { map japhy(), @pat }, } ); [download] - tye (but my friends call me "Tye")	[reply] [d/l] [select]