RESOLVED: The explanation is available in perlre in the section Repeated Patterns Matching a Zero-length Substring.
"Thus Perl allows such constructs, by forcefully breaking the infinite loop. The rules for this are different for lower-level loops given by the greedy quantifiers *+{}, and for higher-level ones like the /g modifier or split() operator. The lower-level loops are interrupted (that is, the loop is broken) when Perl detects that a repeated expression matched a zero-length substring."
Here Perl is preventing start of string from repeatedly matching in an infinite loop.
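The lower-level rule quoted above can be seen in a small test. This is only a minimal sketch: the string 'XYtest' and the variable names are made up for illustration and are not from my real program. When the zero-length alternative (^) is listed first, the first pass through the + loop matches the empty string, so Perl breaks the loop and never tries X or Y.

```perl
use strict;
use warnings;

# Hypothetical test string, chosen only for this illustration.
my $str = 'XYtest';

# Zero-length alternative first: ^ matches the empty string at position 0,
# Perl detects the zero-length iteration and stops repeating the group.
my ($zero_first) = $str =~ /((?:^|X|Y)+)/;

# Non-zero-length alternatives first: X and then Y are consumed; the third
# iteration fails at position 2, so the loop ends normally.
my ($nonzero_first) = $str =~ /((?:X|Y|^)+)/;

print "zero-length first:     '$zero_first'\n";     # prints ''
print "non-zero-length first: '$nonzero_first'\n";  # prints 'XY'
```

This seems to be why putting the alternatives that consume characters ahead of ^ gives the intended result.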
Please accept my apologies for posting this question without first ensuring that I was completely unable to resolve it myself. I shall be more conscientious in future.
I often encounter unexpected behavior in my regular expressions, but can usually explain it after due thought. Here is behavior that (after head scratching) still puzzles me.
UPDATE (simplified): Here is a simplified example that uses start of string, 'X', 'Y' or end of string as the boundaries around phrases:

    use warnings;
    use strict;

    my $str = 'XtestXYtest2';

    # Non-zero-length alternatives (X, Y) listed before the zero-length ones
    my $BRK = qr(X|Y|^|$)i;
    my @pieces = $str =~ /($BRK+)(.+?(?=$BRK))/gsp;
    push @pieces, ${^POSTMATCH} if ${^POSTMATCH};
    print "INTENDED: These were the phrases (and breaks) extracted:\n", join("\n", @pieces), "\n-------\n";

    # Zero-length alternative (^) listed first
    $BRK = qr(^|X|Y|$)i;
    @pieces = $str =~ /($BRK+)(.+?(?=$BRK))/gsp;
    push @pieces, ${^POSTMATCH} if ${^POSTMATCH};
    print "WRONG: These were the phrases (and breaks) extracted:\n", join("\n", @pieces), "\n";
I am still puzzled. $BRK+ can clearly match multiple times when different alternatives are matched. It seems there is something special about using start of string as an alternative.
FURTHER UPDATE: I think I am beginning to dimly understand what is going on. Consider what would happen if current position remained at start of string after matching it. It would keep matching repeatedly. Thus, there needs to be some kind of special handling when a zero length match is combined with a match quantifier. Either it needs to ignore the quantifier or handle the situation as a special case. Is this documented anywhere?
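The same concern applies to the higher-level /g loop that perlre mentions. As a minimal sketch (the string 'ab' and the pattern x* are hypothetical, picked only because x* can match the empty string everywhere), the loop below would never end if Perl let the match succeed again at the same position with zero length. Instead, each pass is forced to move on:

```perl
use strict;
use warnings;

# Hypothetical input; x* matches the empty string at every position in it.
my $str = 'ab';
my @positions;

# After a zero-length match, Perl refuses another zero-length match at the
# same position, so the next attempt has to succeed further along or fail.
while ($str =~ /x*/g) {
    push @positions, pos($str);
}

print "match end positions: @positions\n";  # prints 0 1 2
```

So there is one empty match before 'a', one before 'b', and one at end of string, and then the loop stops.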
Background: I want to split a string, with known limited html attributes, into pieces with alternately
I am puzzled as to why

    my $BRK = qr(^\s*|\s*<(?:/?(?:p|ul|ol|li)|br\s*/?)>\s*|\s*$)i;

behaves differently when the alternatives are in a different order, as in

    my $BRK = qr(\s*<(?:/?(?:p|ul|ol|li)|br\s*/?)>\s*|^\s*|\s*$)i;

Full example program:

    use warnings;
    use strict;
    use utf8;

    open my $logfh, ">:encoding(utf8)", q(Debug.log)
        or die "Cannot open LOG for writing: $!";

    # If using the RE below, $BRK+ will match start of string, but not (for example) <p> following
    my $BRK = qr(^\s*|\s*<(?:/?(?:p|ul|ol|li)|br\s*/?)>\s*|\s*$)i;
    # This one seems to do the right thing, but why the difference?
    #my $BRK = qr(\s*<(?:/?(?:p|ul|ol|li)|br\s*/?)>\s*|^\s*|\s*$)i;

    #my $str='<p><br>test<ul> <li><br />test2 with <b><i>bold italic</i></b> </ul> ';
    # Simple demo of the problem
    my $str = '<p>test';
    my @pieces = $str =~ /($BRK+)(\S.*?(?=$BRK))/gsp;
    push @pieces, ${^POSTMATCH} if ${^POSTMATCH};
    $logfh->print("These were the phrases (and breaks) extracted:\n", join("\n", @pieces), "\n");

I can see that having <p> match first helps, but why does it not match <p> immediately after matching start of string? Many thanks to whoever can point out what I am missing.