Regex: succeeds, but parens don't collect...

Wiggins has asked for the wisdom of the Perl Monks concerning the following question:

I have come to an empirical conclusion, and am seeking (?confirmation|judgement). The following is a minimal case. I was looking to regex match 2 fields, plus an optional 3rd from a string and extract the possibly 3 matches with paren collection.

My conclusion is that 'optional' success ('?'="zero or one") is not a 'match' as far as '(...)' collection is concerned:

$m = ' more gibberish"h" \n URL="http://[10.0.0.3]?id=80943lkjh875kjrvf09u548gfpi"\n gibber\n';

$u= qr/(:?URL="([^"]*)")?/is; # optional clause (:?xxx)?

if ($m =~ $u) {
    print "<$1><$2>\n";
}else{ print "no match u\n";}

results in

[tmp]> perl testResp2.pl 
<><>
[tmp]>

This result is the same, even if I edit "URL=..." to "URxL=...".
The evaluation is 'successful' but there was really no 'match' (or a match happened 0 times which is valid). Can I force a match to happen here 1 time (while keeping it optional)?
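
One way to see what happened is to ask Perl where the match landed. This is only a minimal sketch using the pattern exactly as posted; the names `$matched`, `$start`, `$end` and `$cap1` are mine, for illustration.

```perl
use strict;
use warnings;

# Same target and pattern as above (single quotes keep \n as two literal characters).
my $m = ' more gibberish"h" \n URL="http://[10.0.0.3]?id=80943lkjh875kjrvf09u548gfpi"\n gibber\n';
my $u = qr/(:?URL="([^"]*)")?/is;

my $matched = ($m =~ $u) ? 1 : 0;

# @- and @+ hold the start and end offsets of the last successful match.
my ($start, $end) = ($-[0], $+[0]);
my $cap1 = $1;    # copy the capture before anything else can reset it

printf "matched: %d, from offset %d to %d\n", $matched, $start, $end;
print "\$1 is ", (defined($cap1) ? "'$cap1'" : 'undef'), "\n";
```

This reports a successful match running from offset 0 to offset 0, with `$1` undefined: the optional group matched zero times at the very first position, so the engine never had a reason to look further along the string.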

------------- Update-----------

So, these optional expressions "have no legs". They are tested at the current pos() in the target where they are found. In my case that is at the 1st space character at the start of the string.
In order to force the evaluation to move down the target, I tried prepending .*? or .* and finally settled on

$u= qr/.*(?=U)(?:URL="([^"]*)")?/s; # moves to just before 'U'

But, of course, this is too contrived to be useful. There could be a 'U' anywhere that matches the first part.
This probably generalizes to *, and {0,n} as well.
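
A quick way to check that guess is to run the same clause under each quantifier. This is a sketch under assumptions: the target string `$text` is made up, and `%span` is just a scratch hash for the results.

```perl
use strict;
use warnings;

my $text = 'xxx URL="a" yyy';   # hypothetical target; the URL clause is not at offset 0

my %span;
for my $quant ('?', '*', '{0,3}') {
    # Every one of these quantifiers allows zero repetitions.
    my $re = qr/(?:URL="([^"]*)")$quant/;
    if ($text =~ $re) {
        # Record where the match started and ended, and whether $1 was set.
        $span{$quant} = [ $-[0], $+[0], defined($1) ? 1 : 0 ];
        printf "quantifier %s: matched at %d..%d, \$1 %s\n",
            $quant, $-[0], $+[0], (defined($1) ? 'set' : 'unset');
    }
}
```

All three succeed with an empty match at offset 0 and leave `$1` unset, which is consistent with the idea that any quantifier permitting zero repetitions behaves this way when nothing else in the pattern forces the engine forward.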

Thanks all.

-------------Update 2-----------------
The following will find what I want, ignoring random 'U's.

$m = ' more gib U berish"h" \n URL="http://[10.0.0.3]?id=80943lkjh875kjrvf09u548gfpi"\n gibber\n';

$u= qr/(?:.*(?=U)(?:URL="([^"]*)")?)*/s; #lookahead to 'U'

if ($m =~ $u) {
    print "<$1>\n";
}else{ print "no match u\n";}

This positions to 'U' and tries the match; if it fails, it moves on. The reason I am stuck on this is that my original regex would become

$p= qr/RESPONSE\sid="([^"]+?)"
       .*                    #random XML
       disposition="([^"]+?)"
       .*                    #random XML
       (?:.*(?=U)(?:URL="([^"]+?)")?)*
      /sx;

This will extract the response id ($1), the disposition ($2) (one of delete|hold|fetch), and an optional URL ($3, present only with fetch) from a large message string.
And it is an intriguing puzzle.
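
For comparison, a commonly used alternative is to put a lazy `.*?` inside the optional group, so the engine itself walks forward to a `URL="..."` clause if one exists and otherwise lets the group match nothing. This is not from the thread; it is a sketch under assumptions. The helper name `parse_response` and both sample messages are invented, and only the last clause of the pattern differs from the one above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (id, disposition, url); url is undef when absent.
sub parse_response {
    my ($msg) = @_;
    my $p = qr/RESPONSE\sid="([^"]+?)"
               .*                        # random XML
               disposition="([^"]+?)"
               (?: .*? URL="([^"]+?)" )? # optional: skip ahead lazily to a URL, if any
              /sx;
    return $msg =~ $p ? ($1, $2, $3) : ();
}

# Made-up sample messages, not taken from a real feed.
my @with_url = parse_response('<RESPONSE id="r1" x="y" disposition="fetch" z="1" URL="http://example.com/a">');
my @no_url   = parse_response('<RESPONSE id="r2" disposition="hold" Unrelated="U">');

print join(', ', map { defined($_) ? $_ : 'undef' } @with_url), "\n";
print join(', ', map { defined($_) ? $_ : 'undef' } @no_url), "\n";
```

The first call yields the id, the disposition and the URL; the second yields the id and disposition with an undefined third value, and the stray 'U' characters do not matter. One caveat: if a message holds several responses, the lazy skip will happily pick up a URL that belongs to a later one, so the pattern may need a tighter boundary in real data.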


Replies are listed 'Best First'.
Re: Regex: succeeds, but parens don't collect... by kyle (Abbot) on Aug 19, 2008 at 14:18 UTC
This is a little suspicious: `/(:?aaa)?/`

...because `/(?:aaa)?/` would be non-capturing parentheses. What you have instead is a capturing parenthesis expression that begins with an optional colon. Is that what you meant?

As to your question, your entire expression is optional (i.e., `/(aaa)?/`), so it doesn't really require Perl to match anything. In my little fooling around, I didn't find a way to get it to find the optional string while still keeping it optional, but there probably is a way.
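
The difference between the two spellings shows up in how many captures come back. A small sketch, assuming a made-up target `$text` that starts with the clause so that both patterns actually consume it:

```perl
use strict;
use warnings;

my $text = 'URL="x"';   # made-up target that starts with the clause

# As posted: '(' opens capture group 1, and ':?' is an optional literal colon.
my @typo  = $text =~ /(:?URL="([^"]*)")?/;

# Corrected: '(?:' opens a non-capturing group, leaving one capture group.
my @fixed = $text =~ /(?:URL="([^"]*)")?/;

print scalar(@typo),  " captures as posted: @typo\n";
print scalar(@fixed), " capture when corrected: @fixed\n";
```

As posted, the pattern returns two captures (the whole clause, then the URL); with the `?:` order corrected it returns one, so the URL moves from `$2` to `$1`.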
Re: Regex: succeeds, but parens don't collect... by jhourcle (Prior) on Aug 19, 2008 at 14:30 UTC
> My conclusion is that 'optional' success ('?'="zero or one") is not a 'match' as far as '(...)' collection is concerned

The problem is that it _is_ a match -- your regex will match nothing, so it's always going to succeed; therefore, the test will always be true for any string.

> Can I force a match to happen here 1 time (while keeping it optional)?

I'm not sure exactly what you're trying to do -- if you remove the trailing question mark, then when the `URL=...` isn't there, it'll go to the 'else' block, and you can handle whatever you need to for the fact that it didn't match.

... although, I will admit that I'm surprised it didn't capture the text -- I had assumed ? was greedy, as with + or * ... maybe someone else knows why this isn't the case here.

update: and the answer to my question was hidden in anonymous monk's response:

)? end of \1 (NOTE: because you're using a quantifier on this capture, only the LAST repetition of the captured pattern will be stored in \1)
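
The suggestion to drop the trailing question mark can be sketched like this. The helper name `find_url` and the two sample strings are assumptions for illustration, and the typo'd `(:?` is left out so that only the URL is captured.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the URL, or undef when the clause is absent.
sub find_url {
    my ($text) = @_;
    # No trailing '?': the clause must match somewhere, so the engine
    # scans forward through the string instead of settling for nothing.
    return $text =~ /URL="([^"]*)"/ ? $1 : undef;
}

my $with    = find_url(' gibberish URL="http://[10.0.0.3]?id=1" more');
my $without = find_url(' gibberish URxL="http://[10.0.0.3]?id=1" more');

print defined($with)    ? "found <$with>\n"    : "no URL\n";
print defined($without) ? "found <$without>\n" : "no URL\n";
```

The first call finds the URL even though it is not at the start of the string; the second takes the "no URL" branch, which is where the optional case can be handled in ordinary code rather than inside the regex.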
Re: Regex: succeeds, but parens don't collect... by JavaFan (Canon) on Aug 19, 2008 at 14:29 UTC
What you have is an outer group (with a typo, but that doesn't matter: even if you switch the : and ?, you still get the same result), and you tell Perl to match that group zero or one times. Such a regexp will always match, and always at the beginning of the string. In your case, it matches the empty string. So, the optional group still matches, but it doesn't match what you think it will match. Why you have the outer (?: )? construct anyway isn't clear to me.
Re: Regex: succeeds, but parens don't collect... by Anonymous Monk on Aug 19, 2008 at 14:17 UTC
use re 'debug';
$m = ' more gibberish"h" \n URL="http://[10.0.0.3]?id=80943lkjh875kjrvf09u548gfpi"\n gibber\n';

$u= qr/(:?URL="([^"]*)")?/is; # optional clause (:?xxx)?

if ($m =~ $u) {
    print "<$1><$2>\n";
}else{ print "no match u\n";}
__END__
Compiling REx `(:?URL="([^"]*)")?'
size 34 Got 276 bytes for offset annotations.
first at 1
   1: CURLYX[0] {0,1}(33)
   3:   OPEN1(5)
   5:     CURLY {0,1}(9)
   7:       EXACTF <:>(0)
   9:     EXACTF <URL=">(12)
  12:     OPEN2(14)
  14:       STAR(26)
  15:         ANYOF[\0-!#-\377{unicode_all}](0)
  26:     CLOSE2(28)
  28:     EXACTF <">(30)
  30:   CLOSE1(32)
  32:   WHILEM(0)
  33: NOTHING(34)
  34: END(0)
minlen 0
Offsets: [34]
        18[1] 0[0] 1[1] 0[0] 3[1] 0[0] 2[1] 0[0] 4[5] 0[0] 0[0] 9[1] 0[0] 14[1] 10[4] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 15[1] 0[0] 16[1] 0[0] 17[1] 0[0] 18[0] 18[0] 19[0]
Matching REx `(:?URL="([^"]*)")?' against ` more gibberish"h" \n URL="http://[10.0.0.3]?id=80943lkjh875...'
  Setting an EVAL scope, savestack=3
   0 <> < more gibber>    |  1:  CURLYX[0] {0,1}
   0 <> < more gibber>    | 32:    WHILEM
                              0 out of 0..1  cc=140fc18
  Setting an EVAL scope, savestack=9
   0 <> < more gibber>    |  3:      OPEN1
   0 <> < more gibber>    |  5:      CURLY {0,1}
                  EXACTF <:> can match 0 times out of 1...
  Setting an EVAL scope, savestack=9
                                failed...
     restoring \1..\2 to undef
                              failed, try continuation...
   0 <> < more gibber>    | 33:      NOTHING
   0 <> < more gibber>    | 34:      END
Match successful!
Freeing REx: `"(:?URL=\"([^\"]*)\")?"'

use YAPE::Regex::Explain;
$u= qr/(:?URL="([^"]*)")?/is; # optional clause (:?xxx)?
print YAPE::Regex::Explain->new($u)->explain;
__END__
The regular expression:

(?is-mx:(:?URL="([^"]*)")?)

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?is-mx:                 group, but do not capture (case-insensitive)
                         (with . matching \n) (with ^ and $ matching
                         normally) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  (                        group and capture to \1 (optional
                           (matching the most amount possible)):
----------------------------------------------------------------------
    :?                       ':' (optional (matching the most amount
                             possible))
----------------------------------------------------------------------
    URL="                    'URL="'
----------------------------------------------------------------------
    (                        group and capture to \2:
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )                        end of \2
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
  )?                       end of \1 (NOTE: because you're using a
                           quantifier on this capture, only the LAST
                           repetition of the captured pattern will be
                           stored in \1)
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
Re: Regex: succeeds, but parens don't collect... by Bloodnok (Vicar) on Aug 19, 2008 at 15:03 UTC
Unless I'm missing something so obvious I can't see it, given the aforementioned (Re: Regex: succeeds, but parens don't collect...) typo, `$2` will always be empty since you have only one set of capturing parens in the RE - capturing just the URI.

A user level that continues to overstate my experience :-))