I am trying to use a regex to match lines that will always have an opening <div tag and could optionally have a closing </div tag on the same line. If the closing </div tag is present, additional code will be executed. Here is a sample that illustrates my problem:
#!/usr/bin/perl -w
use strict;

my $line = "<div id=\"roguebin-response-35911\" class=\"bin-response\"></div>";

if ($line =~ /<div.+?(<\/div)*/) {
    printf("line matched\n");
    if (defined($1)) {
        printf("right after match, 1 is defined\n");
    }
}
The output of running the above is:
line matched
I can't figure out why the closing div tag isn't being captured. I thought adding the non-greedy ? would prevent any closing div tags from getting consumed by the .+, but even with that addition the closing div tag isn't being captured.
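One way to see what the engine actually consumed is to print the whole match (`$&`) next to the capture group. The following is only a diagnostic sketch on the same sample line; the variable names `$matched` and `$group` are illustrative and not part of my original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = '<div id="roguebin-response-35911" class="bin-response"></div>';

# Same pattern as above: lazy .+? followed by an optional, unanchored group.
my ($matched, $group);
if ($line =~ /<div.+?(<\/div)*/) {
    $matched = $&;   # everything the whole pattern consumed
    $group   = $1;   # undef when the group matched zero times
}

printf("matched text: '%s'\n", $matched);
printf("group 1 is %s\n", defined($group) ? "defined" : "undefined");
```

This prints `matched text: '<div '` and `group 1 is undefined`: the `.+?` takes a single character (the space), the group then matches zero times at that position, and the overall match is already complete.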
EDIT: After more searching I found this SO thread which describes the same basic problem I have: https://stackoverflow.com/questions/28782603/regex-optional-capturing-group
After reviewing the example in that thread, I modified my original code to the following, which does work:
#!/usr/bin/perl -w
use strict;

my $line = "<div id=\"roguebin-response-35911\" class=\"bin-response\"></div>";

if ($line =~ /<div.+?(<\/div>)*$/) {
    printf("line matched\n");
    if (defined($1)) {
        printf("right after match, 1 is defined\n");
    }
}
I had to add the $ anchor at the end and also the closing > to the optional div capture group. I still don't quite understand how the regex engine is parsing this regex, however:
1. Why is it necessary to add the '>' in order for the capture group to work?
2. If I replace the '*' at the end of the optional capture group with a '?' (non-greedy qualifier), the group is still captured. Are '*' and '?' equivalent when applied to a group?
3. If I omit the '$' from the above regex, the optional div is not captured. The referenced SO thread says this regarding why the regex without the '$' fails to capture the optional group ('cat' changed to 'div' to be consistent with my code):
The reason that you do not get an optional div after a reluctantly-qualified .+? is that it is both optional and non-anchored: the engine is not forced to make that match, because it can legally treat the div as the "tail" of the .+? sequence.
My question is: generally speaking, how does Perl handle the case in which an optional or non-greedy match (.+? in this case) is followed by another optional or non-greedy match ((<\/div>)* in this case)? Does it always prefer to use more characters for one match (i.e. act greedy) rather than make additional matches (when those matches are optional)?
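To make the comparison concrete, here is a small sketch that runs each variant against the sample line and reports the whole match and `$1`. The helper `probe`, the labels, and the hash `%group_for` are hypothetical names introduced only for this illustration. The last pattern, which uses an alternation instead of an optional group, is an alternative I am assuming would behave the same way here; it does not come from the SO thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = '<div id="roguebin-response-35911" class="bin-response"></div>';

# Hypothetical helper: returns (whole match, group 1) or an empty list.
sub probe {
    my ($re, $text) = @_;
    return unless $text =~ $re;
    my ($whole, $group) = ($&, $1);   # copy before any other match runs
    return ($whole, $group);
}

my @patterns = (
    [ unanchored     => qr/<div.+?(<\/div)*/         ],  # original attempt
    [ anchored_no_gt => qr/<div.+?(<\/div)*$/        ],  # '$' added, no '>'
    [ anchored_star  => qr/<div.+?(<\/div>)*$/       ],  # working version
    [ anchored_opt   => qr/<div.+?(<\/div>)?$/       ],  # '?' instead of '*'
    [ alternation    => qr/<div.+?(?:(<\/div>)|$)/   ],  # assumed alternative
);

my (%whole_for, %group_for);
for my $entry (@patterns) {
    my ($label, $re) = @$entry;
    my ($whole, $group) = probe($re, $line);
    $whole_for{$label} = $whole;
    $group_for{$label} = $group;
    printf("%-15s matched '%s', \$1 = %s\n",
        $label, $whole, defined $group ? "'$group'" : 'undef');
}
```

On this line, `unanchored` matches only `'<div '` with `$1` undefined. `anchored_no_gt` matches the whole line but leaves `$1` undefined, because `</div` is never immediately followed by the end of the string, so the `.+?` has to swallow the tag. The remaining three match the whole line and set `$1` to `'</div>'`. As far as I can tell, the engine is not preferring "more characters" as such: it tries the shortest `.+?` first and accepts the first state in which the rest of the pattern succeeds, and an optional group matching zero times counts as success unless something after it (such as `$`) forces more work.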
In reply to problem with optional capture group by Special_K