Special_K has asked for the wisdom of the Perl Monks concerning the following question:
I am trying to use a regex to match lines that will always have an opening <div tag and could optionally have a closing </div tag on the same line. If the closing </div tag is present, additional code will be executed. Here is a sample that illustrates my problem:
    #!/usr/bin/perl -w
    use strict;

    my $line = "<div id=\"roguebin-response-35911\" class=\"bin-response\"></div>";

    if ($line =~ /<div.+?(<\/div)*/) {
        printf("line matched\n");
        if (defined($1)) {
            printf("right after match, 1 is defined\n");
        }
    }
The output of running the above is:
line matched
I can't figure out why the closing div tag isn't being captured. I thought adding the non-greedy ? would prevent any closing div tags from getting consumed by the .+, but even with that addition the closing div tag isn't being captured.
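To see what the pattern above actually consumed, it may help to print the whole matched text next to the capture. The following is a minimal sketch, not part of my original script: the helper name `probe` is made up for illustration, and it relies only on the built-in `$&` and `$1` variables.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the overall matched text and the contents
# of capture group 1 (undef if the group did not take part in the match).
# Returns an empty list if the pattern does not match at all.
sub probe {
    my ($line, $re) = @_;
    return unless $line =~ $re;
    my ($all, $cap) = ($&, $1);   # copy before anything else can reset them
    return ($all, $cap);
}

my $line = '<div id="roguebin-response-35911" class="bin-response"></div>';
my ($matched, $captured) = probe($line, qr/<div.+?(<\/div)*/);

printf "matched:  '%s'\n", $matched;
printf "captured: %s\n", defined $captured ? "'$captured'" : 'undef';
```

On this input the overall match should be just `<div ` (the tag name plus one character): the lazy `.+?` stops after a single character, and the starred group is then allowed to match zero times at that position, so `$1` stays undefined.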
EDIT: After more searching I found this SO thread which describes the same basic problem I have: https://stackoverflow.com/questions/28782603/regex-optional-capturing-group
After reviewing the example in that thread, I modified my original code to the following, which does work:
    #!/usr/bin/perl -w
    use strict;

    my $line = "<div id=\"roguebin-response-35911\" class=\"bin-response\"></div>";

    if ($line =~ /<div.+?(<\/div>)*$/) {
        printf("line matched\n");
        if (defined($1)) {
            printf("right after match, 1 is defined\n");
        }
    }
I had to add the $ anchor at the end and also the closing > to the optional div capture group. I still don't quite understand how the regex engine is parsing this regex, however:
1. Why is it necessary to add the '>' in order for the capture group to work?
2. If I replace the '*' at the end of the optional capture group with a '?' (non-greedy qualifier), the group is still captured. Are '*' and '?' equivalent when applied to a group?
3. If I omit the '$' from the above regex, the optional div is not captured. The referenced SO thread says this regarding why the regex without the '$' fails to capture the optional group ('cat' changed to 'div' to be consistent with my code):
The reason that you do not get an optional div after a reluctantly-qualified .+? is that it is both optional and non-anchored: the engine is not forced to make that match, because it can legally treat the div as the "tail" of the .+? sequence.
My question is: generally speaking, how does Perl handle the case in which an optional or non-greedy match (.+? in this case) is followed by another optional or non-greedy match ((<\/div>)* in this case)? Does it always prefer to use more characters for one match (i.e. act greedy) rather than make additional matches (when those matches are optional)?
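The three numbered questions can be probed side by side by running each variant of the pattern against the same line. This is only a sketch under my own assumptions: the helper name `capture_for` and the labels are invented for illustration, and the string 'undef' is returned purely for display.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: reports what group 1 captured for a given pattern,
# or the string 'undef' if the group did not participate in the match.
sub capture_for {
    my ($line, $re) = @_;
    return 'no match' unless $line =~ $re;
    return defined $1 ? $1 : 'undef';
}

my $line = '<div id="roguebin-response-35911" class="bin-response"></div>';

my @variants = (
    [ 'no anchor, with >, *' => qr/<div.+?(<\/div>)*/  ],  # question 3
    [ 'anchor, without >, *' => qr/<div.+?(<\/div)*$/  ],  # question 1
    [ 'anchor, with >, *'    => qr/<div.+?(<\/div>)*$/ ],  # the working version
    [ 'anchor, with >, ?'    => qr/<div.+?(<\/div>)?$/ ],  # question 2
);

for my $variant (@variants) {
    my ($label, $re) = @$variant;
    printf "%-22s \$1 = %s\n", $label, capture_for($line, $re);
}
```

On this input only the last two variants should report `</div>`. Without the anchor, `.+?` is satisfied after one character and the group matches zero times. With the anchor but without the `>`, the group cannot be followed directly by end-of-line, so the engine backtracks and lets `.+?` run to the end instead. As far as I can tell, a `?` placed after a group is the zero-or-one quantifier rather than the non-greedy modifier, so it agrees with `*` here only because there is at most one closing tag to match.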
Replies are listed 'Best First'.

Re: problem with optional capture group
by davido (Cardinal) on Dec 22, 2020 at 17:20 UTC

Re: problem with optional capture group
by hippo (Archbishop) on Dec 22, 2020 at 17:31 UTC
by Special_K (Pilgrim) on Dec 22, 2020 at 21:42 UTC
by AnomalousMonk (Archbishop) on Dec 22, 2020 at 22:57 UTC
by Special_K (Pilgrim) on Dec 23, 2020 at 16:31 UTC
by AnomalousMonk (Archbishop) on Dec 23, 2020 at 20:54 UTC

Re: problem with optional capture group
by choroba (Cardinal) on Dec 22, 2020 at 21:57 UTC

Re: problem with optional capture group
by GrandFather (Saint) on Dec 22, 2020 at 20:05 UTC