
I am trying to use a regex to match lines that will always have an opening <div tag and could optionally have a closing </div tag on the same line. If the closing </div tag is present, additional code will be executed. Here is a sample that illustrates my problem:

#!/usr/bin/perl -w
use strict;

my $line = "<div id=\"roguebin-response-35911\" class=\"bin-response\"></div>";

if ($line =~ /<div.+?(<\/div)*/)
{
    printf("line matched\n");

    if (defined($1))
    {
        printf("right after match, 1 is defined\n");
    }
}

The output of running the above is:

line matched

I can't figure out why the closing div tag isn't being captured. I thought adding the non-greedy ? would prevent any closing div tags from getting consumed by the .+, but even with that addition the closing div tag isn't being captured.
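One way to see what happened here is to print the text the pattern actually consumed, not just whether it matched. The sketch below assumes the same sample line; the `probe` helper is a made-up name for illustration. It uses the built-in `@-` and `@+` offset arrays to recover the matched text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = '<div id="roguebin-response-35911" class="bin-response"></div>';

# Hypothetical helper: returns (matched text, capture 1), or an empty list
# if the pattern does not match. Capture 1 is undef when the group took
# part in no iteration.
sub probe {
    my ($re, $text) = @_;
    return unless $text =~ $re;
    my $matched  = substr($text, $-[0], $+[0] - $-[0]);
    my $captured = $1;
    return ($matched, $captured);
}

my ($matched, $captured) = probe(qr/<div.+?(<\/div)*/, $line);
printf("matched text: '%s'\n", $matched);
printf("capture is %s\n", defined($captured) ? "defined" : "undefined");
```

The match ends right after the space that follows `<div`: the non-greedy `.+?` is satisfied with a single character, and at that position `(<\/div)*` can only succeed by matching zero times, so the overall match is complete before the engine ever reaches the closing tag.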

EDIT: After more searching I found this SO thread which describes the same basic problem I have: https://stackoverflow.com/questions/28782603/regex-optional-capturing-group
After reviewing the example in that thread, I modified my original code to the following, which does work:

#!/usr/bin/perl -w
use strict;

my $line = "<div id=\"roguebin-response-35911\" class=\"bin-response\"></div>";

if ($line =~ /<div.+?(<\/div>)*$/)
{

    printf("line matched\n");

    if (defined($1))
    {
        printf("right after match, 1 is defined\n");
    }
}

I had to add the $ anchor at the end and also the closing > to the optional div capture group. I still don't quite understand how the regex engine is parsing this regex, however:

1. Why is it necessary to add the '>' in order for the capture group to work?

2. If I replace the '*' at the end of the optional capture group with a '?' (non-greedy quantifier), the group is still captured. Are '*' and '?' equivalent when applied to a group?

3. If I omit the '$' from the above regex, the optional div is not captured. The referenced SO thread says this regarding why the regex without the '$' fails to capture the optional group ('cat' changed to 'div' to be consistent with my code):

The reason that you do not get an optional div after a reluctantly-qualified .+? is that it is both optional and non-anchored: the engine is not forced to make that match, because it can legally treat the div as the "tail" of the .+? sequence.
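The three questions above can be checked side by side with a small experiment. This is only a sketch: the `capture_of` helper and the variant labels are made up for illustration, and the sample line is the one from the post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = '<div id="roguebin-response-35911" class="bin-response"></div>';

# Hypothetical helper: returns the text captured by group 1, "(undef)" if
# the group took part in no iteration, or "(no match)" if the pattern failed.
sub capture_of {
    my ($re, $text) = @_;
    return '(no match)' unless $text =~ $re;
    return defined($1) ? $1 : '(undef)';
}

my @variants = (
    [ 'working version' => qr/<div.+?(<\/div>)*$/ ],
    [ 'no > in group'   => qr/<div.+?(<\/div)*$/  ],
    [ '? in place of *' => qr/<div.+?(<\/div>)?$/ ],
    [ 'unanchored'      => qr/<div.+?(<\/div>)*/  ],
);

for my $variant (@variants) {
    my ($label, $re) = @$variant;
    printf("%-16s %s\n", $label, capture_of($re, $line));
}

# After a group, '?' is the zero-or-one quantifier, not the non-greedy
# modifier, so it differs from '*' as soon as the group could repeat.
my $two = '</div></div>';
printf("* on two tags:   %s\n", $two =~ /^(?:<\/div>)*$/ ? 'match' : 'no match');
printf("? on two tags:   %s\n", $two =~ /^(?:<\/div>)?$/ ? 'match' : 'no match');
```

Without the `>` in the group, the `$` anchor cannot succeed right after `</div` because a `>` still follows, so the engine backs off to zero iterations and lets `.+?` run to the end of the line. A `?` directly after a group means "zero or one time"; it only means "non-greedy" when it follows another quantifier, as in `.+?`.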

My question is: generally speaking, how does Perl handle the case in which an optional or non-greedy match (.+? in this case) is followed by another optional or non-greedy match ((<\/div>)* in this case)? Does it always prefer to use more characters for one match (i.e. act greedy) rather than make additional matches (when those matches are optional)?
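As far as the general rule goes, Perl's engine backtracks and stops at the first overall success: each quantifier tries its preferred count first (as few as possible for `.+?`, as many as possible for `(...)*`) and changes it only when something later in the pattern fails. So it does not prefer longer matches; `.+?` grows only when an anchor or a required element forces it to. One possible way to avoid depending on `$` is to stop the filler from running across a closing tag, so that a greedy optional group gets the first chance at it. The following is a sketch of that idea, not the only approach; `classify` and its return labels are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: reports which tags were found on a line.
sub classify {
    my ($text) = @_;

    # (?:(?!<\/div).)+  consumes characters one at a time, but refuses to
    #                   step onto the start of a closing tag.
    # (<\/div>)?        then takes the closing tag if it is there.
    return 'no match' unless $text =~ /<div(?:(?!<\/div).)+(<\/div>)?/;
    return defined($1) ? 'open and close' : 'open only';
}

for my $line (
    '<div id="roguebin-response-35911" class="bin-response"></div>',
    '<div id="roguebin-response-35911" class="bin-response">',
    'no tags here',
) {
    printf("%-16s %s\n", classify($line), $line);
}
```

Because the filler can never consume `</div`, the optional group is the only part of the pattern that can match the closing tag, and being greedy it takes it whenever it is present.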

In reply to problem with optional capture group by Special_K
