in reply to Re: problem with optional capture group
in thread problem with optional capture group

The problem with using the + quantifier is that the entire regex will not match a line that has an opening <div tag but no closing </div tag on the same line. For example, your modification will not match the following:

my $line = "<div id=\"roguebin-response-35911\" class=\"bin-response\">";

I was hoping to write a single regex that handles both cases, i.e. an opening <div tag with a closing </div tag on the same line, and an opening <div tag with no closing </div tag on the same line. I understand that a proper HTML parser would make this task easier, but I would still like to understand how to write a single regex that captures both of these cases.

Re^3: problem with optional capture group
by AnomalousMonk (Archbishop) on Dec 22, 2020 at 22:57 UTC

    Win8 Strawberry 5.8.9.5 (32) Tue 12/22/2020 16:43:09
    C:\@Work\Perl\monks
    >perl -Mstrict -Mwarnings
    for my $line (
        '<div id="foo-bar-321" class="bin-boff"></div>',
        '<div id="foo-bar-321" class="bin-boff"> </div>',
        '<div id="foo-bar-321" class="bin-boff">foo</div>',
        '<div id="foo-bar-321" class="bin-boff"> foo </div>',
        '<div id="foo-bar-321" class="bin-boff">',
        '<div id="foo-bar-321" class="bin-boff"> ',
        '<div id="foo-bar-321" class="bin-boff">foo',
        '<div id="foo-bar-321" class="bin-boff"> foo',
        ) {
        if ($line =~ m{ <div (?: (?! </div) .)+ (</div)? }xms) {
            print "line matched \n   '$&' \n";
            if (defined $1) {
                print "   right after match, \$1 is defined '$1' \n";
                }
            }
        }
    ^Z
    line matched
       '<div id="foo-bar-321" class="bin-boff"></div'
       right after match, $1 is defined '</div'
    line matched
       '<div id="foo-bar-321" class="bin-boff"> </div'
       right after match, $1 is defined '</div'
    line matched
       '<div id="foo-bar-321" class="bin-boff">foo</div'
       right after match, $1 is defined '</div'
    line matched
       '<div id="foo-bar-321" class="bin-boff"> foo </div'
       right after match, $1 is defined '</div'
    line matched
       '<div id="foo-bar-321" class="bin-boff">'
    line matched
       '<div id="foo-bar-321" class="bin-boff"> '
    line matched
       '<div id="foo-bar-321" class="bin-boff">foo'
    line matched
       '<div id="foo-bar-321" class="bin-boff"> foo'


    Give a man a fish:  <%-{-{-{-<

      m{ <div (?: (?! </div) .)+ (</div)? }xms

      Can you please give a brief explanation regarding how the above regex works? It seems to use a few constructs I've never seen before and searching Google for regex symbols doesn't work very well. In particular, is enclosing a regex in 'm()', as you have done above, equivalent to enclosing it in '//'? What is the trailing xms doing?

        ... enclosing a regex in 'm()' ...

        The
           m open-delimiter pattern close-delimiter
        form is what I think of as the "canonical" form of the m// operator, where the delimiters can be any of a wide variety of characters, including the paired brackets {} () <> []. The // match form is a special case in which the leading m may be omitted. The qr// and s/// operators work likewise. This alleviates a lot of the escape-ology connected with the / character in regexes. See perlop. (Note that q// qq// qx// qw// tr/// y/// and maybe some others also use this delimiter convention.)
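To make the delimiter point concrete, here is a minimal sketch. The path string and variable names are made up for illustration. It matches the same pattern three ways and then uses paired braces with s///.

```perl
use strict;
use warnings;

# Illustrative input only; any string containing slashes would do.
my $path = '/usr/local/bin/perl';

# With the default / delimiters, every literal slash must be escaped.
my $slashes = $path =~ /^\/usr\/local\/bin\// ? 1 : 0;

# With m{...} or m(...), the slashes need no escaping.
my $braces = $path =~ m{^/usr/local/bin/} ? 1 : 0;
my $parens = $path =~ m(^/usr/local/bin/) ? 1 : 0;

# The same freedom applies to s///, here written with paired braces.
(my $moved = $path) =~ s{^/usr/local}{/opt};

print "slashes=$slashes braces=$braces parens=$parens moved=$moved\n";
```

All three matches succeed on the same input, so the choice of delimiter comes down to readability.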

        What is the trailing xms doing?

        I use the /ms modifiers as part of a standard "tail" on all my qr// m// s/// expressions to give the . ^ $ operators a standard, fixed behavior: with /s, . matches any character including a newline, and with /m, ^ and $ match at the start and end of each line within the string rather than only at the ends of the string. This eliminates some degrees of freedom in regex behavior and makes regexes slightly easier to understand. The /x modifier in the standard tail enables the use of whitespace to help clarify a regex. See Modifiers in perlre.
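A small sketch of what each modifier changes may help here. The two-line test string and the variable names are assumptions chosen for illustration, not taken from the thread.

```perl
use strict;
use warnings;

# Illustrative two-line string.
my $text = "first line\nsecond line\n";

# Without /s, . does not match a newline, so .+ stops at the end of line one.
my ($no_s) = $text =~ /(.+)/;
# With /s, . matches any character, newline included.
my ($with_s) = $text =~ /(.+)/s;

# Without /m, ^ matches only at the very start of the string.
my $no_m = () = $text =~ /^(\w+)/g;
# With /m, ^ also matches just after each embedded newline.
my $with_m = () = $text =~ /^(\w+)/mg;

# With /x, whitespace and comments inside the pattern are ignored.
my ($second) = $text =~ m{
    ^ second \s+   # start of a line, the word, then whitespace
    (\w+)          # capture the next word
}xms;

print "no_m=$no_m with_m=$with_m second=$second\n";
```

Because /x ignores literal whitespace in the pattern, a space that must be matched has to be written as \s, [ ] or an escaped space.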

        (?: (?! </div) .)+

        This has already been covered by GrandFather here. This expression just steps forward grabbing one character after another as long as the (?!...) negative lookahead expression, a closing div tag fragment in this case, does not match at the current position. A bit slow perhaps, but effective and flexible (update: flexible in that the lookahead expression can be of any complexity). See Lookaround Assertions in perlre; see also perlretut, perlreref and perlrequick.
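To see exactly what this group consumes, the sketch below wraps it in an extra capture. The helper name consumed and the sample lines are hypothetical, added only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text consumed by the stepping group
# that follows '<div', or undef if the pattern does not match.
sub consumed {
    my ($line) = @_;
    return $line =~ m{ <div ( (?: (?! </div) . )+ ) }xms ? $1 : undef;
}

# Stops just before the closing tag fragment.
my $closed = consumed('<div id="a">foo</div> trailing');

# With no closing tag fragment, it runs to the end of the string.
my $open = consumed('<div id="a">foo');

# Other closing tags such as </b are stepped over; only </div stops it.
my $nested = consumed('<div><b>x</b></div>');

print "closed='$closed' open='$open' nested='$nested'\n";
```

The lookahead is tested before every single character, which is where the "a bit slow" remark comes from.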

        (</div)?

        Optionally capture a literal character sequence if it is present. The capture variable $1 (in this case) will hold the captured sequence if it was present, otherwise $1 will be undefined. See perlre, etc., as above.
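As a sketch of how the two cases from the original question can be told apart with this optional capture, the hypothetical helper div_state below (a name invented here) classifies a line. Note that $1 is only inspected after a successful match, since a failed match leaves the capture variables from an earlier successful match untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a line as 'closed' (opening and closing
# div fragments present), 'open' (opening only), or 'none' (no <div).
sub div_state {
    my ($line) = @_;
    if ($line =~ m{ <div (?: (?! </div) .)+ (</div)? }xms) {
        # $1 is defined only if the optional group took part in the match.
        return defined $1 ? 'closed' : 'open';
    }
    return 'none';
}

for my $line ('<div id="x">foo</div>', '<div id="x">', '<p>hi</p>') {
    print div_state($line), "\n";
}
```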

