in reply to problem with optional capture group

While I entirely agree with davido's exhortation to use a proper parser for HTML, I will answer your question because it is (in this one instance) fairly trivial. The capture group does not match because you have used the asterisk as the quantifier after it. The asterisk matches zero or more instances, so the engine is satisfied with zero at the first position it tries and never needs to reach the closing tag, which leaves $1 undefined.

Here's your code with a few small tweaks and the key change of using the plus as the quantifier:

#!/usr/bin/perl
use strict;
use warnings;

my $line = '<div id="roguebin-response-35911" class="bin-response"></div>';
if ($line =~ /<div.+?(<\/div)+/) {
    print "line matched\n";
    if (defined $1) {
        print "right after match, 1 is defined\n";
    }
}

Similarly you don't really need a quantifier at all here because there is only one closing div in the string and one is the default quantity of anything in a regex.
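To see the three quantifier choices side by side, here is a minimal sketch. It assumes the original pattern was the one above with an asterisk in place of the plus, and capture_for is a hypothetical helper introduced only for this illustration.

```perl
use strict;
use warnings;

my $line = '<div id="roguebin-response-35911" class="bin-response"></div>';

# Hypothetical helper: returns $1 after a successful match, or undef if the
# pattern failed or the group did not take part in the match.
sub capture_for {
    my ($re, $text) = @_;
    return undef unless $text =~ $re;
    return $1;
}

my %capture = (
    star => capture_for(qr/<div.+?(<\/div)*/, $line),  # zero or more
    plus => capture_for(qr/<div.+?(<\/div)+/, $line),  # one or more
    none => capture_for(qr/<div.+?(<\/div)/,  $line),  # exactly one
);

for my $name (qw(star plus none)) {
    printf "%-4s => %s\n", $name,
        defined $capture{$name} ? "'$capture{$name}'" : 'undef';
}
```

Only the asterisk version should report undef; the other two force the lazy .+? to keep expanding until the closing tag is found.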

I've used print instead of printf because you are not doing any format conversion. I've removed some unnecessary brackets and have used single quotes to delimit the initial string so the internal double quotes no longer need escaping (and you aren't interpolating in this string either).

But seriously, use a parser.


🦛

Re^2: problem with optional capture group
by Special_K (Pilgrim) on Dec 22, 2020 at 21:42 UTC

    The problem with using the + quantifier is that the entire regex will not match a line that has an opening <div tag but no closing </div tag on the same line. For example, your modification will not match the following:

    my $line = "<div id=\"roguebin-response-35911\" class=\"bin-response\">";

    I was hoping to write a single regex that will handle both cases, i.e. an opening <div tag with a closing </div tag on the same line, and an opening <div tag with no closing </div tag on the same line. I understand a built in parser would make this task easier, but I would still like to understand how to write a single regex that would capture both of these cases.
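A small sketch of the two-case problem, for anyone following along. The hash names are invented for this illustration. It also tries the obvious next step, swapping the plus for a question mark, which lets both lines match but still never fills the capture, because the lazy .+? stops after one character.

```perl
use strict;
use warnings;

my %line = (
    closed => '<div id="roguebin-response-35911" class="bin-response"></div>',
    open   => '<div id="roguebin-response-35911" class="bin-response">',
);

my %result;
for my $kind (sort keys %line) {
    # '+' demands at least one '</div', so the open-only line fails outright.
    $result{plus}{$kind} = $line{$kind} =~ /<div.+?(<\/div)+/ ? 1 : 0;

    # '?' lets both lines match, but the group is tried right after the
    # single character taken by '.+?', where '</div' is not present.
    if ($line{$kind} =~ /<div.+?(<\/div)?/) {
        $result{optional}{$kind} =
            defined $1 ? "captured '$1'" : 'matched, $1 undefined';
    }
    else {
        $result{optional}{$kind} = 'no match';
    }

    print "$kind: plus matched=$result{plus}{$kind}; ",
          "optional: $result{optional}{$kind}\n";
}
```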

      Win8 Strawberry 5.8.9.5 (32) Tue 12/22/2020 16:43:09
      C:\@Work\Perl\monks
      >perl -Mstrict -Mwarnings
      for my $line (
          '<div id="foo-bar-321" class="bin-boff"></div>',
          '<div id="foo-bar-321" class="bin-boff"> </div>',
          '<div id="foo-bar-321" class="bin-boff">foo</div>',
          '<div id="foo-bar-321" class="bin-boff"> foo </div>',
          '<div id="foo-bar-321" class="bin-boff">',
          '<div id="foo-bar-321" class="bin-boff"> ',
          '<div id="foo-bar-321" class="bin-boff">foo',
          '<div id="foo-bar-321" class="bin-boff"> foo',
          ) {
          if ($line =~ m{ <div (?: (?! </div) .)+ (</div)? }xms) {
              print "line matched \n '$&' \n";
              if (defined $1) {
                  print " right after match, \$1 is defined '$1' \n";
                  }
              }
          }
      ^Z
      line matched
       '<div id="foo-bar-321" class="bin-boff"></div'
       right after match, $1 is defined '</div'
      line matched
       '<div id="foo-bar-321" class="bin-boff"> </div'
       right after match, $1 is defined '</div'
      line matched
       '<div id="foo-bar-321" class="bin-boff">foo</div'
       right after match, $1 is defined '</div'
      line matched
       '<div id="foo-bar-321" class="bin-boff"> foo </div'
       right after match, $1 is defined '</div'
      line matched
       '<div id="foo-bar-321" class="bin-boff">'
      line matched
       '<div id="foo-bar-321" class="bin-boff"> '
      line matched
       '<div id="foo-bar-321" class="bin-boff">foo'
      line matched
       '<div id="foo-bar-321" class="bin-boff"> foo'


      Give a man a fish:  <%-{-{-{-<

        m{ <div (?: (?! </div) .)+ (</div)? }xms)

        Can you please give a brief explanation regarding how the above regex works? It seems to use a few constructs I've never seen before and searching Google for regex symbols doesn't work very well. In particular, is enclosing a regex in 'm()', as you have done above, equivalent to enclosing it in '//'? What is the trailing xms doing?
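For reference, a commented sketch of that pattern; the variable names and the two-line test string are made up for this illustration. m{...} is the ordinary match operator with braces as delimiters instead of slashes, so the slash in </div needs no escaping. The trailing letters are modifiers: x ignores whitespace in the pattern and allows comments, s lets the dot match a newline, and m changes where ^ and $ match (it has no effect here, since the pattern uses neither anchor).

```perl
use strict;
use warnings;

my $re = qr{
    <div              # literal '<div'
    (?:               # non-capturing group, repeated one or more times:
        (?! </div )   #   negative lookahead: '</div' must not start here
        .             #   then consume one character
    )+
    ( </div )?        # capture group 1: an optional literal '</div'
}xms;

my $one_line = '<div id="foo-bar-321" class="bin-boff">foo</div>';
my ($commented) = $one_line =~ $re;
# The same pattern between slashes, without /x: the slash must be escaped.
my ($slashes)   = $one_line =~ /<div(?:(?!<\/div).)+(<\/div)?/ms;

# With a newline in the text, /s decides whether '.' can cross it.
my $two_lines = qq{<div id="foo-bar-321">\nfoo</div>};
my ($with_s)    = $two_lines =~ m{ <div (?: (?! </div) .)+ (</div)? }xms;
my ($without_s) = $two_lines =~ m{ <div (?: (?! </div) .)+ (</div)? }xm;

for my $row (
    [ 'commented form',  $commented ],
    [ 'slash delimited', $slashes   ],
    [ 'with /s',         $with_s    ],
    [ 'without /s',      $without_s ],
) {
    printf "%-16s %s\n", $row->[0],
        defined $row->[1] ? "'$row->[1]'" : 'undef';
}
```

The lookahead-plus-dot pair walks forward one character at a time and refuses to step onto the start of '</div', so when the optional group is finally tried it sits exactly where a closing tag would begin. That is why the capture is filled when the tag is present and left undefined when it is not.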