If I'm matching a pattern why does a + sign make things crazy?

SergioQ has asked for the wisdom of the Perl Monks concerning the following question:

Re: If I'm matching a pattern why does a + sign make things crazy?
by Corion (Patriarch) on May 05, 2020 at 08:42 UTC

Can you show us a short example program with input data that allows us to reproduce your situation?

Note that your regular expression is somewhat greedy and will bridge links if they appear in the same block of text:

my $text = <<'HTML';
<a href="https://www.example.com/1">Link 1</a> Some plain text <a href="https://www.example.com/2">Link 2</a>
HTML

If you show us the representative input data you're using and the code you're using and the output you get, then we can better advise how to best match it.
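
In the meantime, here is a minimal sketch of the bridging described above. The pattern is an assumption, since the original code and data have not been posted; it only shares the greedy .* shape:

```perl
use strict;
use warnings;

# Two links in one block of text, as in the sample above.
my $text = '<a href="https://www.example.com/1">Link 1</a> Some plain text <a href="https://www.example.com/2">Link 2</a>';

# Assumed pattern (not the OP's exact code): a greedy .* before </a>.
my ($greedy) = $text =~ m{(<a href="https://www\.example\.com/.*</a>)};

# Count how many closing tags ended up inside the single capture.
my $closing_tags = () = $greedy =~ m{</a>}g;

print "captured: $greedy\n";
print "closing tags in capture: $closing_tags\n";   # 2: one match spans both links
```

The single capture runs from the first opening tag to the last closing tag, plain text in between included.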

Re: If I'm matching a pattern why does a + sign make things crazy? (updated)
by haukex (Archbishop) on May 05, 2020 at 08:52 UTC

I doubt it has anything to do with a + character, more likely it's to do with the fact that .* is greedy and you've got two links on the same line, as Corion said.

Yes, I realize that Perl regex is not the way to go to find links.

Correct! Sorry, but in regards to "my method works": it's not really working, though. And even if you fix this one issue with this one input, the next problem will certainly pop up - see this node for how complex parsing HTML gets*. In this case, there's a very well established module for that, HTML::LinkExtor. Mojo::DOM also works:

use Mojo::DOM;
my $dom = Mojo::DOM->new(<<'END_HTML');
<a href="https://www.example.com/foodbanks/a">a</a> <a href="https://www.example.com/foodbanks/b">b</a>
<a href="https://www.example.com/foodbanks/c">c</a>
END_HTML
$dom->find('a[href]')->each(sub {
    print "$_ / ",$_->{href}," / ",$_->all_text,"\n";
});

Edited second paragraph.

* Update: Why a regex *really* isn't good enough for HTML and XML, even for "simple" tasks

Re: If I'm matching a pattern why does a + sign make things crazy?
by Athanasius (Cardinal) on May 06, 2020 at 13:56 UTC

Hello SergioQ,

p.s. The forum adds a red + to lines here that "wrap" so it makes my code look more confusing.

This is configurable, see:

“Code Listing Settings” in Display Settings
New code wrap options

In addition, text formatted as code¹ is displayed with a [download] link at its foot: clicking on this link displays just the code, with no wrapping, in a new window, making it easy to copy-and-paste accurately.

¹This applies when the code is inserted between <code> ... </code> tags on separate lines, but not to inlined code.

So, monks shouldn’t have any trouble distinguishing the + signs in your code from those added due to code wrapping.

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

Re: If I'm matching a pattern why does a + sign make things crazy?
by AnomalousMonk (Archbishop) on May 05, 2020 at 20:52 UTC

Corion and haukex have already referred to the likelihood that in your situation, adding a literal '+' to the match constrains the otherwise "greedy" .* match to stop with the first occurrence of the regex pattern. They have also recommended much more fundamentally robust approaches to solving your problem.

I'd love to know why.

WRT regex mechanics, I hope I can provide a detailed answer to your prayer. As already mentioned, this behavior can be demonstrated using any character (or, indeed, substring) as an explicit "anchor" for the match:

c:\@Work\Perl\monks>perl -wMstrict -le
"my $s = 'xxx xyzzyfooAbar yyy xyzzyzotBbar zzz';
 ;;
 my $match;
 ;;
 print qq{A: .*:   '$match'} if ($match) = $s =~ m{ (xyzzy .*    bar) }xms;
 print qq{B: .* A: '$match'} if ($match) = $s =~ m{ (xyzzy .*  A bar) }xms;
 print qq{C: .*?:  '$match'} if ($match) = $s =~ m{ (xyzzy .*?   bar) }xms;
"
A: .*:   'xyzzyfooAbar yyy xyzzyzotBbar'
B: .* A: 'xyzzyfooAbar'
C: .*?:  'xyzzyfooAbar'

In example A, the greedy .* match grabs as much as it can (to the end of the string in this case), but then the regex engine backtracks only as far as it must, which is to the last 'bar' substring in the string. Unfortunately, this gives you a bit more than you want, even without the /g modifier: the match starts at the leftmost possible position, and a greedy quantifier then makes it as long as it can.

In example B, .* still grabs as much as it can (to the end of the string), but then the regex engine backtracks until it can match an explicit 'A' substring. Then matching moves forward again to find the 'bar' substring.

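In example C, the "lazy" modifier ? of the .*? match means that it will match as little as possible to achieve an overall match with 'bar'. Instead of grabbing everything and backing off from the end, the engine extends the match one character at a time and stops at the first 'bar'.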
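
Applied to links, the same idea might look like the sketch below. The input line and both patterns are made up for illustration (they are not the OP's code), and neither is a substitute for a real HTML parser:

```perl
use strict;
use warnings;

# Hypothetical input: two links on the same line.
my $line = '<a href="https://www.example.com/foodbanks/a">a</a> <a href="https://www.example.com/foodbanks/b">b</a>';

# Lazy .*? stops at the first </a>; /g in list context returns every capture.
my @lazy = $line =~ m{(<a \s+ href="[^"]*"> .*? </a>)}xmsg;

# Alternative: a negated character class cannot run past the next '<' at all.
my @negated = $line =~ m{(<a \s+ href="[^"]*"> [^<]* </a>)}xmsg;

print scalar(@lazy), " lazy matches\n";
print "  $_\n" for @lazy;
print scalar(@negated), " negated-class matches\n";
print "  $_\n" for @negated;
```

Both variants return the two links separately here. The negated class is the stricter of the two, because it rules out crossing a tag boundary rather than merely preferring not to.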

Update: Corrected a couple of trivial spelling/formatting errors.

Give a man a fish: <%-{-{-{-<

Re: If I'm matching a pattern why does a + sign make things crazy?
by AnomalousMonk (Archbishop) on May 05, 2020 at 21:57 UTC

($mres) = ($res =~ m/(<a href=\"https:\/\/www.example.com\/foodbanks\/.*<\/a>)/im);

Also WRT regexes: In the code above quoted from the OP, note that each unescaped . (dot) metacharacter in www.example.com will match any single character (except a newline, unless the /s modifier is asserted), so e.g.,

c:\@Work\Perl\monks>perl -wMstrict -le
"print 'match' if 'wwwXexampleYcom' =~ /www.example.com/;
"
match
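
One way to make the dots literal is to escape them by hand, or to interpolate the string through \Q...\E. A minimal sketch, with variable names that are only illustrative:

```perl
use strict;
use warnings;

my $literal = 'www.example.com';
my %result;

for my $candidate ('www.example.com', 'wwwXexampleYcom') {
    $result{$candidate} = {
        # Unescaped dots: each one matches any single character.
        unescaped => ($candidate =~ /www.example.com/)   ? 1 : 0,
        # Dots escaped by hand: each one matches only a literal '.'.
        escaped   => ($candidate =~ /www\.example\.com/) ? 1 : 0,
        # \Q...\E quotes every metacharacter in the interpolated string.
        quoted    => ($candidate =~ /\Q$literal\E/)      ? 1 : 0,
    };
    my $r = $result{$candidate};
    print "$candidate: unescaped=$r->{unescaped} escaped=$r->{escaped} quoted=$r->{quoted}\n";
}
```

Only the unescaped pattern accepts 'wwwXexampleYcom'; all three accept the real host name.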

Give a man a fish: <%-{-{-{-<
