Pattern matching: Lazy vs. greedy

false_friend has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I am trying to do pattern matching, but can’t get the results I am looking for. Here is a very reduced example of what I am trying to do:

My search string is "The quick brown fox jumps over the lazy dog", and I am trying to match "the lazy dog". My problem is that I only have limited information about the snippet I am interested in: I only know the first word ("the") and the last word ("dog"). I tried to accomplish this with a simple lazy regular expression:

#!/usr/bin/perl -w

use strict;

my $string = "The quick brown fox jumps over the lazy dog";
$string =~ /(the .*? dog)/i;
print "Match: '", $1, "'";

But this gives me

Match: 'The quick brown fox jumps over the lazy dog'

instead of the desired

Match: 'the lazy dog'

Is there an elegant way of matching in the ‘laziest’ way (in the sense that a three-word match is lazier than matching the whole string)?

Thank you for your help,

Benedikt


Replies are listed 'Best First'.

Re: Pattern matching: Lazy vs. greedy
by Corion (Patriarch) on Mar 30, 2015 at 08:45 UTC

You can stick a greedy quantifier before your non-greedy match. That way the greedy quantifier will eat up as much as it can while the non-greedy part will still match. You seem to call "lazy" what the Perl documentation calls "non-greedy" in perlre.

#!/usr/bin/perl -w

use strict;

my $string = "The quick brown fox jumps over the lazy dog";
$string =~ /.*(the .*? dog)/i;
print "Match: '", $1, "'";

__END__
Match: 'the lazy dog'

Re^2: Pattern matching: Lazy vs. greedy

by false_friend (Novice) on Mar 30, 2015 at 09:14 UTC

Thank you very much!


Re: Pattern matching: Lazy vs. greedy
by Athanasius (Cardinal) on Mar 30, 2015 at 09:43 UTC

Hello false_friend, and welcome to the Monastery!

Corion and LanX have answered your specific question, but, in the more general case, you might find it useful to be able to capture all possible matches:

#! perl
use strict;
use warnings;
use Data::Dump;

my  $string  = 'The quick brown fox jumps over the house of the lazy dog';
my  @matches = $string =~ /(?=(the .*? dog))/gi;
dd \@matches;

Output:

19:36 >perl 1202_SoPW.pl
[
  "The quick brown fox jumps over the house of the lazy dog",
  "the house of the lazy dog",
  "the lazy dog",
]

19:37 >

You could then select the match(es) you want by greping @matches with suitable criteria. On the look-ahead assertion (?=...), see “Look-Around Assertions” in perlre#Extended-Patterns.
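As a rough sketch of such a selection step, the snippet below keeps only the shortest candidate(s). The "shortest wins" criterion and the variable names `$min_length` and `@shortest` are assumptions made for illustration; any other test could be used inside the `grep` block.

```perl
use strict;
use warnings;
use List::Util qw(min);

my $string = 'The quick brown fox jumps over the house of the lazy dog';

# Collect every overlapping candidate via the look-ahead pattern.
my @matches = $string =~ /(?=(the .*? dog))/gi;

# One possible criterion (an assumption): keep the shortest candidate(s).
my $min_length = min map { length $_ } @matches;
my @shortest   = grep { length($_) == $min_length } @matches;

print "Shortest: '$_'\n" for @shortest;
```

For the sample string this should print `Shortest: 'the lazy dog'`.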

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re^2: Pattern matching: Lazy vs. greedy

by false_friend (Novice) on Mar 30, 2015 at 12:10 UTC

my  @matches = $string =~ /(the .*? dog)/gi;

(?=

)

(?=)

http://perldoc.perl.org/perlre.html#Extended-Patterns


Re^3: Pattern matching: Lazy vs. greedy

by choroba (Cardinal) on Mar 30, 2015 at 12:51 UTC

/(?=(the .*? dog))/gi

zero length

overlapping

لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ


Re^4: Pattern matching: Lazy vs. greedy

by false_friend (Novice) on Mar 30, 2015 at 17:47 UTC

Re: Pattern matching: Lazy vs. greedy
by LanX (Saint) on Mar 30, 2015 at 08:52 UTC

You could reverse the string, the regex, and the match.

$string =~ /(god .*? eht)/i;
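Spelled out in full, the idea might look like the sketch below. The variable names `$reversed`, `$found` and `$match` are made up for illustration, and the sketch assumes the wanted snippet is the one ending at the last "dog" in the text.

```perl
use strict;
use warnings;

my $string = "The quick brown fox jumps over the lazy dog";

# Reverse the text so the search starts from its end.
my $reversed = reverse $string;

# In the reversed text "dog" reads "god" and "the" reads "eht".
my ($found) = $reversed =~ /(god .*? eht)/i;

# Turn the captured text back around before printing it.
my $match = defined $found ? scalar reverse($found) : '';
print "Match: '$match'\n";
```

For the sample string this should print `Match: 'the lazy dog'`.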

Cheers Rolf
_{(addicted to the Perl Programming Language and ☆☆☆☆ :)}

PS: Je suis Charlie!


Re^2: Pattern matching: Lazy vs. greedy

by false_friend (Novice) on Mar 30, 2015 at 09:15 UTC

Thank you also, Rolf. I’ll be excited to see which way is the fastest.


Re: Pattern matching: Lazy vs. greedy
by QM (Parson) on Mar 30, 2015 at 11:33 UTC

The black dog danced around the sleeping dog.

...and the endpoints of "the" and "dog", it seems you want the minimal coverage. Here is where you need some test cases to demonstrate what you will and won't accept.

The way the regex engine works, once it starts to match, say on "the", it will exhaust all options from that starting point before moving on to the next "the".

One example might be the string where the endpoints are not repeated inside the string. But the following doesn't work:

my $first = "the";
my $last = "dog";
my $string = "The black dog danced around the sleeping dog.";
my @matches = $string =~ m/\b($first\b(?!.*?$first.*?)\b$last)\b/g;

Athanasius

solution

my $first = "the";
my $last = "dog";

my @strings = ("The black dog danced around the sleeping dog.",
               "The brown bear leaped over the lazy dog.");

for my $string (@strings) {
    my @match = $string =~ m/(?=\b($first\b.*?\b$last)\b)/gi;

    for my $match (@match) {
        my @firsts = $match =~ m/\b($first)\b/gi;
        my @lasts = $match =~ m/\b($last)\b/gi;
        if ((@firsts == 1) and (@lasts == 1)) {
            print "$match\n";
        }
    }
}

# The black dog
# the sleeping dog
# the lazy dog
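The same filtering can probably also be expressed in a single pattern. The sketch below is not from the thread: it swaps in a per-character negative look-ahead (sometimes called a "tempered dot") so that a match may not run across a second occurrence of `$first`. It assumes `$first` and `$last` are plain words; anything containing regex metacharacters would need `\Q...\E`.

```perl
use strict;
use warnings;

my $first = "the";
my $last  = "dog";

my @strings = ("The black dog danced around the sleeping dog.",
               "The brown bear leaped over the lazy dog.");

my @found;
for my $string (@strings) {
    # (?:(?!\b$first\b).)*? consumes one character at a time, but only
    # while no new occurrence of $first starts at that position.
    push @found, $string =~ m/\b($first\b(?:(?!\b$first\b).)*?\b$last)\b/gi;
}

print "$_\n" for @found;
```

On these two strings it should print the same three lines as the loop above: `The black dog`, `the sleeping dog`, `the lazy dog`.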

-QM
--
Quantum Mechanics: The dreams stuff is made of


Re^2: Pattern matching: Lazy vs. greedy

by false_friend (Novice) on Mar 30, 2015 at 12:27 UTC

Dear QM, Thank you for your suggestion. In the specific case I am working on here, I can’t categorically rule out repetitions of the first word, but I’ll keep your solution in mind.


Re^3: Pattern matching: Lazy vs. greedy

by QM (Parson) on Mar 31, 2015 at 08:46 UTC

... I can’t categorically rule out repetitions of the first word, ...

Can you elaborate on the rules or goals you have in mind?

I would guess something like "shortest matching string" or "string with the smallest number of words" (for some value of $words). It's not necessarily easy to come up with this, but you should be able to list positive and negative examples to help tune the solution.

And most of us are just nerdy enough to want more specifics so we can solve it, or near enough. (Allowing the dreams of examples and counter-examples to be replaced once again by the more familiar nightmares of github DDOSs or Linus rants.)

-QM
--
Quantum Mechanics: The dreams stuff is made of


Re: Pattern matching: Lazy vs. greedy
by Anonymous Monk on Mar 30, 2015 at 13:51 UTC

A similar thread with answers that should also be helpful to you: help with lazy matching. It's also important to remember that the regex engine will always match as early as possible (left to right).
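A small sketch of that left-to-right behaviour, using the string from the original question. `$-[1]` (from the built-in `@-` array) holds the offset where capture group 1 begins; the names `$match` and `$start` are made up for illustration.

```perl
use strict;
use warnings;

my $string = "The quick brown fox jumps over the lazy dog";

my ($match, $start);
if ( $string =~ /(the .*? dog)/i ) {
    # $-[1] is the offset at which capture group 1 begins.
    ($match, $start) = ($1, $-[1]);
    print "Match starts at offset $start: '$match'\n";
}
```

The reported offset should be 0: the engine commits to the first "The" it can start from, and the non-greedy `.*?` only shortens the match from that starting point; it does not move the starting point to the right.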
