false_friend has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I am trying to do pattern matching, but can’t get the results I am looking for. Here is a very reduced example of what I am trying to do:

My search string is "The quick brown fox jumps over the lazy dog", and I am trying to match "the lazy dog". My problem is that I have only limited information about the snippet I am interested in: I only know the first word ("the") and the last word ("dog"). I tried to accomplish this with a simple lazy regular expression:

#!/usr/bin/perl -w
use strict;
my $string = "The quick brown fox jumps over the lazy dog";
$string =~ /(the .*? dog)/i;
print "Match: '", $1, "'";

But this gives me

Match: 'The quick brown fox jumps over the lazy dog'

instead of the desired

Match: 'the lazy dog'

Is there an elegant way of matching in the ‘laziest’ way (in the sense that a three-word match is lazier than matching the whole string)?

Thank you for your help,

Benedikt

Replies are listed 'Best First'.
Re: Pattern matching: Lazy vs. greedy
by Corion (Patriarch) on Mar 30, 2015 at 08:45 UTC

    You can stick a greedy quantifier before your non-greedy match. That way the greedy quantifier will eat up as much as it can while the non-greedy part will still match. You seem to call "lazy" what the Perl documentation calls "non-greedy" in perlre.

    #!/usr/bin/perl -w
    use strict;
    my $string = "The quick brown fox jumps over the lazy dog";
    $string =~ /.*(the .*? dog)/i;
    print "Match: '", $1, "'";
    __END__
    Match: 'the lazy dog'
      Thank you very much!
Re: Pattern matching: Lazy vs. greedy
by Athanasius (Archbishop) on Mar 30, 2015 at 09:43 UTC

    Hello false_friend, and welcome to the Monastery!

    Corion and LanX have answered your specific question, but, in the more general case, you might find it useful to be able to capture all possible matches:

    #! perl
    use strict;
    use warnings;
    use Data::Dump;
    my $string = 'The quick brown fox jumps over the house of the lazy dog';
    my @matches = $string =~ /(?=(the .*? dog))/gi;
    dd \@matches;

    Output:

    19:36 >perl 1202_SoPW.pl
    [
      "The quick brown fox jumps over the house of the lazy dog",
      "the house of the lazy dog",
      "the lazy dog",
    ]
    19:37 >

    You could then select the match(es) you want by filtering @matches with grep and suitable criteria. On the look-ahead assertion (?=...), see “Look-Around Assertions” in perlre#Extended-Patterns.
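As a rough sketch of what such a selection might look like: the two criteria below (exactly one "the" in the candidate, or simply the shortest candidate) are assumptions for illustration, not something stated in the thread, and the variable names @single and $shortest are made up here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = 'The quick brown fox jumps over the house of the lazy dog';

# Collect every candidate, one per position where "the " can start.
my @matches = $string =~ /(?=(the .*? dog))/gi;

# Criterion 1 (assumed): keep candidates containing "the" exactly once.
my @single = grep { my $n = () = /\bthe\b/gi; $n == 1 } @matches;

# Criterion 2 (assumed): take the shortest candidate by length.
my ($shortest) = sort { length($a) <=> length($b) } @matches;

print "Single 'the': '$_'\n" for @single;
print "Shortest: '$shortest'\n";
```

For this input both criteria select 'the lazy dog'. On other inputs they can disagree, so which one is right depends on what you actually need.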

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      Hello Athanasius, Thank you for your answer. I tried something like
      my @matches = $string =~ /(the .*? dog)/gi;
      before, which is without the (?= before the regex and the ) after it. This didn’t get me anywhere. I thought I was familiar with look-around assertions, but to be honest, I don’t get what the (?=) is doing here. I thought it is only used as a modification to something preceding it. I consulted http://perldoc.perl.org/perlre.html#Extended-Patterns but could not find why the two variants give different results. Could you explain that to me?
        The point of using /(?=(the .*? dog))/gi here is that look-around assertions are zero-length: after one matches at position P, the search for the next match resumes at position P + 1 rather than at the end of the captured text, so you can find overlapping matches as well.
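To make the difference between the two variants concrete, here is a small side-by-side sketch using the same string as above. The array names @plain and @overlapping are only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = 'The quick brown fox jumps over the house of the lazy dog';

# Plain capture: the first match consumes the whole string, so the
# next search starts at the end and finds nothing more.
my @plain = $string =~ /(the .*? dog)/gi;

# Look-ahead: each match has zero length, so the engine retries one
# character further along and can report overlapping candidates.
my @overlapping = $string =~ /(?=(the .*? dog))/gi;

printf "plain:       %d match(es)\n", scalar @plain;
printf "overlapping: %d match(es)\n", scalar @overlapping;
```

The plain version reports one match (the whole string), while the look-ahead version reports three, one for each "the".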
        لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: Pattern matching: Lazy vs. greedy
by LanX (Saint) on Mar 30, 2015 at 08:52 UTC
    The problem is that you want to anchor on "dog", not "the".

    You could reverse the string, the regex, and the match.

    $string =~ /(god .*? eht)/i;
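Spelled out in full, the reversal idea might look like the sketch below. The variable names $reversed and $match are made up here, and the approach assumes you want the last "dog" in the string together with the nearest "the" before it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = 'The quick brown fox jumps over the lazy dog';

# Reverse the text so that "dog" becomes the left-hand anchor ("god").
# Assigning to a scalar puts reverse in scalar context (character reversal).
my $reversed = reverse $string;

my $match;
if ($reversed =~ /(god .*? eht)/i) {
    # Reverse the capture again to restore normal reading order.
    $match = reverse $1;
}

print "Match: '$match'\n" if defined $match;
```

Because the engine still matches leftmost-first, on the reversed text that means "rightmost dog first" in the original.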

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)

    PS: Je suis Charlie!

      Thank you also, Rolf. I’ll be excited to see which way is the fastest.
Re: Pattern matching: Lazy vs. greedy
by QM (Parson) on Mar 30, 2015 at 11:33 UTC
    Your problem statement doesn't say which "the" and which "dog" you are interested in. For instance, given the input:
    The black dog danced around the sleeping dog.

    ...and the endpoints of "the" and "dog", it seems you want the minimal coverage. Here is where you need some test cases to demonstrate what you will and won't accept.

    The way the regex engine works, if it starts to match, say on "the", it will exhaust all options before moving on to the next "the".

    One example might be the string where the endpoints are not repeated inside the string. But the following doesn't work:

    my $first = "the";
    my $last = "dog";
    my $string = "The black dog danced around the sleeping dog.";
    my @matches = $string =~ m/\b($first\b(?!.*?$first.*?)\b$last)\b/g;
    There doesn't seem to be a good way to say "I don't want $first anywhere in this part", except to do another match. Combine this with Athanasius's solution:
    my $first = "the";
    my $last = "dog";
    my @strings = ("The black dog danced around the sleeping dog.",
                   "The brown bear leaped over the lazy dog.");
    for my $string (@strings) {
        my @match = $string =~ m/(?=\b($first\b.*?\b$last)\b)/gi;
        for my $match (@match) {
            my @firsts = $match =~ m/\b($first)\b/gi;
            my @lasts = $match =~ m/\b($last)\b/gi;
            if ((@firsts == 1) and (@lasts == 1)) {
                print "$match\n";
            }
        }
    }
    # The black dog
    # the sleeping dog
    # the lazy dog
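For comparison, one commonly used idiom for "anything except $first" inside a single pattern is to guard each consumed character with a negative look-ahead. This is an alternative sketch, not the method used above. It assumes $first and $last are plain words with no regex metacharacters, and the names $re and @found are only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $first = 'the';
my $last  = 'dog';
my @strings = (
    'The black dog danced around the sleeping dog.',
    'The brown bear leaped over the lazy dog.',
);

# (?:(?!\b$first\b).)*? consumes one character at a time, but only
# where another $first does not begin at that position.
my $re = qr/\b($first\b(?:(?!\b$first\b).)*?\b$last)\b/i;

my @found;
for my $string (@strings) {
    push @found, $string =~ /$re/g;
}
print "$_\n" for @found;
```

On these two test strings it prints the same three lines as the two-pass version. Whether it is faster or easier to read is a separate question.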

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

      Dear QM, Thank you for your suggestion. In the specific case I am working on here, I can’t categorically rule out repetitions of the first word, but I’ll keep your solution in mind.
        ... I can’t categorically rule out repetitions of the first word, ...

        Can you elaborate on the rules or goals you have in mind?

        I would guess something like "shortest matching string" or "string with the smallest number of words" (for some value of $words). It's not necessarily easy to come up with this, but you should be able to list positive and negative examples to help tune the solution.

        And most of us are just nerdy enough to want more specifics so we can solve it, or near enough. (Allowing the dreams of examples and counter-examples to be replaced once again by the more familiar nightmares of github DDOSs or Linus rants.)

        -QM
        --
        Quantum Mechanics: The dreams stuff is made of

Re: Pattern matching: Lazy vs. greedy
by Anonymous Monk on Mar 30, 2015 at 13:51 UTC

    A similar thread with answers that should also be helpful to you: help with lazy matching. It's also important to remember that the regex engine will always match as early as possible (left to right).
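A tiny sketch of that last point, using the string from the original question: the @- array (start offsets of the captures) shows where the match began. The variables $start and $text are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = 'The quick brown fox jumps over the lazy dog';

# Non-greedy .*? only shortens the match from its starting point;
# the starting point itself is still the leftmost possible one.
my ($start, $text);
if ($string =~ /(the .*? dog)/i) {
    ($start, $text) = ($-[1], $1);
    print "Starts at offset $start: '$text'\n";
}
```

The match starts at offset 0, on the capitalised "The", which is why the whole sentence is captured even though the quantifier is non-greedy.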