Special_K has asked for the wisdom of the Perl Monks concerning the following question:

I am having difficulty creating a lazy match. Here is the code I have:


#!/tool/bin/perl -w
use strict;

my $test_filename = "/user1/lazy_match_test.txt";
my $match = "";

open(TEST_FILE, $test_filename) || die("ERROR: unable to open $test_filename for read, exiting...\n");
while (<TEST_FILE>) {
    if ($_ =~ /\/(.+?)$/) {
        $match = $1;
    }
}
close(TEST_FILE);
printf("matched $match\n");

Here are the contents of the test file:


/foo/bar/baz/bat

I am trying to write a pattern that captures what follows the last slash in the line.

My understanding is that the ? operator is used to tell perl that the previous pattern should be matched lazily, yet when I run the above code, $match is foo/bar/baz/bat, when I wanted bat.

How do I correctly implement the lazy operator? I know that in the above example there are other ways to capture bat without using a lazy operator, but for the sake of the discussion I would like to learn how the lazy operator works and why it isn't working in this case.

I also tried the following for my regex:

if ($_ =~ /\/(.+)?$/)

But it also matched foo/bar/baz/bat

Replies are listed 'Best First'.
Re: help with lazy matching
by Laurent_R (Canon) on Jan 05, 2015 at 22:57 UTC
    As already mentioned, regexes work from left to right, and the regex engine will not backtrack once it succeeds. Your "lazy matching" would work if you wanted to get only the first part of the string, but here, when it gets to the end of the string, it has succeeded and has no reason to take only the end part.

    In theory, you could get around that and use a non-greedy quantifier by first reversing the string and then reversing the result, with this:

    $ perl -E 'my $c = reverse "/foo/bar/baz/bat"; say "matched ", scalar reverse ($1) if $c =~ m{(.+?)/};'
    matched bat
    But that's a bit ugly and unnatural.

    The alternative is to forget about the non-greedy quantifier for such a case and use a character class, as in the old days when there was no non-greedy quantifier:

    $ perl -E 'my $c = "/foo/bar/baz/bat"; say "matched $1" if $c =~ m{/(\w+)$};'
    matched bat
    or:
    $ perl -E 'my $c = "/foo/bar/baz/bat"; say "matched $1" if $c =~ m{/([^/]+)$};'
    matched bat
    Please also note how I used { } as regex delimiters (there are many others that you can use, such as [], (), ##, §§, etc.; most characters that are not letters or digits will do), so that I did not have to escape the /.

    It was not really mandatory in such a simple example, but it often makes life easier when you have slashes in your regex.
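
    As a minimal sketch of that point, here is the same character-class match written with three different delimiters. The variable names are only illustrative and not taken from the original post:

```perl
use strict;
use warnings;

my $path = "/foo/bar/baz/bat";

# Same pattern, three delimiter choices. Only the first one has to
# escape the slashes, because "/" is its delimiter.
my ($with_slashes) = $path =~ /\/([^\/]+)$/;
my ($with_braces)  = $path =~ m{/([^/]+)$};
my ($with_hashes)  = $path =~ m#/([^/]+)$#;

print "$with_slashes $with_braces $with_hashes\n";   # bat bat bat
```

    The match operator in list context returns the captured groups, which is why each result is assigned inside parentheses.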

    Update: modified the first code snippet which did not reverse the result. Thanks to AnomalousMonk for pointing out the mistake.

Re: help with lazy matching
by Anonymous Monk on Jan 05, 2015 at 21:56 UTC

    Think about the regex from left to right. It will match on the first slash, then you tell it to match any characters, and then it must match end-of-string/line. So from the regex engine's point of view, it's completed the match - indeed, you can see this if you run this from the command line: "perl -Mre=debug -wMstrict -le '"/foo/bar/baz/bat"=~/\/(.+?)$/; print $1'". The quickest fix I can think of off the top of my head is to change your dot (.) to [^\/].

    The ? would be applicable in the case when your regex wasn't anchored to the end of the string, for example:

    $ perl -wMstrict -le '"/foo/bar/baz/bat"=~/\/(.+)\//; print $1'
    foo/bar/baz
    $ perl -wMstrict -le '"/foo/bar/baz/bat"=~/\/(.+?)\//; print $1'
    foo

    Also: Is /foo/bar/baz/bat supposed to be a filename? Because if yes, I would really strongly recommend that you use fileparse from File::Basename; there are a few other possible modules but this one is in the core so it should always be available. For example:

    use File::Basename 'fileparse';
    my $filename = fileparse("/foo/bar/baz/bat");
    print "$filename\n";
    __END__
    bat

    And by the way, I think the ? is more commonly referred to as making the expression "non-greedy".

      ++ on this comment, except that I don't think you need to escape the slash inside the negated character class. You can just write [^/] .

      Also (for the original post), you can leave out the '$_ =~' from your if statement if you want. Since there is no explicit variable in your while loop test, the if statement will match $_ by default.
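
      A short sketch of that shorthand, using an in-memory list instead of the file from the original post. The list contents and variable names are made up for illustration, and the strings deliberately have no trailing newline:

```perl
use strict;
use warnings;

my @lines = ("/foo/bar/baz/bat", "/usr/local/bin/perl");
my @last_parts;

for (@lines) {
    # No "$_ =~" needed: a bare match operator tests $_ by default.
    push @last_parts, $1 if m{/([^/]+)$};
}

print "$_\n" for @last_parts;   # prints "bat", then "perl"
```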

      --Nick

        You need to escape the forward-slash only if this character is used as the regex delimiter character.

        c:\@Work\Perl\monks>perl -wMstrict -le "$_ = '/foo/bar/baz/bat'; print qq{'$1'} if /([^/]+)$/; "
        Unmatched [ in regex; marked by <-- HERE in m/([ <-- HERE ^/ at ...
        c:\@Work\Perl\monks>perl -wMstrict -le "$_ = '/foo/bar/baz/bat'; print qq{'$1'} if m{([^/]+)$}; "
        'bat'


        Give a man a fish:  <%-(-(-(-<

      I guess my thinking was that with a non-greedy modifier, my regular expression could use the slash before "bat" to match the slash, then it would match "bat" as the .+, and then finally it would match the end of line character in the file as the $.

      Why does it not work that way?

        The non-greedy modifier simply means "match as little as possible while still getting a successful match". Regex matches in Perl (and in Perl Compatible Regular Expressions) always start at the leftmost position where a match is possible; in your case, that is the first slash. Where the non-greedy operator might have worked, for example, is if you wanted to match only 'foo'. Then you could write:

        if ( /\/(.+?)\// )

        This will match the first slash, then non-greedily match any other characters until another slash is reached. If you didn't use the non-greedy modifier here, you would match everything between the first and last slash (i.e. 'foo/bar/baz').

        --Nick

        I like the description in the Camel:

        ... regular expressions will try to match as early as possible. This even takes precedence over being greedy. Since scanning happens left to right, the pattern will match as far left as possible, even if there is some other place where it could match longer. (Regular expressions may be greedy, but they aren’t into delayed gratification.) ...

        (copied from the free sample material on the O'Reilly website, http://cdn.oreillystatic.com/oreilly/booksamplers/9780596004927_sampler.pdf, book page 44)

        Another key thing to realize is that the $ does not change the behavior to scanning from right-to-left.
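
        One way to check this is to print where each match starts and ends. The sketch below uses the built-in @- and @+ arrays (offsets of the start and end of the last successful match); the helper name show_span is hypothetical and exists only for this example:

```perl
use strict;
use warnings;

my $str = "/foo/bar/baz/bat";

# Hypothetical helper: returns "start-end" offsets of the whole match.
sub show_span {
    my ($re) = @_;
    return $str =~ $re ? join("-", $-[0], $+[0]) : "no match";
}

my $greedy = show_span(qr{/(.+)$});      # greedy dot
my $lazy   = show_span(qr{/(.+?)$});     # non-greedy dot
my $class  = show_span(qr{/([^/]+)$});   # negated character class

print "greedy: $greedy\n";   # greedy: 0-16
print "lazy:   $lazy\n";     # lazy:   0-16
print "class:  $class\n";    # class:  12-16
```

        Both dot versions start at offset 0, the first slash, even though the pattern ends in $. Only the character class forces the engine to give up on the earlier slashes and start at offset 12.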

        Why does it not work that way?

        The regex metacharacter dot (.) means "match any character" (except newline by default, or including newline with the /s modifier).

        It starts to match after the first / is matched, and it matches all subsequent / characters as well.

        This is a FAQ, but one that is hard to search for :)

        use re 'debug'; and watch it work

        use rxrx and watch it work

Re: help with lazy matching
by Anonymous Monk on Jan 05, 2015 at 21:38 UTC
    $ perl -e "use Path::Tiny; print path( q{ro/sham/bo} )->basename "
    $ perl -Mre=debug -e " $_ = q{ro/sham/bo} ; print m{/([^/]+?)$} "
    Compiling REx "/([^/]+?)$"
    Final program:
       1: EXACT </> (3)
       3: OPEN1 (5)
       5: MINMOD (6)
       6: PLUS (18)
       7: ANYOF[\x00-.0-\xff][{unicode_all}] (0)
      18: CLOSE1 (20)
      20: EOL (21)
      21: END (0)
    anchored "/" at 0 floating ""$ at 2..2147483647 (checking anchored) minlen 2
    Guessing start of match in sv for REx "/([^/]+?)$" against "ro/sham/bo"
    Found anchored substr "/" at offset 2...
    Found floating substr ""$ at offset 10...
    Starting position does not contradict /^/m...
    Guessed: match at offset 2
    Matching REx "/([^/]+?)$" against "/sham/bo"
       2 <ro> </sham/bo> | 1:EXACT </>(3)
       3 <ro/> <sham/bo> | 3:OPEN1(5)
       3 <ro/> <sham/bo> | 5:MINMOD(6)
       3 <ro/> <sham/bo> | 6:PLUS(18)
                           ANYOF[\x00-.0-\xff][{unicode_all}] can match 1 times out of 1...
       4 <ro/s> <ham/bo> | 18: CLOSE1(20)
       4 <ro/s> <ham/bo> | 20: EOL(21)
                           failed...
                           ANYOF[\x00-.0-\xff][{unicode_all}] can match 1 times out of 1...
       5 <ro/sh> <am/bo> | 18: CLOSE1(20)
       5 <ro/sh> <am/bo> | 20: EOL(21)
                           failed...
                           ANYOF[\x00-.0-\xff][{unicode_all}] can match 1 times out of 1...
       6 <ro/sha> <m/bo> | 18: CLOSE1(20)
       6 <ro/sha> <m/bo> | 20: EOL(21)
                           failed...
                           ANYOF[\x00-.0-\xff][{unicode_all}] can match 1 times out of 1...
       7 <ro/sham> </bo> | 18: CLOSE1(20)
       7 <ro/sham> </bo> | 20: EOL(21)
                           failed...
                           ANYOF[\x00-.0-\xff][{unicode_all}] can match 0 times out of 1...
                           failed...
       7 <ro/sham> </bo> | 1:EXACT </>(3)
       8 <ro/sham/> <bo> | 3:OPEN1(5)
       8 <ro/sham/> <bo> | 5:MINMOD(6)
       8 <ro/sham/> <bo> | 6:PLUS(18)
                           ANYOF[\x00-.0-\xff][{unicode_all}] can match 1 times out of 1...
       9 <ro/sham/b> <o> | 18: CLOSE1(20)
       9 <ro/sham/b> <o> | 20: EOL(21)
                           failed...
                           ANYOF[\x00-.0-\xff][{unicode_all}] can match 1 times out of 1...
      10 <ro/sham/bo> <> | 18: CLOSE1(20)
      10 <ro/sham/bo> <> | 20: EOL(21)
      10 <ro/sham/bo> <> | 21: END(0)
    Match successful!
    boFreeing REx: "/([^/]+?)$"
    $

Re: help with lazy matching
by Anonymous Monk on Jan 05, 2015 at 22:33 UTC
    for the sake of the discussion I would like to learn how the lazy operator works and why it isn't working in this case

    Conceptually (disregarding optimizations, for instance), this is how it works:

    m{ / (.+?) $ }x # for readability
    It will try the string character by character. So, first of all, it will try to match 'forward slash'. That will match
    / # matched so far

    Then, the expression .+? is really the same as ..*? So the regex engine will try to match any character except newline (for the first dot). That will match

    /f # matched so far

    Then, it will come to a choice. .*? means '0 or more, preferring as few as possible'. First of all, the engine will save its state. And it will try to match nothing for .*? That will match (an empty match always succeeds).

    /f # matched so far; decision point is saved

    Then, it will try to match 'dollar' - end of string or just before the newline at the end. That will fail, because the end of the string won't be reached yet.

    Then, it will backtrack - the engine will load the previous 'saved state' and will try the other decision in an attempt to match. It will try to match something (rather than the empty string). That will match.

    /fo # matched so far
    Then, it will save its state and try to match nothing, which will be successful.
    /fo # matched so far, decision point saved
    Then it will try to match the end of line again. In case of failure, it will reload the previous state and try to match something instead
    /foo #matched so far
    The engine will keep doing that, alternating between decisions, until it reaches the end of the line.

    Considerable optimizations are possible here, as you might have noticed (and Perl's engine is heavily optimized). But, in principle, this is how it should work for the kind of regex engine that Perl uses.
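
    The steps above can be imitated by hand. The following is only a rough model of that description, assuming a string without newlines; the function name lazy_match and its bookkeeping are invented for this sketch and say nothing about how perl's real engine is implemented:

```perl
use strict;
use warnings;

# Hand-rolled imitation of m{/(.+?)$} on a string without newlines.
sub lazy_match {
    my ($str) = @_;
    my $len = length $str;
    my @tried;                               # captures tested against "$"

    for my $start (0 .. $len - 1) {          # leftmost start position first
        next unless substr($str, $start, 1) eq '/';

        # ".+?": take one character, then grow one character at a time.
        for my $take (1 .. $len - $start - 1) {
            my $capture = substr($str, $start + 1, $take);
            push @tried, $capture;
            # "$": succeed only if nothing is left after the capture.
            return ($capture, \@tried) if $start + 1 + $take == $len;
        }
    }
    return (undef, \@tried);
}

my ($capture, $tried) = lazy_match("/foo/bar/baz/bat");
print "captured: $capture\n";                 # captured: foo/bar/baz/bat
print "first attempts: @{$tried}[0 .. 2]\n";  # first attempts: f fo foo
```

    The inner loop never gets a chance to start from a later slash, because growing the capture from the first slash eventually reaches the end of the string and succeeds. That is why the lazy quantifier still ends up capturing everything after the first slash.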