help with lazy matching

Special_K has asked for the wisdom of the Perl Monks concerning the following question:

Re: help with lazy matching by Laurent_R (Canon) on Jan 05, 2015 at 22:57 UTC
As already mentioned, regexes work from left to right, and the regex engine will not backtrack once it has succeeded. Your "lazy matching" would work if you wanted only the first part of the string, but here, once the engine gets to the end of the string, the match has succeeded and it has no reason to go back and take only the last part.

In theory, you could get around that and still use a non-greedy quantifier by first reversing the string and then reversing the result, like this:

`$ perl -E 'my $c = reverse "/foo/bar/baz/bat"; say "matched ", scalar reverse ($1) if $c =~ m{(.+?)/};'`
`matched bat`

But that's a bit ugly and unnatural. The alternative is to forget about non-greedy quantifiers for such a case and use a character class, as in the old days when there was no non-greedy quantifier:

`$ perl -E 'my $c = "/foo/bar/baz/bat"; say "matched $1" if $c =~ m{/(\w+)$};'`
`matched bat`

or:

`$ perl -E 'my $c = "/foo/bar/baz/bat"; say "matched $1" if $c =~ m{/([^/]+)$};'`
`matched bat`

Please also note how I used `{ }` as regex delimiters (there are many others that you can use, most characters that are not letters or digits, such as `[]`, `()`, `##`, `§§`, etc.) so that I did not have to escape the `/`. It was not really mandatory in such a simple example, but it often makes life easier when you have slashes in your regex.

Update: modified the first code snippet, which did not reverse the result. Thanks to AnomalousMonk for pointing out the mistake.
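For anyone who wants to compare the two approaches side by side, here is a minimal sketch. The sub names `last_by_reverse` and `last_by_class` are made up for illustration and do not come from the post above; both subs return `undef` when nothing matches.

```perl
use strict;
use warnings;

# Hypothetical helper names, chosen only for this sketch.

# Reverse the string, match lazily up to the first slash,
# then reverse the captured text back.
sub last_by_reverse {
    my ($path) = @_;
    my $rev = reverse $path;    # scalar context: reverses the characters
    return $rev =~ m{(.+?)/} ? scalar reverse($1) : undef;
}

# Negated character class anchored at the end of the string.
sub last_by_class {
    my ($path) = @_;
    return $path =~ m{/([^/]+)$} ? $1 : undef;
}

print last_by_reverse("/foo/bar/baz/bat"), "\n";
print last_by_class("/foo/bar/baz/bat"), "\n";
```

The character-class version states directly what is wanted ("a run of non-slashes at the end"), which is probably why it reads more naturally than the double reversal.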
Re: help with lazy matching by Anonymous Monk on Jan 05, 2015 at 21:56 UTC
Think about the regex from left to right. It will match on the first slash, then you tell it to match any characters, and then it must match end-of-string/line. So from the regex engine's point of view, it has completed the match. Indeed, you can see this if you run this from the command line: "`perl -Mre=debug -wMstrict -le '"/foo/bar/baz/bat"=~/\/(.+?)$/; print $1'`".

The quickest fix I can think of off the top of my head is to change your dot (`.`) to `[^\/]`. The `?` would be applicable in the case where your regex wasn't anchored to the end of the string, for example:

`$ perl -wMstrict -le '"/foo/bar/baz/bat"=~/\/(.+)\//; print $1'`
`foo/bar/baz`
`$ perl -wMstrict -le '"/foo/bar/baz/bat"=~/\/(.+?)\//; print $1'`
`foo`

Also: Is `/foo/bar/baz/bat` supposed to be a filename? Because if yes, I would really strongly recommend that you use `fileparse` from File::Basename; there are a few other possible modules, but this one is in the core so it should always be available. For example:

`use File::Basename 'fileparse'; my $filename = fileparse("/foo/bar/baz/bat"); print "$filename\n"; __END__ bat`

And by the way, I think the `?` is more commonly referred to as making the expression "non-greedy".
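If the string really is a path, `fileparse` can also split off the directory and a suffix when called in list context. A small sketch, assuming a hypothetical `.txt` file name that is not part of the original question; the suffix pattern is the one used in the File::Basename documentation, and the values in the comment assume Unix-style path handling.

```perl
use strict;
use warnings;
use File::Basename 'fileparse';

# The ".txt" extension is an assumption made for illustration only.
my ($name, $dir, $suffix) = fileparse("/foo/bar/baz/bat.txt", qr/\.[^.]*/);

# On a Unix-like system: name=bat dir=/foo/bar/baz/ suffix=.txt
print "name=$name dir=$dir suffix=$suffix\n";
```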
Re^2: help with lazy matching by nlwhittle (Beadle) on Jan 05, 2015 at 22:19 UTC
++ on this comment, except that I don't think you need to escape the slash inside the negated character class. You can just write `[^/]`. Also (for the original post), you can leave out the `$_ =~` from your if statement if you want. Since there is no explicit variable in your while loop test, the if statement will match `$_` by default. --Nick
Re^3: help with lazy matching by AnomalousMonk (Archbishop) on Jan 05, 2015 at 23:09 UTC
You need to escape the forward slash only if this character is used as the regex delimiter character.

`c:\@Work\Perl\monks>perl -wMstrict -le "$_ = '/foo/bar/baz/bat'; print qq{'$1'} if /([^/]+)$/; "`
`Unmatched [ in regex; marked by <-- HERE in m/([ <-- HERE ^/ at ...`
`c:\@Work\Perl\monks>perl -wMstrict -le "$_ = '/foo/bar/baz/bat'; print qq{'$1'} if m{([^/]+)$}; "`
`'bat'`

Give a man a fish: `<%-(-(-(-<`
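To put the two working spellings next to each other, here is a short sketch (the variable names `$escaped` and `$braces` are mine, not from the thread). With `/` as the delimiter the slash inside the class has to be escaped; with `m{ }` it does not.

```perl
use strict;
use warnings;

my $path = '/foo/bar/baz/bat';

# Slash as delimiter: the slash inside the character class must be escaped.
my ($escaped) = $path =~ /([^\/]+)$/;

# Braces as delimiter: the slash needs no escaping.
my ($braces) = $path =~ m{([^/]+)$};

print "'$escaped' '$braces'\n";
```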
Re^2: help with lazy matching by Special_K (Pilgrim) on Jan 05, 2015 at 22:17 UTC
I guess my thinking was that with a non-greedy modifier, my regular expression could use the slash before "bat" to match the slash, then it would match "bat" as the .+, and then finally it would match the end of line character in the file as the $. Why does it not work that way?
Re^3: help with lazy matching by nlwhittle (Beadle) on Jan 05, 2015 at 22:27 UTC
The non-greedy modifier simply means "match as little as possible while still getting a successful match". Regex matches in Perl (and in Perl Compatible Regular Expressions) always prefer the leftmost starting position; in your case, that is the first slash. Where the non-greedy operator might have worked, for example, is if you wanted to match only 'foo'. Then you could write: `if ( /\/(.+?)\// )` This will match the first slash, then non-greedily match any other characters until another slash is reached. If you didn't use the non-greedy modifier here, you would match everything between the first and last slash (i.e. 'foo/bar/baz'). --Nick
Re^4: help with lazy matching by Special_K (Pilgrim) on Jan 05, 2015 at 23:07 UTC
Re^5: help with lazy matching by davido (Cardinal) on Jan 05, 2015 at 23:51 UTC
Re^3: help with lazy matching by Anonymous Monk on Jan 05, 2015 at 22:58 UTC
I like the description in the Camel:

"... regular expressions will try to match as early as possible. This even takes precedence over being greedy. Since scanning happens left to right, the pattern will match as far left as possible, even if there is some other place where it could match longer. (Regular expressions may be greedy, but they aren’t into delayed gratification.) ..."

(copied from the free sample material on the O'Reilly website, `http://cdn.oreillystatic.com/oreilly/booksamplers/9780596004927_sampler.pdf`, book page 44)

Another key thing to realize is that the `$` does not change the behavior to scanning from right to left.
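One way to see "as early as possible" in action is to look at where the match starts, using the built-in `@-` array (`$-[0]` is the start offset of the whole match). A minimal sketch; the helper name `where_and_what` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (start offset of the match, first capture),
# or an empty list if the pattern does not match.
sub where_and_what {
    my ($str, $re) = @_;
    return unless $str =~ $re;
    my ($start, $capture) = ($-[0], $1);
    return ($start, $capture);
}

my $s = "/foo/bar/baz/bat";
for my $re (qr{/(.+)$}, qr{/(.+?)$}) {
    my ($start, $capture) = where_and_what($s, $re);
    print "start=$start capture=$capture\n";
}
```

Both the greedy and the non-greedy pattern start at offset 0 and capture `foo/bar/baz/bat`: with the `$` anchor in place, the quantifier's laziness never gets a chance to change the result.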
Re^3: help with lazy matching ( .+? versus [^/]+? rxrx -Mre=debug ) by Anonymous Monk on Jan 05, 2015 at 22:37 UTC
"Why does it not work that way?"

The regex metacharacter dot (`.`) means "match any character" (except newline by default, or including newline under the `/s` modifier). It starts to match after the first `/` is matched, and it matches all subsequent `/` characters as well.

This is a FAQ, but a hard one to search for :)

`use re 'debug';` and watch it work.

Use rxrx and watch it work.
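The newline caveat is easy to check with a two-line string. This is only a sketch with made-up input (`$text` is not from the original question): without `/s` the dot cannot cross the newline, so the first attempt at offset 0 fails and the engine moves on to the second slash; with `/s` the first attempt succeeds and swallows everything.

```perl
use strict;
use warnings;

my $text = "/foo\n/bar";    # hypothetical two-line input

# Default: dot does not match "\n", so the match can only succeed
# starting from the second slash.
my ($default) = $text =~ m{/(.+)$};

# With /s: dot also matches "\n", so the match starts at the first slash.
my ($dotall) = $text =~ m{/(.+)$}s;

print "default: [$default]\n";
print "with /s: [$dotall]\n";
```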
Re: help with lazy matching by Anonymous Monk on Jan 05, 2015 at 21:38 UTC
$ perl -e "use Path::Tiny; print path( q{ro/sham/bo} )->basename "
$ perl -Mre=debug -e " $_ = q{ro/sham/bo} ; print m{/([^/]+?)$} "
Compiling REx "/([^/]+?)$"
Final program:
1: EXACT </> (3)
3: OPEN1 (5)
5: MINMOD (6)
6: PLUS (18)
7: ANYOF[\x00-.0-\xff][{unicode_all}] (0)
18: CLOSE1 (20)
20: EOL (21)
21: END (0)
anchored "/" at 0 floating ""$ at 2..2147483647 (checking anchored) minlen 2
Guessing start of match in sv for REx "/([^/]+?)$" against "ro/sham/bo"
Found anchored substr "/" at offset 2...
Found floating substr ""$ at offset 10...
Starting position does not contradict /^/m...
Guessed: match at offset 2
Matching REx "/([^/]+?)$" against "/sham/bo"
2 <ro> </sham/bo> | 1:EXACT </>(3)
3 <ro/> <sham/bo> | 3:OPEN1(5)
3 <ro/> <sham/bo> | 5:MINMOD(6)
3 <ro/> <sham/bo> | 6:PLUS(18)
ANYOF[\x00-.0-\xff][{unicode_all}] can match 1 times out of 1...
4 <ro/s> <ham/bo> | 18: CLOSE1(20)
4 <ro/s> <ham/bo> | 20: EOL(21)
failed...
ANYOF[\x00-.0-\xff][{unicode_all}] can match 1 times out of 1...
5 <ro/sh> <am/bo> | 18: CLOSE1(20)
5 <ro/sh> <am/bo> | 20: EOL(21)
failed...
ANYOF[\x00-.0-\xff][{unicode_all}] can match 1 times out of 1...
6 <ro/sha> <m/bo> | 18: CLOSE1(20)
6 <ro/sha> <m/bo> | 20: EOL(21)
failed...
ANYOF[\x00-.0-\xff][{unicode_all}] can match 1 times out of 1...
7 <ro/sham> </bo> | 18: CLOSE1(20)
7 <ro/sham> </bo> | 20: EOL(21)
failed...
ANYOF[\x00-.0-\xff][{unicode_all}] can match 0 times out of 1...
failed...
7 <ro/sham> </bo> | 1:EXACT </>(3)
8 <ro/sham/> <bo> | 3:OPEN1(5)
8 <ro/sham/> <bo> | 5:MINMOD(6)
8 <ro/sham/> <bo> | 6:PLUS(18)
ANYOF[\x00-.0-\xff][{unicode_all}] can match 1 times out of 1...
9 <ro/sham/b> <o> | 18: CLOSE1(20)
9 <ro/sham/b> <o> | 20: EOL(21)
failed...
ANYOF[\x00-.0-\xff][{unicode_all}] can match 1 times out of 1...
10 <ro/sham/bo> <> | 18: CLOSE1(20)
10 <ro/sham/bo> <> | 20: EOL(21)
10 <ro/sham/bo> <> | 21: END(0)
Match successful!
boFreeing REx: "/([^/]+?)$"
$
Re: help with lazy matching by Anonymous Monk on Jan 05, 2015 at 22:33 UTC
"for the sake of the discussion I would like to learn how the lazy operator works and why it isn't working in this case"

Conceptually (disregarding optimizations, for instance), this is how it works:

`m{ / (.+?) $ }x # for readability`

It will try the string character by character. So, first of all, it will try to match 'forward slash'. That will match:

`/ # matched so far`

Then, the expression `.+?` is really the same as `..*?`. So the regex engine will try to match any character except newline (for the first dot). That will match:

`/f # matched so far`

Then, it will come to a choice. `*` means '0 or more'. First of all, the engine will save its state. Then it will try to match nothing for `.*`. That will match (the empty match always matches):

`/f # matched so far; decision point is saved`

Then, it will try to match 'dollar': end of string, or just before the newline at the end. That will fail, because the end of the string has not been reached yet. Then, it will backtrack: the engine will load the previous 'saved state' and will try the other decision in an attempt to match. It will try to match something (rather than the empty string). That will match:

`/fo # matched so far`

Then, it will save its state and try to match nothing, which will be successful:

`/fo # matched so far, decision point saved`

Then it will try to match the end of line again. In case of failure, it will reload the previous state and try to match something instead:

`/foo # matched so far`

The engine will keep doing that, alternating between decisions, until it reaches the end of the line. Considerable optimizations are possible here, as you might have noticed (and Perl's engine is heavily optimized). But, in principle, this is how it should work for the kind of regex engine that Perl uses.
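The walk-through above can be imitated by hand in a few lines of Perl. This is only a conceptual sketch of `m{/(.+?)$}` under the same "no optimizations" assumption; the sub name `lazy_match_trace` is hypothetical, and the real engine does not work on substrings like this. It records every "matched so far" state it tries before the end-of-string test finally succeeds.

```perl
use strict;
use warnings;

# Conceptual imitation of m{/(.+?)$}; hypothetical helper, no optimizations.
# Returns (captured text or undef, array ref of "matched so far" states).
sub lazy_match_trace {
    my ($str) = @_;
    my @trace;
    my $len = length $str;

    for my $start (0 .. $len - 1) {    # try start positions left to right
        next unless substr($str, $start, 1) eq '/';    # literal slash first
        my $pos = $start + 1;

        # Lazily extend the capture one non-newline character at a time.
        while ($pos < $len && substr($str, $pos, 1) ne "\n") {
            $pos++;
            push @trace, substr($str, $start, $pos - $start);

            # '$' succeeds at the end, or just before a final newline.
            my $at_end = $pos == $len
                || ($pos == $len - 1 && substr($str, $pos, 1) eq "\n");
            if ($at_end) {
                return (substr($str, $start + 1, $pos - $start - 1), \@trace);
            }
        }
    }
    return (undef, \@trace);
}

my ($capture, $trace) = lazy_match_trace("/foo/bar/baz/bat");
print "tried: $_\n" for @$trace;
print "captured: $capture\n";
```

The trace grows from `/f` to the whole string, one character per step, which is exactly why the lazy quantifier ends up capturing `foo/bar/baz/bat`: the first start position that can reach the end wins, and nothing ever asks the engine to try a later slash.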