Ovid, Long Live .*? (dot star question-mark)

Replies are listed 'Best First'.
Re: Ovid, Long Live .*? (dot star question-mark) by blakem (Monsignor) on Sep 06, 2001 at 05:14 UTC
Thats great! I always thought the non-greedy dot-star was so much more intuitive than a negated character class. Especially since the char in question is likely to be one of the "funny looking" ones (i.e. \W). I've cringed a few times when replacing `<.?>` with `<[^>]>` in a few scripts I've inherited. I know it's the right thing to do, but IMHO it makes the code less readable. Its good to know that the rules have changed for 5.6. -Blake	[reply] [d/l] [select]
Re: Re: Ovid, Long Live .*? (dot star question-mark) by danger (Priest) on Sep 06, 2001 at 05:46 UTC
However, capturing just a non-greedy dot star will still suffer from having to test the remaining pattern (outside of the parens) at each step. Thus, the negated character class will perform a lot better in the following: `cc => sub { "this is an amazingly long string" =~ /\s([^l])l/ }, ds => sub { "this is an amazingly long string" =~ /\s(.?)l/ },` [download] However, both approaches express different things (they just happen to functionally coincide in the above). For some things, .? is the right approach, for others, a negated character class is the right approach. And, to add to japhy's additional warning regarding the stricter meaning of a negated character class, I'll offer another example. For those who do not see the potential difference in meaning and use of each approach, consider the following contrived example: I want to match (and extract) the first two fields of colon separated data, but only when the third field starts with an 'A' (let's not worry about whether split() would be a better approach for a minute): `#!/usr/bin/perl -w use strict; my %data; while(<DATA>){ next unless m/^(.?):(.?):A/; # non-greedy DS #next unless m/^([^:]):([^:]):A/; # negated CC $data{$1} = $2; } while( my($k,$v) = each %data) { print "$k => $v\n"; } __DATA__ abc:123:A:B def:456:A:C ghi:789:B:A jkl:000:C:C OUTPUT: non-greedy DS: abc => 123 def => 456 ghi => 789:B negated CC: abc => 123 def => 456` [download] The non-greedy DS version doesn't work according the spec (only the first two lines have an 'A' in the 3rd field). That's because dot star part in `(.?):` does not say "match only up to the next colon" (as some people occassionally believe it does), it says: "match as few (of any characters1) as we can and still have the remainder of the expression match". When the whole pattern is `(.*?):`, the end result (aside from efficiency) is the same --- but if the pattern that follows is more than a single character, things are not at all the same as a negated character class. I only wanted to reiterate this because I've often seen beginners and more experienced programmer's make the mistake of thinking that the non-greedy dot star and a negated character class are interchangeable, and they simply aren't. `[1] well 'any character' except a newline, unless /s` [download]	[reply] [d/l] [select]
Re: Re: Re: Ovid, Long Live .*? (dot star question-mark) by japhy (Canon) on Sep 06, 2001 at 05:56 UTC
Wow. That sucks about capturing. I guess the engine is really specific about that optimization. There's got to be a reason that capturing is so bad like that. _____________________________________________________ Jeff`[japhy]`Pinyan: Perl, regex, and perl hacker. `s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;`	[reply]
Re (tilly) 4: Ovid, Long Live .*? (dot star question-mark) by tilly (Archbishop) on Sep 06, 2001 at 18:10 UTC
Capturing .* and .*? is fast now. :) by japhy (Canon) on Sep 06, 2001 at 20:32 UTC
I found out why captuuring them does not activate the optimisation. It has to do with the regex tree. A regex like `/.?foo/` becomes: `1 MINMOD(2) 2 STAR(4) 3 REG_ANY(0) 4 EXACT <foo>(6) 6 END` [download] whereas the tree for `/(.?)foo/` is: `1 OPEN1(3) 3 MINMOD(4) 4 STAR(6) 5 REG_ANY(0) 6 CLOSE1(8) 8 EXACT <foo>(10) 10 END` [download] In the non-capturing version, the node directly after STAR is EXACT. In the capturing version, the node is CLOSE. I patched `regexec.c` to allow for a CLOSE in between the two. _____________________________________________________ Jeff`[japhy]`Pinyan: Perl, regex, and perl hacker. `s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;`	[reply] [d/l] [select]
Re: Re: Ovid, Long Live .*? (dot star question-mark) by japhy (Canon) on Sep 06, 2001 at 05:20 UTC
It should be warned, though, that using a negated character class is FAR stricter than a `.?` in the regex. Code like `/a.?abc/` can match an "a" in the `.?` if it has to. A regex of `/a[^a]abc/` can not. _____________________________________________________ Jeff`[japhy]`Pinyan: Perl, regex, and perl hacker. `s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;`	[reply]
(Ovid) Re: Ovid, Long Live .*? (dot star question-mark) by Ovid (Cardinal) on Sep 06, 2001 at 20:24 UTC
Well, my name's in the title, I have to respond, right? :) Are you sure that's 5.6? I was playing around with that when I saw your post on the beginner's list. I've gotten the following: `cmpthese(-3, { cc => sub { 'asdfasdfXasdfasdfXsdfasdfXasdfasdf111XYZ' =~ /(?:[^ +X]X)+?YZ/ }, ds => sub { 'asdfasdfXasdfasdfXsdfasdfXasdfasdf111XYZ' =~ /.?XY +Z/ }, });` [download] Results: `Benchmark: running cc, ds, each for at least 3 CPU seconds... cc: 4 wallclock secs ( 3.12 usr + 0.00 sys = 3.12 CPU) +@ 101652.97/s (n=316649) ds: 4 wallclock secs ( 3.12 usr + 0.00 sys = 3.12 CPU) +@ 243115.24/s (n=759492) Rate cc ds cc 101653/s -- -58% ds 243115/s 139% --` [download] And the version: `C:\>perl -v This is perl, v5.6.0 built for MSWin32-x86-multi-thread (with 1 registered patch, see perl -V for more detail) Copyright 1987-2000, Larry Wall Binary build 620 provided by ActiveState Tool Corp. http:/ +/www.ActiveState.com Built 18:31:05 Oct 31 2000` [download] Update: I upgraded to 5.6.1, build 629 and am getting similar results: `Benchmark: running cc, ds, each for at least 3 CPU seconds... cc: 3 wallclock secs ( 3.19 usr + 0.01 sys = 3.20 CPU) +@ 116044.62/s (n=371923) ds: 3 wallclock secs ( 3.13 usr + 0.00 sys = 3.13 CPU) +@ 250170.39/s (n=784034) Rate cc ds cc 116045/s -- -54% ds 250170/s 116% --` [download] Just to be safe, I used your exact `cmpthese` code above: `Benchmark: running cc, ds, each for at least 3 CPU seconds... cc: 2 wallclock secs ( 3.08 usr + 0.01 sys = 3.09 CPU) +@ 319288.07/s (n=987558) ds: 3 wallclock secs ( 3.19 usr + -0.01 sys = 3.18 CPU) +@ 292251.18/s (n=930820) Rate ds cc ds 292251/s -- -8% cc 319288/s 9% --` [download] After running it several times to get consistent results, I see that the optimization still seems to be dependant on the text one is matching against. Am I just missing something obvious here? Do the 'non-capturing' parens throw it off? My original code comes from your reply on the beginner's list. Cheers, Ovid P.S.: if you're one of the three Perlmonks who doesn't "get" the reference in japhy's title, see Death to Dot Star! Vote for paco! Join the Perlmonks Setiathome Group or just click on the the link and check out our stats.	[reply] [d/l] [select]
Re: (Ovid) Re: Ovid, Long Live .*? (dot star question-mark) by japhy (Canon) on Sep 06, 2001 at 20:53 UTC
Ok. Allow me to explain how the optimization works. If a quantifier is proceeded by an exact character, then the quantifier knows how far ahead it can jump safely, so it doesn't have to match character-by-character. If the quantifier is inside capturing parens, this optimization is not on, because the node after the quantifier is not an exact character, but the node for signalling the closing of the paren. My optimization tells perl to look beyond a closing paren. Technically, it should also look past ANY closing parens (not just one). And perhaps code evaluations as well... The optimization does depend on the frequency of that "jump ahead" character. _____________________________________________________ Jeff`[japhy]`Pinyan: Perl, regex, and perl hacker. `s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;`	[reply]