However, *capturing* just a non-greedy dot star will still suffer from
having to test the remaining pattern (outside of the parens) at each
step. Thus, the negated character class will perform a lot better in
the following:
cc => sub { "this is an amazingly long string" =~ /\s([^l]*)l/ },
ds => sub { "this is an amazingly long string" =~ /\s(.*?)l/ },
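For anyone who wants to try this, here is a minimal sketch of how those two entries could be dropped into a complete script. It assumes the core Benchmark module and its cmpthese function; the $str variable, the sanity-check prints, and the one-second run time are my own choices for illustration, and the actual numbers will vary with machine and perl version.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Same test string as in the two entries above ($str is an illustrative name).
my $str = "this is an amazingly long string";

# Sanity check: on this string both patterns capture the same text.
my ($cc) = $str =~ /\s([^l]*)l/;
my ($ds) = $str =~ /\s(.*?)l/;
print "cc captured: '$cc'\n";
print "ds captured: '$ds'\n";

# A negative count asks Benchmark to run each sub for at least
# that many CPU seconds (here: one second each).
cmpthese(-1, {
    cc => sub { $str =~ /\s([^l]*)l/ },
    ds => sub { $str =~ /\s(.*?)l/ },
});
```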
However, the two approaches *express* different things (they just happen to
coincide functionally in the above). For some things, .*? is the right
approach; for others, a negated character class is.
And, to add to japhy's additional warning regarding the stricter
meaning of a negated character class, I'll offer another example. For
those who do not see the potential difference in meaning and use of
each approach, consider the following contrived example: I want to
match (and extract) the first two fields of colon separated data, but
only when the third field starts with an 'A' (let's not worry about
whether split() would be a better approach for a minute):
#!/usr/bin/perl -w
use strict;
my %data;
while (<DATA>) {
    next unless m/^(.*?):(.*?):A/;          # non-greedy DS
    #next unless m/^([^:]*):([^:]*):A/;     # negated CC
    $data{$1} = $2;
}
while (my ($k, $v) = each %data) {
    print "$k => $v\n";
}
__DATA__
abc:123:A:B
def:456:A:C
ghi:789:B:A
jkl:000:C:C
OUTPUT:
non-greedy DS:
abc => 123
def => 456
ghi => 789:B
negated CC:
abc => 123
def => 456
The non-greedy DS version doesn't work according to the spec (only the
first two lines have an 'A' in the 3rd field).
That's because the dot star part in (.*?): does not say
"match only up to the next colon" (as some people occasionally
believe it does); it says: "match as few (of *any* characters[1]) as
we can and still have the remainder of the expression match". When
the whole pattern is (.*?):, the end result (aside from
efficiency) is the same, but if the pattern that follows is more
than a single character, things are not at all the same as a negated
character class.
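To make that concrete, here is a small sketch that runs both styles against the third data line from the example, first with a pattern that ends at the colon and then with the longer pattern. The variable names are mine and purely illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = "ghi:789:B:A";

# The whole pattern is the capture plus one colon: both forms agree.
my ($lazy_one)  = $line =~ /^(.*?):/;
my ($class_one) = $line =~ /^([^:]*):/;
print "single colon: lazy='$lazy_one' class='$class_one'\n";

# More pattern follows: the lazy dot star is allowed to cross a colon
# if that is what it takes for the trailing ":A" to match.
my @lazy  = $line =~ /^(.*?):(.*?):A/;
my @class = $line =~ /^([^:]*):([^:]*):A/;

print "lazy:  ", (@lazy  ? "matched, \$2='$lazy[1]'"  : "no match"), "\n";
print "class: ", (@class ? "matched, \$2='$class[1]'" : "no match"), "\n";
```

The lazy version matches with $2 set to '789:B', while the negated character class refuses to match at all, which is what the spec asked for.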
I only wanted to reiterate this because I've often seen beginners and
more experienced programmers make the mistake of thinking that the
non-greedy dot star and a negated character class are interchangeable,
and they simply aren't.
[1] Well, 'any character' except a newline, unless the /s modifier is used.