An "ethical" use of dot-star ..?

Monks,

I've recently been reading with interest some of the previous discussions on the use of .* and .*? that are scattered around the Monastery (Death to Dot Star!, Dot star okay, or not? and Ovid, Long Live .*? (dot star question-mark), among others). These have gone into why .* and its friends are considered bad, and I think I understand the reasoning behind this point of view.

I've recently had to write some code at work, though, which got me thinking about this. The code is simple enough - it parses some XML tags to grab data from a file. (Aside: See Production Environments and "Foreign" Code for why I can't just use the XML modules, which I'd much rather do). The code, however, implements a regex to grab the data from the file - and uses as part of this the dreaded .*, albeit in a non-greedy fashion.

I've thought about this long and hard, and I can't see a straightforward, easy-to-read way of implementing the same code without the .*?. The regex I wrote and an example are below.

my $example = q[<ClientID type="String">A1234BX</ClientID>];
$example =~ /^\s*\<(\w+)\s[\w\"\=]+\>(.*?)\<\//;

my $tag  = $1;
my $data = $2;

# do something with the data

I've considered using character classes and look-aheads to pull the data between the two XML tags (which can include a wide and interesting array of alphanumeric and other characters), but I can't see how these would be either beneficial or efficient for a large set of data.

I guess I'm interested to know what the general consensus for the use of .* is. Is it something to be avoided at all costs, or is it a powerful, oft-misused tool that can be useful and beneficial in carefully controlled circumstances?

While I'm at it *grin*, does anyone have a "better idea" for pulling the data out of the tags? Would this count as an acceptable exception to the "Don't Use Dot Star" rule that seems to be prevalent throughout the Monastery?

Any opinions, suggestions and comments are welcome :)

-- Foxcub
#include www.liquidfusion.org.uk



Re: An "ethical" use of dot-star ..?
by broquaint (Abbot) on Jun 02, 2003 at 14:12 UTC

I guess I'm interested to know what the general consensus for the use of .* is. Is it something to be avoided at all costs, or is it a powerful, oft-misused tool that can be useful and beneficial in carefully controlled circumstances?

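My rule of thumb for .* versus .*? is that the former is for grabbing everything after a certain point (I can't be bothered with $') and the latter for grabbing data between 2 points.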

While I'm at it *grin*, does anyone have a "better idea" for pulling the data out of the tags?

## *very* simplistic stuff (e.g doesn't deal with nested tags)

my $token     = qr{ (?: \b [A-Z]\w+ \b ) }xi;
my $attrib    = qr{ (?: $token \s* = \s* "[^"]+" \s* ) }x;
my $begin_tag = qr{ < ( $token ) \s* ( $attrib* ) > }x;
my $end_tag   = qr{ </$token> }x;

my $example = q[<ClientID type="String">A1234BX</ClientID>];

my($tag, $attribs, $data) =
  $example =~ m{ $begin_tag (.*?) $end_tag }x;

print "tag     - $tag\n";
print "attribs - $attribs\n";
print "data    - $data\n";

__output__

tag     - ClientID
attribs - type="String"
data    - A1234BX

_________ broquaint

Re: Re: An "ethical" use of dot-star ..?

by sauoq (Abbot) on Jun 03, 2003 at 00:23 UTC

My rule of thumb for .* versus .*? is that the former is for grabbing everything after a certain point (I can't be bothered with $') and the latter for grabbing data between 2 points.

I'd say that .*? is most often useful when grabbing things between two points and the second point is defined by a string of more than one character. If the right hand side can be recognized by a single character I'd suggest a negated character class instead. For example, I'd almost always prefer using /[^x]*/ to using /.*?x/ because the former is explicit in its exclusion of x's. :-)
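A minimal sketch of what "explicit in its exclusion" buys you, assuming a made-up semicolon-delimited string (the input and variable names are illustrative, not from the thread). When the rest of the pattern cannot match straight away, .*? will quietly stretch across a delimiter, whereas the negated class refuses to:

```perl
use strict;
use warnings;

# Hypothetical input: fields separated by the single character ';'.
my $line = 'a=1;b=2;c=3';

# Non-greedy dot: it keeps growing until ';c=' can match,
# so it ends up swallowing a ';' along the way.
my ($lazy) = $line =~ /^a=(.*?);c=/;

# Negated class: it can never contain a ';', so this pattern fails
# instead of silently capturing two fields.
my ($negated) = $line =~ /^a=([^;]*);c=/;

print "lazy    - ", (defined $lazy    ? $lazy    : 'no match'), "\n";
print "negated - ", (defined $negated ? $negated : 'no match'), "\n";

__END__
lazy    - 1;b=2
negated - no match
```

Whether the failure or the stretched capture is the "right" result depends on the data, but the negated class at least makes the choice visible in the pattern.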

-sauoq
"My two cents aren't worth a dime.";

Re: An "ethical" use of dot-star ..?
by fglock (Vicar) on Jun 02, 2003 at 13:45 UTC

You could use this construct, instead of the dot-star:

([^<]*)

Re: Re: An "ethical" use of dot-star ..?

by Tanalis (Curate) on Jun 02, 2003 at 14:26 UTC

What does that gain, though, over .*?? Is it more efficient, or is it simply avoiding the "problem" by coding round it?

Also, to reliably detect a closing tag, you need to match </. There's nothing to stop "<" from appearing in the data (in fact, it's likely for limits we impose locally on credit). The .* construct would have correctly matched "<" without terminating, and continued to match up until it found the </ of a closing tag.

I'm not convinced that the alternative you suggest would have the same effect on the data, or grab the same data, as the .*.

-- Foxcub
#include www.liquidfusion.org.uk


Re: Re: Re: An "ethical" use of dot-star ..?

by fglock (Vicar) on Jun 02, 2003 at 14:42 UTC

There is an explanation for why this is slightly better, in Death to Dot Star!.

If you apply the change, the regex will look like:

$example =~ /^\s*\<(\w+)\s[\w\"\=]+\>([^<]*)\<\//;

I think it will not have a different effect on the data.

Re: Re: Re: An "ethical" use of dot-star ..?

by PodMaster (Abbot) on Jun 02, 2003 at 14:51 UTC

MJD says you can't just make shit up and expect the computer to know what you mean, retardo!
I run a Win32 PPM repository for perl 5.6x+5.8x. I take requests.
** The Third rule of perl club is a statement of fact: pod is sexy.

Re: Re: Re: An "ethical" use of dot-star ..?

by sauoq (Abbot) on Jun 03, 2003 at 00:13 UTC

What does that gain, though, over .*??

It better expresses what you are actually trying to do. (I'm actually not entirely sure that's true in your case, but it may be.)

For one, using [^<]* will match a newline. Your original regex will not. You'd have to use a /s modifier for that.
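A small sketch of the newline point, assuming a hypothetical element whose data spans two lines (the element name and the simplified patterns are illustrative only):

```perl
use strict;
use warnings;

# Hypothetical element whose data spans two lines.
my $xml = qq{<Note type="String">line one\nline two</Note>};

my ($dot)     = $xml =~ m{>(.*?)</};    # '.' cannot cross "\n", so no match
my ($dot_s)   = $xml =~ m{>(.*?)</}s;   # /s lets '.' match "\n" as well
my ($negated) = $xml =~ m{>([^<]*)</};  # "\n" is simply "not a <"

for my $result ([dot => $dot], [dot_s => $dot_s], [negated => $negated]) {
    my ($label, $value) = @$result;
    printf "%-7s - %s\n", $label, defined $value ? 'matched' : 'no match';
}

__END__
dot     - no match
dot_s   - matched
negated - matched
```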

On the other hand, using [^<]* will simply fail to match on strings like: "<inequality>X < Y</inequality>" but maybe that's fine in your case.
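To make that concrete, here is a sketch run against the string above. The third pattern is not something proposed in the thread; it is one possible look-ahead variant of the kind the original post mentions considering, included only for comparison:

```perl
use strict;
use warnings;

my $xml = '<inequality>X < Y</inequality>';

# Non-greedy dot: stops at the first '</'.
my ($lazy) = $xml =~ m{^<\w+>(.*?)</};

# Negated class: stops at the first '<', which is not followed by '/',
# so the whole match fails.
my ($negated) = $xml =~ m{^<\w+>([^<]*)</};

# Assumed alternative: any non-'<' character, or a '<' that does not
# start a closing tag.
my ($guarded) = $xml =~ m{^<\w+>((?:[^<]|<(?!/))*)</};

for my $result ([lazy => $lazy], [negated => $negated], [guarded => $guarded]) {
    my ($label, $value) = @$result;
    printf "%-7s - %s\n", $label, defined $value ? "'$value'" : 'no match';
}

__END__
lazy    - 'X < Y'
negated - no match
guarded - 'X < Y'
```

The guarded form says what is meant more precisely than .*? does, at an obvious cost in readability, which is roughly the trade-off the original question was asking about.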

By the way, yours will fail if there is a space between the '<' and the '/' in the end tag. Maybe you knew that though.... if that's what you wanted, it's fine.

And that's really the crux of the matter. There is nothing inherently wrong in using a dot-star. It's just misunderstood so often that it's prudent to warn people about it. The other day, I recommended someone use my ($file, $ext) = /(.*)\.(.*)/; to break a filename into its base and extension. Two dot-stars for the price of one there... but — shrug — it did what he needed. The key is understanding what you need and how best to express it. Don't say "zero or more (but as few as possible) of any character except a newline" when you really mean "as many non-Less-Than characters as possible."
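For the filename case, a quick sketch of what the two greedy dot-stars do. The helper name split_name and the sample filenames are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the regex quoted above.
# Returns (base, extension), or an empty list if there is no dot.
sub split_name {
    my ($name) = @_;
    return $name =~ /(.*)\.(.*)/;
}

for my $name ('report.txt', 'archive.tar.gz', 'README') {
    if (my ($file, $ext) = split_name($name)) {
        print "$name: base '$file', extension '$ext'\n";
    }
    else {
        print "$name: no dot, no match\n";
    }
}

__END__
report.txt: base 'report', extension 'txt'
archive.tar.gz: base 'archive.tar', extension 'gz'
README: no dot, no match
```

Because the first .* is greedy, it runs to the end of the string and backtracks only as far as the last dot, which is exactly the "everything up to the final extension" behaviour wanted here.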

-sauoq
"My two cents aren't worth a dime.";

Re: An "ethical" use of dot-star ..?
by thelenm (Vicar) on Jun 02, 2003 at 16:30 UTC

Would this count as an acceptable exception to the "Don't Use Dot Star" rule that seems to be prevalent throughout the Monastery?

If there is such a rule (I don't know that there is), it would seem to be cargo cult programming to me. One should use the appropriate tool for the job. Most of the time, there is a more correct or more efficient solution than dot-star, but if dot-star (greedy or non-greedy) suits your needs and works correctly, then by all means use it.

-- Mike

-- just,my${.02}

Re: An "ethical" use of dot-star ..?
by BrowserUk (Patriarch) on Jun 02, 2003 at 23:46 UTC

As with most "thou shalt nots" applied to Perl's TIMTOWTDI, there is a rationale behind them. But just as using the proscribed technique, method or construct blindly, without fully understanding what it does and what follows from it, is dangerous, so blindly avoiding it without understanding the reasoning is equally bad. Maybe more so.

Sometimes the 'grab all you can' behaviour is exactly the semantic that you need. One assumes that, as it is the 'default' behaviour, the people who designed and maintain the regex engine consider this to be the prevalent requirement.
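A minimal sketch of the two behaviours side by side, assuming a made-up string with two closing tags:

```perl
use strict;
use warnings;

# Hypothetical input containing two closing tags.
my $html = '<b>one</b> and <b>two</b>';

# Greedy: grabs to the end of the string, then backtracks to the LAST '</b>'.
my ($greedy) = $html =~ m{<b>(.*)</b>};

# Non-greedy: extends one character at a time until the FIRST '</b>' fits.
my ($lazy) = $html =~ m{<b>(.*?)</b>};

print "greedy - $greedy\n";
print "lazy   - $lazy\n";

__END__
greedy - one</b> and <b>two
lazy   - one
```

Neither result is wrong in itself; which one is wanted depends on whether "everything up to the last closing tag" or "the contents of the first element" is the actual requirement.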

Death to Dot Star! and friends serve the very useful purpose of highlighting the implications of using the feature in an unconstrained way. However, dogmatically not using the construct when it is the right tool for the job is equally bad and 'cargo-cultish'.

No one would advocate the total removal of the 'rd -r *' facility, despite the very real disasters that can ensue from its incorrect use.

<tongue-firmly-in-cheek>

Maybe Perl should prompt the programmer (or even pop up a dialog :) with "Are you sure you want to .*"? whenever it encounters a regex that uses it. Or maybe a new /* regex option is called for that says "Yes, I'm using .* and it's OK. I'm a programmer and I know what I'm doing" :)

</tongue-firmly-in-cheek>

Examine what is said, not who speaks.

"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

Re: An "ethical" use of dot-star ..?
by Aristotle (Chancellor) on Jun 03, 2003 at 01:01 UTC

.*

Makeshifts last the longest.

Re: Re: An "ethical" use of dot-star ..?

by Anonymous Monk on Jun 03, 2003 at 03:22 UTC

If you understand backtracking, you won't need to ask such questions.

You do? So, explain already.

Many of us don't, and maybe that is because the existing texts are too dense on the subject.

Maybe, given your new-found understanding fresh in your mind, you can put the concept into words that others in your prior condition will be able to follow and assimilate?

Re^3: An "ethical" use of dot-star ..?

by Aristotle (Chancellor) on Jun 03, 2003 at 11:03 UTC

Makeshifts last the longest.
