Monks,

I've recently been reading with interest some of the previous discussions on the use of .* and .*? that are scattered around the Monastery (Death to Dot Star!, Dot star okay, or not? and Ovid, Long Live .*? (dot star question-mark), among others). These have gone into why .* and its friends are considered bad, and I think I understand the reasoning behind this point of view.

I've recently had to write some code at work, though, which got me thinking about this. The code is simple enough - it parses some XML tags to grab data from a file. (Aside: See Production Environments and "Foreign" Code for why I can't just use the XML modules, which I'd much rather do). The code, however, implements a regex to grab the data from the file - and uses as part of this the dreaded .*, albeit in a non-greedy fashion.

I've thought about this long and hard, and I can't see a straightforward, easy-to-read way of implementing the same code without the .*?. The regex I wrote and an example are below.

my $example = q[<ClientID type="String">A1234BX</ClientID>];
$example =~ /^\s*\<(\w+)\s[\w\"\=]+\>(.*?)\<\//;
my $tag = $1;
my $data = $2;
# do something with the data

I've considered using character classes and look-aheads to pull the data between the two XML tags (which can include a wide and interesting array of alphanumeric and other characters), but I can't see how these would be either beneficial or efficient for a large set of data.

I guess I'm interested to know what the general consensus for the use of .* is. Is it something to be avoided at all costs, or is it a powerful, oft-misused tool that can be useful and beneficial in carefully controlled circumstances?

While I'm at it *grin*, does anyone have a "better idea" for pulling the data out of the tags? Would this count as an acceptable exception to the "Don't Use Dot Star" rule that seems to be prevalent throughout the Monastery?

Any opinions, suggestions and comments are welcome :)

-- Foxcub
#include www.liquidfusion.org.uk

Replies are listed 'Best First'.
Re: An "ethical" use of dot-star ..?
by broquaint (Abbot) on Jun 02, 2003 at 14:12 UTC
    I guess I'm interested to know what the general consensus for the use of .* is. Is it something to be avoided at all costs, or is it a powerful, oft-misused tool that can be useful and beneficial in carefully controlled circumstances?
    My rule of thumb for .* versus .*? is that the former is for grabbing everything after a certain point (I can't be bothered with $') and the latter for grabbing data between 2 points. So I guess it's a 'powerful oft-misused tool', but that's more due to the fact that people aren't aware of the concept of quantifier greediness.
    While I'm at it *grin*, does anyone have a "better idea" for pulling the data out of the tags?
    Due to the nature of XML it might be a good idea to have more layered regexes e.g
    ## *very* simplistic stuff (e.g doesn't deal with nested tags)
    my $token     = qr{ (?: \b [A-Z]\w+ \b ) }xi;
    my $attrib    = qr{ (?: $token \s* = \s* "[^"]+" \s* ) }x;
    my $begin_tag = qr{ < ( $token ) \s* ( $attrib* ) > }x;
    my $end_tag   = qr{ </$token> }x;

    my $example = q[<ClientID type="String">A1234BX</ClientID>];
    my($tag, $attribs, $data) =
      $example =~ m{ $begin_tag (.*?) $end_tag }x;

    print "tag - $tag\n";
    print "attribs - $attribs\n";
    print "data - $data\n";

    __output__

    tag - ClientID
    attribs - type="String"
    data - A1234BX
    That could be simplified into a single regex, but like most things complex, they're much easier to digest if they're broken down into smaller components.
    HTH

    _________
    broquaint

      My rule of thumb for .* versus .*? is that the former is for grabbing everything after a certain point (I can't be bothered with $') and the latter for grabbing data between 2 points.

      I'd say that .*? is most often useful when grabbing things between two points and the second point is defined by a string of more than one character. If the right hand side can be recognized by a single character I'd suggest a negated character class instead. For example, I'd almost always prefer using /[^x]*/ to using /.*?x/ because the former is explicit in its exclusion of x's. :-)
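
      A small sketch of where the two can differ, using a made-up input string (the string and the variable names are only for illustration). Both patterns look for a quoted string followed by a semicolon. The lazy dot is allowed to step over a quote to get there; the negated class is not.

      ```perl
      use strict;
      use warnings;

      # Hypothetical input: two quoted strings, only the second is followed by ';'
      my $line = q{say "a", "b";};

      # Lazy dot-star: '.' may match a quote, so the capture can span both strings
      my ($lazy) = $line =~ /"(.*?)";/;

      # Negated class: the capture can never contain a quote
      my ($negated) = $line =~ /"([^"]*)";/;

      print "lazy    - $lazy\n";      # a", "b
      print "negated - $negated\n";   # b
      ```

      The lazy version still matches, it just matches more than the "between two quotes" reading suggests, because being lazy only affects which alternative is tried first, not which characters the dot may consume.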

      -sauoq
      "My two cents aren't worth a dime.";
      
Re: An "ethical" use of dot-star ..?
by fglock (Vicar) on Jun 02, 2003 at 13:45 UTC

    You could use this construct, instead of the dot-star:

    ([^<]*)
      Yes.

      What does that gain, though, over .*?? Is it more efficient, or is it simply avoiding the "problem" by coding round it?

      Also, to reliably detect a closing tag, you need to match </. There's nothing to stop "<" from appearing in the data (in fact, it's likely for limits we impose locally on credit). The .* construct would have correctly matched "<" without terminating, and continued to match up until it found the </ of a closing tag.

      I'm not convinced that the alternative you suggest would have the same effect, or grab the same data, as the .*.

      -- Foxcub
      #include www.liquidfusion.org.uk

        There is an explanation for why this is slightly better, in Death to Dot Star!.

        If you apply the change, the regex will look like:

        $example =~ /^\s*\<(\w+)\s[\w\"\=]+\>([^<]*)\<\//;

        I think it will not have a different effect on the data.

        use YAPE::HTML by japhy


        MJD says you can't just make shit up and expect the computer to know what you mean, retardo!
        I run a Win32 PPM repository for perl 5.6x+5.8x. I take requests.
        ** The Third rule of perl club is a statement of fact: pod is sexy.

        What does that gain, though, over .*??

        It better expresses what you are actually trying to do. (I'm actually not entirely sure that's true in your case, but it may be.)

        For one, using [^<]* will match a newline. Your original regex will not. You'd have to use a /s modifier for that.

        On the other hand, using [^<]* will simply fail to match on strings like: "<inequality>X < Y</inequality>" but maybe that's fine in your case.
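
        To make those two differences concrete, here is a rough comparison on invented sample data (the inputs, the grab helper and the pattern labels are assumptions for this sketch, and the opening-tag part is simplified to a bare tag name). The last pattern is one possible compromise using a look-ahead, not something taken from the original code: it accepts a '<' in the data as long as it is not followed by '/'.

        ```perl
        use strict;
        use warnings;

        # Hypothetical sample data, invented for this comparison
        my $multiline = "<Note>line one\nline two</Note>";
        my $embedded  = "<inequality>X < Y</inequality>";

        # Return the captured text, or undef when the pattern does not match
        sub grab {
            my ($re, $string) = @_;
            return $string =~ $re ? $1 : undef;
        }

        my @patterns = (
            [ 'lazy dot'    => qr{^<\w+>(.*?)</} ],                # '.' stops at a newline
            [ 'lazy dot /s' => qr{^<\w+>(.*?)</}s ],               # /s lets '.' match newlines
            [ 'negated'     => qr{^<\w+>([^<]*)</} ],              # no '<' allowed in the data
            [ 'look-ahead'  => qr{^<\w+>((?:[^<]|<(?!/))*)</} ],   # '<' allowed unless followed by '/'
        );

        for my $p (@patterns) {
            my ($name, $re) = @$p;
            for my $input ($multiline, $embedded) {
                my $data = grab($re, $input);
                printf "%-12s %s\n", $name, defined $data ? "got [$data]" : 'no match';
            }
        }
        ```

        On this data the plain lazy dot fails on the multi-line value, the negated class fails on the embedded '<', and the /s and look-ahead versions capture both.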

        By the way, yours will fail if there is a space between the '<' and the '/' in the end tag. Maybe you knew that though.... if that's what you wanted, it's fine.

        And that's really the crux of the matter. There is nothing inherently wrong in using a dot-star. It's just misunderstood so often that it's prudent to warn people about it. The other day, I recommended someone use my ($file, $ext) = /(.*)\.(.*)/; to break a filename into its base and extension. Two dot-stars for the price of one there... but — shrug — it did what he needed. The key is understanding what you need and how best to express it. Don't say "zero or more (but as few as possible) of any character except a newline" when you really mean "as many non-Less-Than characters as possible."
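
        For the filename case, a quick sketch with a made-up name that has more than one dot (the name is an assumption, chosen to show where the split lands):

        ```perl
        use strict;
        use warnings;

        # Hypothetical filename with more than one dot
        my $name = 'archive.tar.gz';

        # The first (.*) is greedy, so it backs off only as far as the last dot
        my ($file, $ext) = $name =~ /(.*)\.(.*)/;

        # A lazy first group stops at the first dot instead
        my ($first, $rest) = $name =~ /(.*?)\.(.*)/;

        print "file - $file, ext - $ext\n";      # file - archive.tar, ext - gz
        print "first - $first, rest - $rest\n";  # first - archive, rest - tar.gz
        ```

        Which of the two is "right" depends entirely on what you mean by extension, which is the point: pick the quantifier that says what you mean.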

        -sauoq
        "My two cents aren't worth a dime.";
        
Re: An "ethical" use of dot-star ..?
by thelenm (Vicar) on Jun 02, 2003 at 16:30 UTC

    Would this count as an acceptable exception to the "Don't Use Dot Star" rule that seems to be prevalent throughout the Monastery?

    If there is such a rule (I don't know that there is), it would seem to be cargo cult programming to me. One should use the appropriate tool for the job. Most of the time, there is a more correct or more efficient solution than dot-star, but if dot-star (greedy or non-greedy) suits your needs and works correctly, then by all means use it.

    -- Mike

    --
    just,my${.02}

Re: An "ethical" use of dot-star ..?
by BrowserUk (Patriarch) on Jun 02, 2003 at 23:46 UTC

    As with most "thou shalt nots" applied to Perl's TIMTOWTDI, there is a rationale behind them. But just as it is dangerous to use the proscribed technique, method or construct blindly, without fully understanding what it actually does and the implications that come from it, so blindly avoiding the proscribed behaviour without understanding the reasoning is equally bad. Maybe more so.

    Sometimes the 'grab all you can' behaviour is exactly the semantic that you need. One assumes that, as it is the 'default' behaviour, the people who designed and maintain the regex engine consider this to be the prevalent requirement.

    Death to Dot Star! and friends serve the very useful purpose of highlighting the implications of using the feature in an unconstrained way. However, dogmatically avoiding the construct when it is the right tool for the job is equally bad and 'cargo-cultish'.

    No one would advocate the total removal of the 'rd -r *' facility, despite the very real disasters that can ensue from its incorrect use.

    <tongue-firmly-in-cheek>

    Maybe Perl should prompt the programmer (or even pop up a dialog :) with "Are you sure you want to .*"? whenever it encounters a regex that uses it. Or maybe a new /* regex option is called for that says "Yes, I'm using .* and it's OK. I'm a programmer and I know what I'm doing" :)

    </tongue-firmly-in-cheek>


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


Re: An "ethical" use of dot-star ..?
by Aristotle (Chancellor) on Jun 03, 2003 at 01:01 UTC
    If you understand backtracking, you won't need to ask such questions. I've recently read Jeffrey Friedl's "Mastering Regular Expressions" (2nd Ed.), and while I knew the key points about pattern matching - even backtracking - already, it helped me put it all together into a bigger picture. .* is fine if you know what it means - it just means something very, very different from what people intuitively expect. The star itself is much misunderstood and overused to begin with; you can write much more efficient and precise patterns if you know when not to and when to specifically use it. "Death to Dot Star" touches on the issues (and helped me gain much of my pre-book understanding), but it is simply too short to give you a better understanding of the big picture.
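
    A tiny illustration of the gap between intuition and what the engine does, on an invented string (the text and variable names are only for this sketch). Many people expect the second group to get the whole number in both cases.

    ```perl
    use strict;
    use warnings;

    my $text = 'The number is 12345';

    # Greedy: (.*) first takes the whole string, then gives back one
    # character at a time until (\d+) can match. One digit is enough.
    my ($greedy_head, $greedy_num) = $text =~ /(.*)(\d+)/;

    # Lazy: (.*?) takes as little as it can, so (\d+) starts at the
    # first digit and, being greedy itself, takes all of them.
    my ($lazy_head, $lazy_num) = $text =~ /(.*?)(\d+)/;

    print "greedy - [$greedy_head] [$greedy_num]\n";  # [The number is 1234] [5]
    print "lazy   - [$lazy_head] [$lazy_num]\n";      # [The number is ] [12345]
    ```

    Neither result is a bug. The greedy star simply stops backtracking at the first position where the rest of the pattern succeeds, and for \d+ a single digit counts as success.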

    Makeshifts last the longest.

      If you understand backtracking, you won't need to ask such questions.

      You do? So, explain already.

      Many of us don't, and maybe that is because the existing texts are too dense on the subject.

      Maybe, given your new-found understanding fresh in your mind, you can put the concept into words that others in your prior condition will be able to follow and assimilate?

        I can try (although I admit I am too lazy to - but see below), but keep in mind that Ovid tried too in his famous node, and while that one helped me a lot, it was no substitute for the book. Regular expressions basically are (as Larry observed in the relevant Apocalypse) a language unto themselves, and no one (or so I'd hope) would expect to learn a programming language from one post on a forum. One thing that struck me as I read the book was that there are many references, both backward and forward, all over the book. I may try at some point (and then it is likely to be a series of nodes rather than just one), but I don't believe I will manage to explain the mechanics better than the book, and certainly not in as much depth.

        Makeshifts last the longest.