wossname has asked for the wisdom of the Perl Monks concerning the following question:

I'm in the process of learning Perl (5.8) in my spare time and as a mental (and spiritual) excercise I am attempting to design a regular expression that will enable me to convert raw "METAR" weather reports into something that is more human readable.

I know how to do the legwork to make this possible... but I have a question about efficiency.

I'm breaking this task down into sections. For each section I am using "named capture" to cache each important value to a variable. This works nicely. But due to the (very) complex nature of the METAR standard, I am forced to use many levels of nested grouping.

So my question is, can I speed up the following...

m/((xxxxxxxx)(yyyyyyyy)(zzzzzzzz))/

...by using "non-capturing parentheses" for the groups below the outer nesting level?...

m/((?:xxxxxxxx)(?:yyyyyyyy)(?:zzzzzzzz))/

I've reduced my problem down to basic principles as best I can. As I understand it, the outer parentheses will still capture everything inside.

So, is that second regex pattern more efficient than the first?

I would benchmark this myself normally, but I'm only about 20% of the way through designing the entire regex for METAR codes and I don't want to go too far down "the wrong path" if you see what I mean.

Advice gratefully received.

Wossy. :)
ps. METAR codes specification: (http://www.ofcm.gov/fmh-1/fmh1.htm see chapter 12 if you're interested, although the entire document is very interesting in itself).

Replies are listed 'Best First'.
Re: Speeding up a large regex match pattern
by Fletch (Bishop) on Feb 04, 2009 at 21:04 UTC

    As to your question, if you're having to go through a whole bunch of hoops and nested captures and what not it's probably a sign that just regexen aren't going to be the best solution to the problem. You probably want to aim more at writing an actual parser and use simpler regexen to break the string into tokens.

    Also you say this is a learning exercise but be aware it's a solved problem (Geo::METAR).

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

Re: Speeding up a large regex match pattern
by roboticus (Chancellor) on Feb 04, 2009 at 21:04 UTC
    wossname:

    You might find node Weather reports on the HP4350's front panel helpful...

    I know nothing of METAR, and I've never used Parser::RecDescent, so take this with a grain of salt...

    I suspect that if the METAR reports are of any significant complexity, it may be worth your while to make a parser (using the aforementioned Parser::RecDescent module). Regular expressions are great, but when things get complex enough, you should use a parser for the job. As an example, many people use simple regular expressions to do things like remove bold markers from HTML documents. But when they start doing more complex things, you frequently see suggestions to use HTML::Parser and brethren...

    ...roboticus
Re: Speeding up a large regex match pattern
by BrowserUk (Patriarch) on Feb 04, 2009 at 23:53 UTC

    To answer the question you asked: Yes, replacing nested capturing subgroups with non-capturing parens will improve the performance of your regexes.

    In a rather crude test, almost exactly 1 order of magnitude improvement for each pair of capturing parens that you replace. That is m[((?:xxx)(?:yyy))]g runs almost exactly twice as fast as m[((xxx)(yyy))]g; and m[((?:xxx)(?:yyy)(?:zzz))]g runs almost exactly 3 times faster than m[((xxx)(yyy)(zzz))]g;; and so on.

    In that test, I was assigning all the captures to an array, which obviously results in differing numbers of values in the arrays.

    If you are only refering to the captures you are interested in by their aliases ($1, and ignoring $2 and $3 etc.), then the impact of using the capturing parens that you don't need is less (around 15%), but it is still significant.

    The upshot is: only capture what you need, and use non-capturing parens anywhere else.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Speeding up a large regex match pattern
by gone2015 (Deacon) on Feb 04, 2009 at 21:12 UTC

    As a general proposition, avoiding using unnecessary captures is a Good Thing.

    Also as a general proposition, optimisation should be avoided until it becomes obvious that it is unavoidable.

    use re 'debug' can tell you a lot about what's going on -- getting a trace of what happens on some real data can be quite instructive. YAPE::Regex::Explain is reputed to be useful, but it does look a little old -- other monks may be able to advise.

    One thing that can really slow down regex is back-tracking. The more specific the matching elements, the less opportunity you give for backtracking, the better. Note that a floating start to a regex is an excellent way of triggering a lot of backtracking !

      Thanks for the fast replies everyone!

      Yeah I realise this METAR thing has been done many times and in many ways. I am merely using it as a means to test my knowledge of regular expressions.

      I have no real use for a METAR processor script but the huge amount of freely available data combined with the significant complexity of the protocol make it an ideal case study for learning the ropes, as it were.

      I'm interested in the links that you kindly posted though. I think I'll probably finish my "hard way" METAR script first and then I'll try it again using a proper parser. That way I think I'll get the benefit from both coding the logic manually and also taking a more high level approach.

      Thanks again.

      Wosser.

      I'm so glad I found perlmonks.org, this place is a far more substantial knowledge base than anywhere else I've found.
Re: Speeding up a large regex match pattern
by ww (Archbishop) on Feb 04, 2009 at 21:26 UTC

    You may need nested grouping toward the content based on the end of the spec, but at least the early part of a METAR should give you no concern.

    The following approach embodies the advice (brilliant minds come to similar solutions?) re breaking the METAR or SPECI down into tokens:

    #!/usr/bin/perl -lw use strict; my $metar = 'METAR CCCC 210855Z AUTO dddff(f)Gfmfm(fm)KTv dndndnVdxdxd +x VVVVVSM [RDRDR/VRVRVRVRFT ] w\'w\' [NsNsNshshshs or VVhshshs T\'T\' +/T\'dT\'d APHPHPHPH'; if ( $metar =~ /^(METAR|SPECI)\s([A-Z]{4})\s(\d\d)(\d{4}Z)\s(AUTO|COR) +.*/) { my $type= $1; my $stn= $2; my $monthday = $3; my $obsTime= $4; my +$subtype= $5; print $type . " Station: $stn, Date: $monthday, ObservationTime: $ +obsTime, Subtype: $subtype"; } else { print "ww screwed up the regex."; }
    ($metar is taken straight from the ref doc, with digits substituted for YYGGgg and with the ticks escaped, as comprehensively understanding the spec would require more time than I have to give, just now.)
      The METAR spec document seems to use a fairly peculiar method of notation (at least to my eyes) to describe the various fields, most of which are optional.

      It is a very challenging pattern to match though, no doubt about that.

      One thing that does strike me is that it would appear that many parts of the METAR encoding could be greatly improved by reversing the order of some of the fragments of information.

      I had some significant difficulty with matching the visibility group "VVVVVSM". The most obvious optimisation would be to move the "SM" to the start of the group, not the end. Ho hum.

      Cheers.
Re: Speeding up a large regex match pattern
by planetscape (Chancellor) on Feb 05, 2009 at 19:31 UTC
      If the goal is to speed things up, PRD is probably not the way to go, although it could be a useful tool to help develop the grammar that will later be implemented using more efficient means.
        Thank you for the help everyone. Consider this question answered. :)