Speeding up a large regex match pattern

wossname has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Speeding up a large regex match pattern by Fletch (Bishop) on Feb 04, 2009 at 21:04 UTC
As to your question, if you're having to go through a whole bunch of hoops and nested captures and what not it's probably a sign that just regexen aren't going to be the best solution to the problem. You probably want to aim more at writing an actual parser and use simpler regexen to break the string into tokens. Also you say this is a learning exercise but be aware it's a solved problem (Geo::METAR). The cake is a lie. The cake is a lie. The cake is a lie.	[reply]
Re: Speeding up a large regex match pattern by roboticus (Chancellor) on Feb 04, 2009 at 21:04 UTC
wossname: You might find node Weather reports on the HP4350's front panel helpful... I know nothing of METAR, and I've never used Parser::RecDescent, so take this with a grain of salt... I suspect that if the METAR reports are of any significant complexity, it may be worth your while to make a parser (using the aforementioned Parser::RecDescent module). Regular expressions are great, but when things get complex enough, you should use a parser for the job. As an example, many people use simple regular expressions to do things like remove bold markers from HTML documents. But when they start doing more complex things, you frequently see suggestions to use HTML::Parser and brethren... ...roboticus	[reply]
Re: Speeding up a large regex match pattern by BrowserUk (Patriarch) on Feb 04, 2009 at 23:53 UTC
To answer the question you asked: Yes, replacing nested capturing subgroups with non-capturing parens will improve the performance of your regexes. In a rather crude test, almost exactly 1 order of magnitude improvement for each pair of capturing parens that you replace. That is `m[((?:xxx)(?:yyy))]g` runs almost exactly twice as fast as `m[((xxx)(yyy))]g`; and `m[((?:xxx)(?:yyy)(?:zzz))]g` runs almost exactly 3 times faster than `m[((xxx)(yyy)(zzz))]g;`; and so on. In that test, I was assigning all the captures to an array, which obviously results in differing numbers of values in the arrays. If you are only refering to the captures you are interested in by their aliases (`$1`, and ignoring $2 and $3 etc.), then the impact of using the capturing parens that you don't need is less (around 15%), but it is still significant. The upshot is: only capture what you need, and use non-capturing parens anywhere else. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l] [select]
Re: Speeding up a large regex match pattern by gone2015 (Deacon) on Feb 04, 2009 at 21:12 UTC
As a general proposition, avoiding using unnecessary captures is a Good Thing. Also as a general proposition, optimisation should be avoided until it becomes obvious that it is unavoidable. `use re 'debug'` can tell you a lot about what's going on -- getting a trace of what happens on some real data can be quite instructive. YAPE::Regex::Explain is reputed to be useful, but it does look a little old -- other monks may be able to advise. One thing that can really slow down regex is back-tracking. The more specific the matching elements, the less opportunity you give for backtracking, the better. Note that a floating start to a regex is an excellent way of triggering a lot of backtracking !	[reply] [d/l]
Re^2: Speeding up a large regex match pattern by wossname (Novice) on Feb 04, 2009 at 21:35 UTC
Thanks for the fast replies everyone! Yeah I realise this METAR thing has been done many times and in many ways. I am merely using it as a means to test my knowledge of regular expressions. I have no real use for a METAR processor script but the huge amount of freely available data combined with the significant complexity of the protocol make it an ideal case study for learning the ropes, as it were. I'm interested in the links that you kindly posted though. I think I'll probably finish my "hard way" METAR script first and then I'll try it again using a proper parser. That way I think I'll get the benefit from both coding the logic manually and also taking a more high level approach. Thanks again. Wosser. I'm so glad I found perlmonks.org, this place is a far more substantial knowledge base than anywhere else I've found.	[reply]
Re: Speeding up a large regex match pattern by ww (Archbishop) on Feb 04, 2009 at 21:26 UTC
You may need nested grouping toward the content based on the end of the spec, but at least the early part of a METAR should give you no concern. The following approach embodies the advice (brilliant minds come to similar solutions?) re breaking the METAR or SPECI down into tokens: `#!/usr/bin/perl -lw use strict; my $metar = 'METAR CCCC 210855Z AUTO dddff(f)Gfmfm(fm)KTv dndndnVdxdxd +x VVVVVSM [RDRDR/VRVRVRVRFT ] w\'w\' [NsNsNshshshs or VVhshshs T\'T\' +/T\'dT\'d APHPHPHPH'; if ( $metar =~ /^(METAR\|SPECI)\s([A-Z]{4})\s(\d\d)(\d{4}Z)\s(AUTO\|COR) +.*/) { my $type= $1; my $stn= $2; my $monthday = $3; my $obsTime= $4; my +$subtype= $5; print $type . " Station: $stn, Date: $monthday, ObservationTime: $ +obsTime, Subtype: $subtype"; } else { print "ww screwed up the regex."; }` [download] (`$metar` is taken straight from the ref doc, with digits substituted for YYGGgg and with the ticks escaped, as comprehensively understanding the spec would require more time than I have to give, just now.)	[reply] [d/l] [select]
Re^2: Speeding up a large regex match pattern by wossname (Novice) on Feb 04, 2009 at 21:47 UTC
The METAR spec document seems to use a fairly peculiar method of notation (at least to my eyes) to describe the various fields, most of which are optional. It is a very challenging pattern to match though, no doubt about that. One thing that does strike me is that it would appear that many parts of the METAR encoding could be greatly improved by reversing the order of some of the fragments of information. I had some significant difficulty with matching the visibility group "VVVVVSM". The most obvious optimisation would be to move the "SM" to the start of the group, not the end. Ho hum. Cheers.	[reply]
Re: Speeding up a large regex match pattern by planetscape (Chancellor) on Feb 05, 2009 at 19:31 UTC
If you decide to go the route suggested by Fletch and roboticus of writing/using a parser, you may find Parse::RecDescent useful. HTH, planetscape	[reply]
Re^2: Speeding up a large regex match pattern by ikegami (Patriarch) on Feb 05, 2009 at 19:34 UTC
If the goal is to speed things up, PRD is probably not the way to go, although it could be a useful tool to help develop the grammar that will later be implemented using more efficient means.	[reply]
Re^3: Speeding up a large regex match pattern by Anonymous Monk on Feb 06, 2009 at 12:19 UTC
Thank you for the help everyone. Consider this question answered. :)	[reply]