(Ovid) Re: Regexp Speed Concerns

As has been commented previously, you should read Mastering Regular Expressions. It is the definitive work on regexes and has no peer. Unfortunately, it is rather out of date, but the author, Jeffrey Friedl is writing a third edition which I understand should be out shortly.

If you are familiar with regular expressions, don't make the mistake of thinking that all regular expression engines are the same. They fall into three general categories:

DFA. Fast, but does not allow back references (e.g. capture to $1). I understand that someone is writing a DFA engine that can do backreferences, but I'm too lazy to look up the reference right now :)

Traditional NFA. This is the regex engine that Perl uses and, amongst "those in the know", is often considered the best. It's fairly fast and allows backreferences.

POSIX NFA. Every once in a while, a standards body comes out with a poor standard. This is one of them. POSIX NFA usually tries for the "longest" match and, as such, can take an incredible amount of time with some regular expressions. Traditional NFA usually goes for the first match, thus allowing greater speed.

Between POSIX and traditional NFA engines, both can take long time verifying a failed match (sometimes longer than the than the length of time the universe has existed!), but traditional NFA tends to be much faster on successful matches.

Optimizing regular expressions, however, takes a lot of work and is far beyond the scope of this post. One thing you can do is read my shameless plug. However, the issues raised in that post aren't terribly relevant to what you're doing here.

The issue you are having is the following: when dealing with choices for single character, use a character class, but don't alternate!!! The following seem equivalent, but they're not:

$string =~ /[abc]/; # character class
$string =~ /a|b|c/; # alternation
[download]

It is almost guaranteed that the first will run faster. The reason for this is simple: when dealing with a character class, Perl's regex engine kind of creates a "sieve" that the characters pass through. If it doesn't go through the "abc" sieve, it doesn't pass. When you alternate, as in the second example, Perl's regex engine attempts to match the 'a'. If that fails, backs up to the end of the last successful match and tries the 'b'. If that fails, it again notes the end of the last successful match and tries 'c'. In short, it's forced to mark its position and continuously retry the match, but the character class simply allows 'a', 'b', or 'c' to pass.

The alternation problem is worsened when dealing with multiple characters.

$string =~ /bob/ || /alice/; # Good
$string =~ /(?:bob|alice)/;  # Bad
[download]

The first example will simply try to match the target strings. The second example will keep trying to match, move forward a character after the failure, try to match again, etc, etc (no, this isn't quite accurate, but I'm trying to be brief). This is almost guaranteed to be slower. However, these speed issues are more of a concern when iterating over data. If it's a single shot regex, it's usually not that big of a consideration.

Regardless of all of the drivel here, it's important to take into consideration the actual data to be used. Testing against hypothetical data that really doesn't match what you are going to use is a poor test of the efficiency of your solution (this is just an aside. I have no idea if it's directly applicable to what you're doing). However, if you create sample data that closely mimics what you want, benchmarking your data is much more likely to produce relevant results.

Cheers,
Ovid

Join the Perlmonks Setiathome Group or just click on the the link and check out our stats.

Comment on (Ovid) Re: Regexp Speed Concerns Select or Download Code