If you are familiar with regular expressions, don't make the mistake of thinking that all regular expression engines are the same. They fall into three general categories:
Between POSIX and traditional NFA engines, both can take long time verifying a failed match (sometimes longer than the than the length of time the universe has existed!), but traditional NFA tends to be much faster on successful matches.
Optimizing regular expressions, however, takes a lot of work and is far beyond the scope of this post. One thing you can do is read my shameless plug. However, the issues raised in that post aren't terribly relevant to what you're doing here.
The issue you are having is the following: when dealing with choices for single character, use a character class, but don't alternate!!! The following seem equivalent, but they're not:
It is almost guaranteed that the first will run faster. The reason for this is simple: when dealing with a character class, Perl's regex engine kind of creates a "sieve" that the characters pass through. If it doesn't go through the "abc" sieve, it doesn't pass. When you alternate, as in the second example, Perl's regex engine attempts to match the 'a'. If that fails, backs up to the end of the last successful match and tries the 'b'. If that fails, it again notes the end of the last successful match and tries 'c'. In short, it's forced to mark its position and continuously retry the match, but the character class simply allows 'a', 'b', or 'c' to pass.$string =~ /[abc]/; # character class $string =~ /a|b|c/; # alternation
The alternation problem is worsened when dealing with multiple characters.
The first example will simply try to match the target strings. The second example will keep trying to match, move forward a character after the failure, try to match again, etc, etc (no, this isn't quite accurate, but I'm trying to be brief). This is almost guaranteed to be slower. However, these speed issues are more of a concern when iterating over data. If it's a single shot regex, it's usually not that big of a consideration.$string =~ /bob/ || /alice/; # Good $string =~ /(?:bob|alice)/; # Bad
Regardless of all of the drivel here, it's important to take into consideration the actual data to be used. Testing against hypothetical data that really doesn't match what you are going to use is a poor test of the efficiency of your solution (this is just an aside. I have no idea if it's directly applicable to what you're doing). However, if you create sample data that closely mimics what you want, benchmarking your data is much more likely to produce relevant results.
Cheers,
Ovid
Join the Perlmonks Setiathome Group or just click on the the link and check out our stats.
In reply to (Ovid) Re: Regexp Speed Concerns
by Ovid
in thread Regexp Speed Concerns
by tadman
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |