The regexp compiler in Perl is an incredible thing, but it is
not always clear how it works its magic. My concern revolves
around large regexps, and how they can be optimized to improve
speed, something that I'm sure comes up a lot.
I was wondering: should I optimize my regexps by hand, or will Perl
do that for me anyway when it compiles them, which would make my
effort wasted?
So I composed a quick test to find out.
Which one of the following functionally equivalent
statements is the fastest way to find pattern matches?
0: /at/||/bt/||/ct/||/dt/||/et/||/ft/||/ght/
1: /at|bt|ct|dt|et|ft|ght/
2: /[abcdef]t|ght/
3: /(?:[abcdef]|gh)t/
4: /(?:a|b|c|d|e|f|gh)t/
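Before timing anything, it seems worth checking that the five forms really do agree. The following is only a minimal sketch: the sample words are made up for illustration, and pattern 4 is written with `gh` as its last alternative, since that is what equivalence with the other four requires.

```perl
use strict;
use warnings;

# The five candidates as code refs taking one string and returning 1 or 0.
# Pattern 4 uses "gh" as its last alternative (an assumption, see above).
my @candidates = (
    sub { local $_ = shift; (/at/ || /bt/ || /ct/ || /dt/ || /et/ || /ft/ || /ght/) ? 1 : 0 },
    sub { $_[0] =~ /at|bt|ct|dt|et|ft|ght/ ? 1 : 0 },
    sub { $_[0] =~ /[abcdef]t|ght/         ? 1 : 0 },
    sub { $_[0] =~ /(?:[abcdef]|gh)t/      ? 1 : 0 },
    sub { $_[0] =~ /(?:a|b|c|d|e|f|gh)t/   ? 1 : 0 },
);

# Hypothetical sample words, not taken from the dictionary file.
my @samples = qw(cat debt night gift tree hit gtt xyz);

for my $word (@samples) {
    my @results = map { $_->($word) } @candidates;
    my %seen;
    $seen{$_}++ for @results;
    # More than one distinct result means the patterns disagree on this word.
    my $flag = scalar(keys %seen) > 1 ? '  <-- disagreement' : '';
    printf "%-6s %s%s\n", $word, join(' ', @results), $flag;
}
```

A word such as "gtt" is in the list on purpose: it would expose a pattern that accepts "gt" where "gh" was meant.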
I was expecting that Perl would compile 1..4 to exactly
the same thing, but Benchmark shows that is not the case.
Internally, there must be substantial differences in the
way they are implemented.
The test I composed used data from /usr/dict/words in two ways.
One was to test on many small bits of data, namely the individual
words, and the other was to test on the entire file as a single string.
On small data (@test = <DICT>) the winner is 0, which
I found surprising. However, it is only 15% faster than
2 and 3, which were tied for second (not unexpectedly).
Then came 1, which was only slightly faster than 4, and the two
of them are 45%+ slower than the others. The
regexp 'or' operator sure slows things down, it would seem.
On the large dataset (@test = join (',',<DICT>)), 0 was too cumbersome to implement.
The rest came up in the same order as before: 2, 3, 1, 4. No
surprise there.
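For anyone who wants to repeat the small-data test, here is a rough sketch of how it could be set up with the Benchmark module. It is not the exact script behind the numbers above. The fallback word list, the `p0`..`p4` labels and the iteration count are all assumptions of mine, and pattern 4 again uses `gh` as its last alternative.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Path as used in this post; many systems keep it in /usr/share/dict/words.
my $dict = '/usr/dict/words';
my @test;
if (open my $fh, '<', $dict) {
    @test = <$fh>;
    close $fh;
} else {
    # Fallback so the sketch still runs: a tiny made-up word list, repeated.
    @test = map { "$_\n" } ((qw(cat debt night gift tree hit xyz)) x 2000);
}

# Each sub returns how many entries of @test match its pattern.
my %count = (
    p0 => sub { scalar(grep { /at/ || /bt/ || /ct/ || /dt/ || /et/ || /ft/ || /ght/ } @test) },
    p1 => sub { scalar(grep { /at|bt|ct|dt|et|ft|ght/ } @test) },
    p2 => sub { scalar(grep { /[abcdef]t|ght/ } @test) },
    p3 => sub { scalar(grep { /(?:[abcdef]|gh)t/ } @test) },
    p4 => sub { scalar(grep { /(?:a|b|c|d|e|f|gh)t/ } @test) },
);

# 50 passes over @test per pattern; raise this if Benchmark warns
# that there were too few iterations for a reliable count.
cmpthese(50, \%count);
```

The ordering will likely depend on the Perl version and the word list, so I would treat any table this prints as specific to the machine it ran on.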
I'm not sure if I need a more ambitious test, but I think it
is clear that the compiler doesn't do everything for you by
any means, and that the programmer still has to make an
effort to construct the most efficient regexp.
I'm curious, though, why patterns 1..4 aren't compiled the
same.
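One way to look into this, as far as I can tell, is the `re` pragma's debug mode, which dumps the compiled form of each regexp to STDERR. The sketch below only shows how to get at the dumps and makes no claim about what they contain. The `%compiled` hash and its keys are names I made up, and pattern 4 is written with `gh` as its last alternative. I believe perls from 5.10 onward also apply a trie optimization to alternations of literal strings, so the dumps, and the timings, may well differ between versions.

```perl
use strict;
use warnings;
use re 'debug';    # dump compile (and match) details to STDERR in this scope

# Compiling each pattern with qr// triggers one dump per pattern.
my %compiled = (
    1 => qr/at|bt|ct|dt|et|ft|ght/,
    2 => qr/[abcdef]t|ght/,
    3 => qr/(?:[abcdef]|gh)t/,
    4 => qr/(?:a|b|c|d|e|f|gh)t/,    # "gh" assumed, see above
);

print "compiled ", scalar(keys %compiled), " patterns; see STDERR for the dumps\n";
```

Redirecting STDERR to a file (for example `perl script.pl 2> dump.txt`, where `script.pl` is whatever the sketch is saved as) makes the four dumps easier to compare side by side.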