The regexp compiler in Perl is an incredible thing, but it is not always clear how it works its magic. My concern revolves around large regexps, and how they can be optimized to improve speed, something that I'm sure comes up a lot.

I was wondering: should I optimize my regexps by hand, or will Perl do that for me anyway when it compiles them, making my effort wasted?

So I composed a quick test to find out.

Which one of the following functionally equivalent statements is the fastest way to find pattern matches?
0: /at/||/bt/||/ct/||/dt/||/et/||/ft/||/ght/
1: /at|bt|ct|dt|et|ft|ght/
2: /[abcdef]t|ght/
3: /(?:[abcdef]|gh)t/
4: /(?:a|b|c|d|e|f|gh)t/
I was expecting that Perl would compile 1..4 to exactly the same thing, but Benchmark shows that is not the case. Internally, there must be substantial differences in the way they are implemented.
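Before timing anything, it is worth confirming that the five forms really do match the same strings. The following is a minimal sketch, not the original test: the word list and the `matched_words` helper are made up for illustration, and pattern 4 is written with `gh` so that it matches "ght" like the others.

```perl
use strict;
use warnings;

# Hypothetical sample words; the original test read /usr/dict/words instead.
my @words = qw(cat bat debt left eight night dog tree gtt at sighted);

# One matcher per pattern, keyed by the numbers used in the post.
my %matcher = (
    0 => sub { local $_ = shift; /at/||/bt/||/ct/||/dt/||/et/||/ft/||/ght/ },
    1 => sub { $_[0] =~ /at|bt|ct|dt|et|ft|ght/ },
    2 => sub { $_[0] =~ /[abcdef]t|ght/ },
    3 => sub { $_[0] =~ /(?:[abcdef]|gh)t/ },
    4 => sub { $_[0] =~ /(?:a|b|c|d|e|f|gh)t/ },
);

# Return the matching words for one pattern as a comma-separated string.
sub matched_words {
    my ($id) = @_;
    return join ',', grep { $matcher{$id}->($_) } @words;
}

# Compare every pattern against pattern 0 and report any disagreement.
my $reference = matched_words(0);
for my $id (sort keys %matcher) {
    my $got = matched_words($id);
    printf "%s: %s\n", $id, ($got eq $reference ? 'same as 0' : "DIFFERS: $got");
}
```

A small list like this only catches gross differences; running the same comparison over the full dictionary would be a stronger check.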

The test used data from /usr/dict/words in two ways: once on many small strings (the individual words), and once on the entire file as a single string.

On small data (@test = <DICT>) the winner is 0, which I found surprising. However, it is only 15% faster than 2 and 3, which were tied for second (not unexpectedly). Then came 1, which was only slightly faster than 4, and the two of them are 45%+ slower than the others. The regexp 'or' operator sure slows things down, it would seem.

On the large dataset (@test = join (',',<DICT>)), 0 was too cumbersome to implement. The rest came in the same order as before: 2, 3, 1, 4. No surprise there.
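For anyone who wants to repeat the small-data comparison, here is a rough sketch using `cmpthese` from the standard Benchmark module. It is not the original test script: the fallback word list and the `%count` hash are assumptions for illustration, pattern 4 uses `gh`, and the timings will depend on the Perl version and the machine.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Path taken from the post; on many systems the list lives at
# /usr/share/dict/words instead. The fallback list is made up for illustration.
my $dict = '/usr/dict/words';
my @test;
if (open my $fh, '<', $dict) {
    chomp(@test = <$fh>);
    close $fh;
}
else {
    @test = (qw(cat bat debt left eight night dog tree)) x 1000;
}

# Each sub counts how many entries in @test match its pattern.
my %count = (
    0 => sub { scalar grep { /at/||/bt/||/ct/||/dt/||/et/||/ft/||/ght/ } @test },
    1 => sub { scalar grep { /at|bt|ct|dt|et|ft|ght/ } @test },
    2 => sub { scalar grep { /[abcdef]t|ght/ } @test },
    3 => sub { scalar grep { /(?:[abcdef]|gh)t/ } @test },
    4 => sub { scalar grep { /(?:a|b|c|d|e|f|gh)t/ } @test },
);

# Run each sub for at least one CPU second and print a comparison table.
cmpthese(-1, \%count);
```

The large-dataset variant would join the words into one string and count matches with a `/g` match in list context; as in the post, form 0 does not translate to that case.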

I'm not sure if I need a more ambitious test, but I think it is clear that the compiler doesn't do everything for you by any means, and that the programmer still has to make an effort to construct the most efficient regexp.

I'm curious, though, why patterns 1..4 aren't compiled the same.
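One way to look into this is the `re` pragma's debug mode, which dumps the compiled form of each regexp to STDERR. The sketch below only shows how to get the dumps for two of the patterns; the variable names are illustrative, and what the dumps contain depends on the Perl version (releases from 5.10 onward, for instance, added a trie optimization for alternations of literal strings).

```perl
use strict;
use warnings;

my ($alternation, $char_class);
{
    # Lexically scoped: regexps compiled inside this block are dumped to STDERR.
    use re 'debug';
    $alternation = qr/at|bt|ct|dt|et|ft|ght/;   # pattern 1
    $char_class  = qr/[abcdef]t|ght/;           # pattern 2
}

# Both compiled forms should still agree on what they match.
for my $word (qw(cat night dog)) {
    my $a = $word =~ $alternation ? 1 : 0;
    my $c = $word =~ $char_class  ? 1 : 0;
    print "$word: alternation=$a char_class=$c\n";
}
```

Comparing the two dumps side by side shows which nodes each pattern compiles to, which is a more direct answer than timing alone.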

In reply to Regexp Speed Concerns by tadman
