in reply to perl performance vs egrep

egrep will be faster than Perl here. ambrus is right on as well, this is an issue where I/O will limit you, not CPU.

In any case, if you want to use Perl, always check the regex compiler's decisions using the re pragma when speed matters.

In this case you can accelerate the regex by pulling the anchor out of the alternation. Pay attention to how the compiler only says “minlen 5” for the first version (it inferred that a string with less than 5 characters cannot possibly match), while with the second with pulled out anchor it found “anchored(BOL) minlen 5” (the regex is also anchored at the beginning of the string). The regex engine will actually apply the first pattern at every character position of a line (except the last four, where it knows there are not enough characters to satisfy the pattern), while it will only bother to apply the second regex at the beginning of the line because it knows it cannot match anywhere else.

Sometimes just a very subtle change in a pattern can trigger or defeat such optimizations, so be sure to check when you need to eke out every last ounce of speed. And then benchmark it when an optimization or lack thereof is not a sure win or loss; sometimes it is not obvious how certain decisions on the compiler's part will affect performance on different kinds of data sets.

With egrep using a DFA-based matcher on the other hand there is absolutely no difference between different ways of writing the same pattern.

$ perl -Mre=debug -e'/^CPXX1|^KLXX1|^KMXX1|^MEXX1|^PAXX1|^PMXX1|^SLXX1 +|^SZXX1|^WXXX1|^YZXX1/' Compiling REx `^CPXX1|^KLXX1|^KMXX1|^MEXX1|^PAXX1|^PMXX1|^SLXX1|^SZXX1 +|^WXXX1|^YZXX1' size 51 Got 412 bytes for offset annotations. 1: BRANCH(6) 2: BOL(3) 3: EXACT <CPXX1>(51) 6: BRANCH(11) 7: BOL(8) 8: EXACT <KLXX1>(51) 11: BRANCH(16) 12: BOL(13) 13: EXACT <KMXX1>(51) 16: BRANCH(21) 17: BOL(18) 18: EXACT <MEXX1>(51) 21: BRANCH(26) 22: BOL(23) 23: EXACT <PAXX1>(51) 26: BRANCH(31) 27: BOL(28) 28: EXACT <PMXX1>(51) 31: BRANCH(36) 32: BOL(33) 33: EXACT <SLXX1>(51) 36: BRANCH(41) 37: BOL(38) 38: EXACT <SZXX1>(51) 41: BRANCH(46) 42: BOL(43) 43: EXACT <WXXX1>(51) 46: BRANCH(51) 47: BOL(48) 48: EXACT <YZXX1>(51) 51: END(0) minlen 5 Offsets: [51] 0[0] 1[1] 2[5] 0[0] 0[0] 7[1] 8[1] 9[5] 0[0] 0[0] 14[1] 15[1] 16[5 +] 0[0] 0[0] 21[1] 22[1] 23[5] 0[0] 0[0] 28[1] 29[1] 30[5] 0[0] 0[0] 3 +5[1] 36[1] 37[5] 0[0] 0[0] 42[1] 43[1] 44[5] 0[0] 0[0] 49[1] 50[1] 51 +[5] 0[0] 0[0] 56[1] 57[1] 58[5] 0[0] 0[0] 63[1] 64[1] 65[5] 0[0] 0[0] + 70[0] Freeing REx: `"^CPXX1|^KLXX1|^KMXX1|^MEXX1|^PAXX1|^PMXX1|^SLXX1|^SZXX1 +|^WX"......' $ perl -Mre=debug -e'/^(?:CPXX1|KLXX1|KMXX1|MEXX1|PAXX1|PMXX1|SLXX1|SZ +XX1|WXXX1|YZXX1)/' Compiling REx `^(?:CPXX1|KLXX1|KMXX1|MEXX1|PAXX1|PMXX1|SLXX1|SZXX1|WXX +X1|YZXX1)' size 43 Got 348 bytes for offset annotations. first at 2 1: BOL(2) 2: BRANCH(6) 3: EXACT <CPXX1>(43) 6: BRANCH(10) 7: EXACT <KLXX1>(43) 10: BRANCH(14) 11: EXACT <KMXX1>(43) 14: BRANCH(18) 15: EXACT <MEXX1>(43) 18: BRANCH(22) 19: EXACT <PAXX1>(43) 22: BRANCH(26) 23: EXACT <PMXX1>(43) 26: BRANCH(30) 27: EXACT <SLXX1>(43) 30: BRANCH(34) 31: EXACT <SZXX1>(43) 34: BRANCH(38) 35: EXACT <WXXX1>(43) 38: BRANCH(42) 39: EXACT <YZXX1>(43) 42: TAIL(43) 43: END(0) anchored(BOL) minlen 5 Offsets: [43] 1[1] 4[1] 5[5] 0[0] 0[0] 10[1] 11[5] 0[0] 0[0] 16[1] 17[5] 0[0] 0[ +0] 22[1] 23[5] 0[0] 0[0] 28[1] 29[5] 0[0] 0[0] 34[1] 35[5] 0[0] 0[0] +40[1] 41[5] 0[0] 0[0] 46[1] 47[5] 0[0] 0[0] 52[1] 53[5] 0[0] 0[0] 58[ +1] 59[5] 0[0] 0[0] 63[0] 65[0] Freeing REx: `"^(?:CPXX1|KLXX1|KMXX1|MEXX1|PAXX1|PMXX1|SLXX1|SZXX1|WXX +X1|Y"......'

Makeshifts last the longest.

Replies are listed 'Best First'.
Re^2: perl performance vs egrep
by Random_Walk (Prior) on Jan 24, 2005 at 14:43 UTC

    This version looks to give the optimiser even more hints anchored `XX1' at 2 (checking anchored) anchored(BOL). Now if only we had some test data to try the Sys::Mmap optimisation and demerphq's++ nice inversion of the problem at 424532

    nph>perl -Mre=debug -e'/^(CP|K[LM]|ME|P[AM]|S[LZ]|WX|YZ)XX1/' Freeing REx: `","' Compiling REx `^(CP|K[LM]|ME|P[AM]|S[LZ]|WX|YZ)XX1' size 62 Got 500 bytes for offset annotations. first at 2 1: BOL(2) 2: OPEN1(4) 4: BRANCH(7) 5: EXACT <CP>(58) 7: BRANCH(21) 8: EXACT <K>(10) 10: ANYOF[LM](58) 21: BRANCH(24) 22: EXACT <ME>(58) 24: BRANCH(38) 25: EXACT <P>(27) 27: ANYOF[AM](58) 38: BRANCH(52) 39: EXACT <S>(41) 41: ANYOF[LZ](58) 52: BRANCH(55) 53: EXACT <WX>(58) 55: BRANCH(58) 56: EXACT <YZ>(58) 58: CLOSE1(60) 60: EXACT <XX1>(62) 62: END(0) anchored `XX1' at 2 (checking anchored) anchored(BOL) minlen 5 Offsets: [62] 1[1] 2[1] 0[0] 2[1] 3[2] 0[0] 5[1] 6[1] 0[0] 7[4] 0[0] 0[0] 0[ +0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 11[1] 12[2] 0[0] 14[1] 15[1] 0[ +0] 16[4] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 20[1] 21[1 +] 0[0] 22[4] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 26[1] +27[2] 0[0] 29[1] 30[2] 0[0] 32[1] 0[0] 33[3] 0[0] 36[0] Freeing REx: `"^(CP|K[LM]|ME|P[AM]|S[LZ]|WX|YZ)XX1"'

    update

    as the OP says using egrep is a valid solution he must not need to capture any parts of the match so lets make this non-capturing and revist the debug, this time with added interactivity.
    nph>perl -Mre=debug -ne'/^(?:CP|K[LM]|ME|P[AM]|S[LZ]|WX|YZ)XX1/;print" +>"' Freeing REx: `","' Compiling REx `^(?:CP|K[LM]|ME|P[AM]|S[LZ]|WX|YZ)XX1' size 59 Got 476 bytes for offset annotations. first at 2 1: BOL(2) 2: BRANCH(5) 3: EXACT <CP>(57) 5: BRANCH(19) 6: EXACT <K>(8) 8: ANYOF[LM](57) 19: BRANCH(22) 20: EXACT <ME>(57) 22: BRANCH(36) 23: EXACT <P>(25) 25: ANYOF[AM](57) 36: BRANCH(50) 37: EXACT <S>(39) 39: ANYOF[LZ](57) 50: BRANCH(53) 51: EXACT <WX>(57) 53: BRANCH(56) 54: EXACT <YZ>(57) 56: TAIL(57) 57: EXACT <XX1>(59) 59: END(0) anchored `XX1' at 2 (checking anchored) anchored(BOL) minlen 5 Offsets: [59] 1[1] 4[1] 5[2] 0[0] 7[1] 8[1] 0[0] 9[4] 0[0] 0[0] 0[0] 0[0] 0[ +0] 0[0] 0[0] 0[0] 0[0] 0[0] 13[1] 14[2] 0[0] 16[1] 17[1] 0[0] 18[4] 0 +[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 22[1] 23[1] 0[0] 24[ +4] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 28[1] 29[2] 0[0] + 31[1] 32[2] 0[0] 33[0] 35[3] 0[0] 38[0] >SLXC1 Guessing start of match, REx `^(?:CP|K[LM]|ME|P[AM]|S[LZ]|WX|YZ)XX1' a +gainst `SLXC1 '... String not equal... Match rejected by optimizer >

    If you knew the frequency of each of the pre-fixes you would optimise by putting the most common first, or perhaps if there were a vast number of lines with something you certainly did not want (e.g. RWXX1) then you may even add a [^R] in there right at the start.

    Cheers,
    R.

    Pereant, qui ante nos nostra dixerunt!