in reply to perl performance vs egrep
egrep will be faster than Perl here. ambrus is right on as well, this is an issue where I/O will limit you, not CPU.
In any case, if you want to use Perl, always check the regex compiler's decisions using the re pragma when speed matters.
In this case you can accelerate the regex by pulling the anchor out of the alternation. Pay attention to how the compiler only says “minlen 5” for the first version (it inferred that a string with less than 5 characters cannot possibly match), while with the second with pulled out anchor it found “anchored(BOL) minlen 5” (the regex is also anchored at the beginning of the string). The regex engine will actually apply the first pattern at every character position of a line (except the last four, where it knows there are not enough characters to satisfy the pattern), while it will only bother to apply the second regex at the beginning of the line because it knows it cannot match anywhere else.
Sometimes just a very subtle change in a pattern can trigger or defeat such optimizations, so be sure to check when you need to eke out every last ounce of speed. And then benchmark it when an optimization or lack thereof is not a sure win or loss; sometimes it is not obvious how certain decisions on the compiler's part will affect performance on different kinds of data sets.
With egrep using a DFA-based matcher on the other hand there is absolutely no difference between different ways of writing the same pattern.
$ perl -Mre=debug -e'/^CPXX1|^KLXX1|^KMXX1|^MEXX1|^PAXX1|^PMXX1|^SLXX1 +|^SZXX1|^WXXX1|^YZXX1/' Compiling REx `^CPXX1|^KLXX1|^KMXX1|^MEXX1|^PAXX1|^PMXX1|^SLXX1|^SZXX1 +|^WXXX1|^YZXX1' size 51 Got 412 bytes for offset annotations. 1: BRANCH(6) 2: BOL(3) 3: EXACT <CPXX1>(51) 6: BRANCH(11) 7: BOL(8) 8: EXACT <KLXX1>(51) 11: BRANCH(16) 12: BOL(13) 13: EXACT <KMXX1>(51) 16: BRANCH(21) 17: BOL(18) 18: EXACT <MEXX1>(51) 21: BRANCH(26) 22: BOL(23) 23: EXACT <PAXX1>(51) 26: BRANCH(31) 27: BOL(28) 28: EXACT <PMXX1>(51) 31: BRANCH(36) 32: BOL(33) 33: EXACT <SLXX1>(51) 36: BRANCH(41) 37: BOL(38) 38: EXACT <SZXX1>(51) 41: BRANCH(46) 42: BOL(43) 43: EXACT <WXXX1>(51) 46: BRANCH(51) 47: BOL(48) 48: EXACT <YZXX1>(51) 51: END(0) minlen 5 Offsets: [51] 0[0] 1[1] 2[5] 0[0] 0[0] 7[1] 8[1] 9[5] 0[0] 0[0] 14[1] 15[1] 16[5 +] 0[0] 0[0] 21[1] 22[1] 23[5] 0[0] 0[0] 28[1] 29[1] 30[5] 0[0] 0[0] 3 +5[1] 36[1] 37[5] 0[0] 0[0] 42[1] 43[1] 44[5] 0[0] 0[0] 49[1] 50[1] 51 +[5] 0[0] 0[0] 56[1] 57[1] 58[5] 0[0] 0[0] 63[1] 64[1] 65[5] 0[0] 0[0] + 70[0] Freeing REx: `"^CPXX1|^KLXX1|^KMXX1|^MEXX1|^PAXX1|^PMXX1|^SLXX1|^SZXX1 +|^WX"......' $ perl -Mre=debug -e'/^(?:CPXX1|KLXX1|KMXX1|MEXX1|PAXX1|PMXX1|SLXX1|SZ +XX1|WXXX1|YZXX1)/' Compiling REx `^(?:CPXX1|KLXX1|KMXX1|MEXX1|PAXX1|PMXX1|SLXX1|SZXX1|WXX +X1|YZXX1)' size 43 Got 348 bytes for offset annotations. first at 2 1: BOL(2) 2: BRANCH(6) 3: EXACT <CPXX1>(43) 6: BRANCH(10) 7: EXACT <KLXX1>(43) 10: BRANCH(14) 11: EXACT <KMXX1>(43) 14: BRANCH(18) 15: EXACT <MEXX1>(43) 18: BRANCH(22) 19: EXACT <PAXX1>(43) 22: BRANCH(26) 23: EXACT <PMXX1>(43) 26: BRANCH(30) 27: EXACT <SLXX1>(43) 30: BRANCH(34) 31: EXACT <SZXX1>(43) 34: BRANCH(38) 35: EXACT <WXXX1>(43) 38: BRANCH(42) 39: EXACT <YZXX1>(43) 42: TAIL(43) 43: END(0) anchored(BOL) minlen 5 Offsets: [43] 1[1] 4[1] 5[5] 0[0] 0[0] 10[1] 11[5] 0[0] 0[0] 16[1] 17[5] 0[0] 0[ +0] 22[1] 23[5] 0[0] 0[0] 28[1] 29[5] 0[0] 0[0] 34[1] 35[5] 0[0] 0[0] +40[1] 41[5] 0[0] 0[0] 46[1] 47[5] 0[0] 0[0] 52[1] 53[5] 0[0] 0[0] 58[ +1] 59[5] 0[0] 0[0] 63[0] 65[0] Freeing REx: `"^(?:CPXX1|KLXX1|KMXX1|MEXX1|PAXX1|PMXX1|SLXX1|SZXX1|WXX +X1|Y"......'
Makeshifts last the longest.
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re^2: perl performance vs egrep
by Random_Walk (Prior) on Jan 24, 2005 at 14:43 UTC |