I don't think that index must
be slower than a regex. It seems to me that it should be
much faster as it should be a much simpler function, and regexen don't do
bulk alternation very quickly either. Of course you may be able to represent 100 simple
text strings in fewer than 50 sub-expressions in a hand-optimized
regex, so it all depends on how much work you want to have to
put into this, and how often you're going to change the list.
Perhaps a benchmark is in order here. Now I
expect the relative speeds will vary quite a bit as the size of
the string to search, the number of words in the match list, and
the frequency of matches changes, but this may tell us something
nonetheless (or not ;-).
It looks like index is the winner in this simple test, but I trust
that wiser monks will let me know if and how I'm producing bad results!
First results when there is no match found:
Benchmark: running BigRE, Index, OptRE, m// loop, each for at least 10
+ CPU seconds...
BigRE: 11 wallclock secs (10.56 usr + 0.00 sys = 10.56 CPU) @ 11
+690.90/s (n=123491)
Index: 11 wallclock secs (10.61 usr + 0.00 sys = 10.61 CPU) @ 43
+921.29/s (n=465961)
OptRE: 11 wallclock secs (10.59 usr + 0.00 sys = 10.59 CPU) @ 17
+239.59/s (n=182619)
m// loop: 11 wallclock secs (10.50 usr + 0.00 sys = 10.50 CPU) @ 68
+25.71/s (n=71670)
Then results when there is always a match found:
Benchmark: running BigRE, Index, OptRE, m// loop, each for at least 10
+ CPU seconds...
BigRE: 11 wallclock secs (10.39 usr + 0.00 sys = 10.39 CPU) @ 12
+548.26/s (n=130389)
Index: 10 wallclock secs (10.53 usr + 0.00 sys = 10.53 CPU) @ 56
+956.89/s (n=599813)
OptRE: 11 wallclock secs (10.53 usr + 0.00 sys = 10.53 CPU) @ 17
+720.63/s (n=186616)
m// loop: 11 wallclock secs (10.48 usr + 0.00 sys = 10.48 CPU) @ 78
+73.90/s (n=82550)
Update: Damn, I knew I was missing something (probably still am...) - thanks dws! I
don't keep spam around so here's the results for a chunk of
12k of text-only e-mail (as spam tends to be long-winded). I skipped the hand-optimized
regex as I used nonsense words in the next case to ensure no match would be found and I didn't
feel like thinking about it too hard. In both the following cases there are still
just 7 words in the match list.
So, for no match:
Benchmark: running BigRE, Index, OptRE, m// loop, each for at least 10
+ CPU seconds...
BigRE: 10 wallclock secs (10.59 usr + 0.00 sys = 10.59 CPU) @ 77
+.59/s (n=822)
Index: 11 wallclock secs (10.50 usr + 0.00 sys = 10.50 CPU) @ 14
+27.58/s (n=14991)
m// loop: 11 wallclock secs (10.47 usr + 0.00 sys = 10.47 CPU) @ 68
+6.50/s (n=7187)
With matches:
Benchmark: running BigRE, Index, OptRE, m// loop, each for at least 10
+ CPU seconds...
BigRE: 10 wallclock secs (10.63 usr + 0.00 sys = 10.63 CPU) @ 20
+25.69/s (n=21523)
Index: 11 wallclock secs (10.47 usr + 0.00 sys = 10.47 CPU) @ 19
+019.11/s (n=199092)
OptRE: 11 wallclock secs (10.51 usr + 0.00 sys = 10.51 CPU) @ 27
+02.33/s (n=28415)
m// loop: 11 wallclock secs (10.45 usr + 0.00 sys = 10.45 CPU) @ 95
+93.84/s (n=100294)
And the test code: |