in reply to Re: Benchmarking regex alternation
in thread Benchmarking regex alternation
One would have to benchmark to be sure, but I think that for two plain text strings the separate matches /foo/ or /bar/ are going to be faster, with longer strings being more efficient still. But as you scale up the number of alternations there will come a point where the trie/Aho-Corasick optimisation in perl 5.10 should win. In all honesty I did do some benchmarking to see where alternation starts to win, and I was surprised at how many fixed strings you need: around 20 four-letter words in a 500 byte string made up of lower case letters. I need to look into why that is and see if it can be improved.
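Roughly the kind of comparison I mean (a quick sketch with randomly generated data, not the exact harness I used):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# 20 four-letter "words" searched for in a 500 byte haystack of
# lower case letters, as in the test described above.
my @words    = map { join '', map { ('a'..'z')[rand 26] } 1 .. 4 } 1 .. 20;
my $haystack = join '', map { ('a'..'z')[rand 26] } 1 .. 500;

# one big alternation -- the trie/Aho-Corasick case in perl >= 5.10
my $alt = do { my $p = join '|', map quotemeta($_), @words; qr/$p/ };

# one fixed-string pattern per word -- each gets the plain
# literal-string optimisation instead
my @single = map { qr/\Q$_\E/ } @words;

cmpthese( -3, {
    alternation => sub { my $hit = $haystack =~ $alt },
    separate    => sub {
        my $hit;
        for my $re (@single) {
            if ( $haystack =~ $re ) { $hit = 1; last }
        }
    },
} );
```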
And there shouldn't really be much of a scaling factor as you add more words to the trie. In practice there is, due to issues like cache pollution and the like. To be honest, IMO the trie code is not running as fast as it could. I've been thinking that I made some bad design decisions which, if changed, could yield a significant improvement in performance at the expense of some memory. Nevertheless, the difference in how the old engine and the new engine handle alternation can be quite dramatic. I haven't made anything worse as far as I can tell, and in most cases it's measurably better.
Update: corion asked me to expand on what I meant by bad design decisions. So here goes: one key issue is that I used a compression algorithm to represent the transition table of the trie. The idea is that it gives constant time lookup without the waste of a plain state transition table, which in the case of a trie is normally quite high, often nearly the worst case of length(@w) * @a elements (@w being the words, and @a being the alphabet). Unfortunately this compression raises the cost per lookup quite a bit, and for small alternations (say a few short sequences) the end result is quite a hit in per operation cost for what is in the end a negligible space saving. I should have provided an uncompressed form for small tables and avoided the per operation cost. In most cases it wouldn't make a difference in terms of memory, but it would show a nice speedup.
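A toy sketch of the space argument (nothing like the real engine internals, just the arithmetic): a plain table needs a full alphabet-width row per state, while the trie only ever uses one transition per edge, and fewer still once prefixes are shared.

```perl
use strict;
use warnings;

my @w = qw(foo food fool bar bard);   # example word list
my @a = ('a'..'z');                   # the alphabet

# build the trie, counting states as they are created
my %trie;
my $states = 1;                       # the root state
for my $word (@w) {
    my $node = \%trie;
    for my $ch ( split //, $word ) {
        $node->{$ch} ||= do { $states++; {} };
        $node = $node->{$ch};
    }
}

my $full = $states * @a;    # plain table: one row of @a cells per state
my $used = $states - 1;     # actual transitions: one per non-root state
printf "states: %d, plain table: %d cells, used transitions: %d\n",
    $states, $full, $used;
```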
The other optimisation concerns character set issues and Unicode. It's easy to get into the mode of thinking of Unicode as codepoints, and I'm not so sure that I should have. The consequence of doing so is that you can end up with an extremely large alphabet. What I should have done is treat UTF-8 as a byte stream and do some monkey business to deal with non-Unicode high bit characters. That would have made the per character cost cheaper for Unicode, simplified the logic in the matching code (with a resulting improvement in general run time), and reduced the need for table compression when dealing with large character sets, since the internal alphabet size would never exceed 256.
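A small illustration of the alphabet-size difference (again just a sketch, not engine code): per codepoint the alphabet is potentially huge, per UTF-8 byte it can never exceed 256 symbols.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $str = "caf\x{e9} \x{263a}";    # "café ☺"

# codepoint view: one symbol per character, values can be very large
my @codepoints = map { ord } split //, $str;

# byte view: more symbols, but every value fits in 0..255
my @bytes = unpack 'C*', encode( 'UTF-8', $str );

printf "codepoints: %d symbols, max value %d\n",
    scalar @codepoints, ( sort { $a <=> $b } @codepoints )[-1];
printf "bytes:      %d symbols, max value %d\n",
    scalar @bytes, ( sort { $a <=> $b } @bytes )[-1];
```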