Re^4: Regexp for range of numbers

Replies are listed 'Best First'.
Re^5: Regexp for range of numbers by Corion (Patriarch) on Apr 05, 2005 at 20:21 UTC
The relevant perldelta briefly touches on the subject of the Trie optimization for regexes. Basically, I think, the idea is that you convert the constant part of the regex into hashes of hashes that you can then walk very quickly (if you are using C instead of Perl), like the following hash to match the numbers (11,122,233,31) (except that the following is not valid Perl syntax as the hash needs to be created step-by-step to make the internal forward references work): `my %trie; %trie = ( 1 => { 1 => 'accept', 2 => { 2 => 'accept', 3 => \%trie{2}->{3}, # since we already saw a 23 _default => \%trie, } _default => \%trie }, 2 => { 3 => { 3 => 'accept', _default => \%trie{3}, # we already saw a 3 }, _default => \%trie }, 3 => { 1 => 'accept', _default => \%trie }, _default => \%trie; ); # hash walker omitted` [download] The next step will likely be Aho-Corasick optimizations for regexes, which you can watch animated here. Updated: After a discussion with bart, I expanded the example to shed more light on the importance of links within the `%trie` structure. If a match fails for `122`, we can continue trying to match `223` and store the knowledge that we've already looked at `22` in the deep link within `%trie`.	[reply] [d/l]
Re^6: Regexp for range of numbers by demerphq (Chancellor) on Apr 06, 2005 at 08:31 UTC
The structure you have there is actually more like the DFA equvivelent of an A-C matcher. If you remove the default/back links you get a raw trie. Aho-Corasick matching actually uses an array of fail transitions, so you can think of it as the "unfilled in" version of what you have above. This disadvantage of this is that its slightly slower at run time, but the advantage is its a lot faster to construct the array than it is to fill in the trie, it also makes it easier to simulate NFA leftmost-firstoccuring behaviour instead of "first encountered" or "leftmost longest" as you know when you are doing a fail transition and not a success transition. I might add more later. There are a number of nodes on PM that discuss Trie's in detail, with implementation examples and the like. --- demerphq	[reply]