in reply to Re^3: Regexp for range of numbers
in thread Regexp for range of numbers

Very cool. So is there detailed information available somewhere about this new feature? I imagine it's doing some hash or binary tree type searches...

-Scott

Replies are listed 'Best First'.
Re^5: Regexp for range of numbers
by Corion (Patriarch) on Apr 05, 2005 at 20:21 UTC

    The relevant perldelta briefly touches on the subject of the Trie optimization for regexes. Basically, I think, the idea is that you convert the constant part of the regex into hashes of hashes that you can then walk very quickly (if you are using C instead of Perl), like the following hash to match the numbers (11,122,233,31) (except that the following is not valid Perl syntax as the hash needs to be created step-by-step to make the internal forward references work):

    my %trie; %trie = ( 1 => { 1 => 'accept', 2 => { 2 => 'accept', 3 => \%trie{2}->{3}, # since we already saw a 23 _default => \%trie, } _default => \%trie }, 2 => { 3 => { 3 => 'accept', _default => \%trie{3}, # we already saw a 3 }, _default => \%trie }, 3 => { 1 => 'accept', _default => \%trie }, _default => \%trie; ); # hash walker omitted

    The next step will likely be Aho-Corasick optimizations for regexes, which you can watch animated here.

    Updated: After a discussion with bart, I expanded the example to shed more light on the importance of links within the %trie structure. If a match fails for 122, we can continue trying to match 223 and store the knowledge that we've already looked at 22 in the deep link within %trie.

      The structure you have there is actually more like the DFA equvivelent of an A-C matcher. If you remove the default/back links you get a raw trie. Aho-Corasick matching actually uses an array of fail transitions, so you can think of it as the "unfilled in" version of what you have above. This disadvantage of this is that its slightly slower at run time, but the advantage is its a lot faster to construct the array than it is to fill in the trie, it also makes it easier to simulate NFA leftmost-firstoccuring behaviour instead of "first encountered" or "leftmost longest" as you know when you are doing a fail transition and not a success transition.

      I might add more later. There are a number of nodes on PM that discuss Trie's in detail, with implementation examples and the like.

      ---
      demerphq