in reply to Perl regexp matching is slow??

I ran a quick program to sort the regular expressions used in core perl into things that could be turned into DFAs and the ones that can't (anything with back references). There are 1075 transformable regexps and three non-transformable regexps. Potentially we could benefit by using a DFA engine when the regexp doesn't require back references. This has the drawback that the behaviour is going to change. There are usually many potential results from matching a regexp against a string but because of the expected order of execution we get the one we desire. With a DFA, the results that are returned will come out in a different order. This is something that potentially you'd want to indicate is acceptable with a pragma.

If you'd like to read more about this topic see Practical Parsing (an excellent and readable text) and also "The Dragon Book".

⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

Replies are listed 'Best First'.
Re^2: Perl regexp matching is slow??
by TimToady (Parson) on Jan 30, 2007 at 21:21 UTC
    You really want finer-grained control than a pragma can provide easily. Perl 6 regexes currently distinguish || (sequential semantics) from | (dfa semantics, at least up till "longest token" point). This is meant to be reminiscent of the distinction between short-circuit and junctional logic in the rest of the language.
Re^2: Perl regexp matching is slow??
by diotalevi (Canon) on Jan 31, 2007 at 04:42 UTC

    Here's the program I wrote to characterize regular expressions. I dumped the output of -Mre=debug after compiling everything in perl and ran this program on that output. It turns out that it decides only three regexps *require* an NFA, 750 require at least a DFA, and 325 could be done with just a plain Boyer-Moore search.

    Generate your data:

    find src/bleadperl -name '*.pm' -o -name '*.pl' -print0 | xargs -n1 -0 + perl -Mre=debug > some-file.txt perl this-program some-file.txt

    The program:

    use strict; use warnings; my %sim; @sim{qw(BOL EXACT END NOTHING SBOL EOL SEOL)} = (); my %dfa; @dfa{qw(ALNUM ANYOF BOL BRANCH BOUND CLOSE1 CLOSE2 CLOSE3 CLOSE4 CLOSE +5 CLOSE6 CLOSE7 CURLY CURLYM CURLYN CURLYX DIGIT END EOL EOS EXACT EXACTF IFMATCH MBOL MEOL MINMOD NALNUM NDIGIT NOTHING NSPACE OPEN1 OPEN2 OPEN3 OPEN4 OPEN5 OPEN6 OPEN7 PLUS REG_ANY SANY SBOL SEOL SPACE STAR SUCCEED TAIL TRIE TRIEC UNLESSM WHILEM)}=(); sub any (&@) { my $predicate = shift @_; $predicate->() and return 1 for @_; return; } local $/; my $data = <>; my ( $bm_ct, $dfa_ct, $nfa_ct ); while ($data =~ /Final program:\n((?:\s[^\n]+\n)+)/mg) { my @ops = $1 =~ / ([A-Z]\w+)/g; if ( not any { not exists $sim{$_} } @ops ) { ++ $bm_ct; } elsif ( not any { not exists $dfa{$_} } @ops ) { ++ $dfa_ct; } else { ++ $nfa_ct; } } print "NFA $nfa_ct / DFA $dfa_ct / BM $bm_ct\n";

    ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

      I think you could rework this to only include [A-Z] in the opcode, if there is a number following the [A-Z] its not important. In other words just look for OPEN not OPEN1 etc.

      Also i think that youll need to add MINMOD to the list of non DFA'able constructs, and maybe also should put OPEN in there as well since capturing with a DFA is distinctly non-trivial (actually im not even sure you can do it with a pure-DFA, maybe with NFA based DFA simulation). And why is NOTHING in both lists?

      ---
      $world=~s/war/peace/g

        NOTHING is in the lists because it showed up as an op in the word extraction. You oughta notice that what I wrote was really, really basic and was just doing existence tests for the named ops.

        ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊