in reply to Re: Perl regexp matching is slow??
in thread Perl regexp matching is slow??

Here's the program I wrote to characterize regular expressions. I dumped the output of -Mre=debug after compiling everything in perl and ran this program on that output. It turns out that it decides only three regexps *require* an NFA, 750 require at least a DFA, and 325 could be done with just a plain Boyer-Moore search.

Generate your data:

find src/bleadperl -name '*.pm' -o -name '*.pl' -print0 | xargs -n1 -0 + perl -Mre=debug > some-file.txt perl this-program some-file.txt

The program:

use strict; use warnings; my %sim; @sim{qw(BOL EXACT END NOTHING SBOL EOL SEOL)} = (); my %dfa; @dfa{qw(ALNUM ANYOF BOL BRANCH BOUND CLOSE1 CLOSE2 CLOSE3 CLOSE4 CLOSE +5 CLOSE6 CLOSE7 CURLY CURLYM CURLYN CURLYX DIGIT END EOL EOS EXACT EXACTF IFMATCH MBOL MEOL MINMOD NALNUM NDIGIT NOTHING NSPACE OPEN1 OPEN2 OPEN3 OPEN4 OPEN5 OPEN6 OPEN7 PLUS REG_ANY SANY SBOL SEOL SPACE STAR SUCCEED TAIL TRIE TRIEC UNLESSM WHILEM)}=(); sub any (&@) { my $predicate = shift @_; $predicate->() and return 1 for @_; return; } local $/; my $data = <>; my ( $bm_ct, $dfa_ct, $nfa_ct ); while ($data =~ /Final program:\n((?:\s[^\n]+\n)+)/mg) { my @ops = $1 =~ / ([A-Z]\w+)/g; if ( not any { not exists $sim{$_} } @ops ) { ++ $bm_ct; } elsif ( not any { not exists $dfa{$_} } @ops ) { ++ $dfa_ct; } else { ++ $nfa_ct; } } print "NFA $nfa_ct / DFA $dfa_ct / BM $bm_ct\n";

⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

Replies are listed 'Best First'.
Re^3: Perl regexp matching is slow??
by demerphq (Chancellor) on Feb 27, 2007 at 22:02 UTC

    I think you could rework this to only include [A-Z] in the opcode, if there is a number following the [A-Z] its not important. In other words just look for OPEN not OPEN1 etc.

    Also i think that youll need to add MINMOD to the list of non DFA'able constructs, and maybe also should put OPEN in there as well since capturing with a DFA is distinctly non-trivial (actually im not even sure you can do it with a pure-DFA, maybe with NFA based DFA simulation). And why is NOTHING in both lists?

    ---
    $world=~s/war/peace/g

      NOTHING is in the lists because it showed up as an op in the word extraction. You oughta notice that what I wrote was really, really basic and was just doing existence tests for the named ops.

      ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊