
Here's the program I wrote to characterize regular expressions. I compiled everything in the perl source tree under -Mre=debug, dumped the output to a file, and ran this program on that file. It decides that only three regexps *require* an NFA, 750 need at least a DFA, and 325 could be handled with a plain Boyer-Moore search.

Generate your data:

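# -c compiles each file without running it; re=debug writes to STDERR, hence 2>&1.
# The \( ... \) grouping makes -print0 apply to both the *.pm and *.pl matches.
find src/bleadperl \( -name '*.pm' -o -name '*.pl' \) -print0 | xargs -0 -n1 perl -c -Mre=debug > some-file.txt 2>&1
perl this-program some-file.txt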

The program:

use strict;
use warnings;

my %sim;
@sim{qw(BOL EXACT END NOTHING SBOL EOL SEOL)} = ();

my %dfa;
@dfa{qw(ALNUM ANYOF BOL BRANCH BOUND CLOSE1 CLOSE2 CLOSE3 CLOSE4 CLOSE5
CLOSE6 CLOSE7 CURLY CURLYM CURLYN CURLYX DIGIT END EOL EOS EXACT EXACTF
IFMATCH MBOL MEOL MINMOD NALNUM NDIGIT NOTHING NSPACE OPEN1 OPEN2
OPEN3 OPEN4 OPEN5 OPEN6 OPEN7 PLUS REG_ANY SANY SBOL SEOL SPACE STAR
SUCCEED TAIL TRIE TRIEC UNLESSM WHILEM)} = ();

sub any (&@) {
    my $predicate = shift @_;
    $predicate->() and return 1 for @_;
    return;
}

local $/;
my $data = <>;

# Start the counters at zero so the final print never warns about undef.
my ( $bm_ct, $dfa_ct, $nfa_ct ) = ( 0, 0, 0 );
while ($data =~ /Final program:\n((?:\s[^\n]+\n)+)/mg) {
    my @ops = $1 =~ / ([A-Z]\w+)/g;

    if ( not any { not exists $sim{$_} } @ops ) {
        ++ $bm_ct;
    }
    elsif ( not any { not exists $dfa{$_} } @ops ) {
        ++ $dfa_ct;
    }
    else {
        ++ $nfa_ct;
    }
}

print "NFA $nfa_ct / DFA $dfa_ct / BM $bm_ct\n";
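
To make the three buckets concrete, here is a minimal sketch of the same classification run on a few hand-written stand-ins for "Final program:" blocks. The sample text is my own approximation of what -Mre=debug prints, not captured output, and the op sets are abbreviated for illustration; the classify helper is a hypothetical name and not part of the program above.

```perl
use strict;
use warnings;

# Abbreviated op sets, for illustration only; the full lists are in the
# program above.
my %sim = map { $_ => 1 } qw(BOL EXACT END NOTHING SBOL EOL SEOL);
my %dfa = map { $_ => 1 }
    qw(ANYOF BRANCH DIGIT END EXACT OPEN1 CLOSE1 PLUS STAR REG_ANY);

# Hypothetical helper: return the weakest matcher that covers every op.
sub classify {
    my @ops = @_;
    return 'BM'  unless grep { !exists $sim{$_} } @ops;
    return 'DFA' unless grep { !exists $dfa{$_} } @ops;
    return 'NFA';
}

# Hand-written approximations of "Final program:" blocks (assumed format).
my %sample = (
    literal => "   1: EXACT <foo> (3)\n   3: END (0)\n",
    digits  => "   1: PLUS (3)\n   2: DIGIT (0)\n   3: END (0)\n",
    backref => "   1: OPEN1 (3)\n   3: REG_ANY (4)\n   4: CLOSE1 (6)\n"
             . "   6: REF1 (8)\n   8: END (0)\n",
);

for my $name ( sort keys %sample ) {
    # Same op extraction as the main program: a space, then an
    # upper-case word of at least two characters.
    my @ops = $sample{$name} =~ / ([A-Z]\w+)/g;
    printf "%-8s %-3s %s\n", $name, classify(@ops), "@ops";
}
```

In this sketch only the backreference sample falls through to NFA, because REF1 appears in neither set. One caveat of the extraction pattern: a literal containing a space followed by an upper-case word would also be picked up as an op, which pushes that regexp into a stronger bucket than it needs.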

⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

In reply to Re^2: Perl regexp matching is slow?? by diotalevi
in thread Perl regexp matching is slow?? by smahesh
