
Fellow believers, I've got a problem with the Perl regex compiler. It seems that the compilation of combined regexes (alternation, or whatever you call it) is not optimized.

Using a /(foo|bar)/ regex on a string is slower than using a foreach loop that applies the patterns one after another. I've written a test program and looked at the Perl source to find out why. Now I know: it seems that the DFA doesn't get optimised for the alternation.
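
For a smaller case than the full test program, here is a minimal sketch of the two variants being compared, timed with the core Benchmark module. The word list, the sample text, the replacement 'X' and the sub names replace_combined and replace_loop are made up for illustration. The replacement is chosen so that it cannot be matched by either pattern, which means both variants produce the same result.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Hypothetical sample data, made up for illustration only.
my @words = qw(foo bar);
my $text  = join ' ', ('lorem foo ipsum bar dolor') x 1000;

# Build one combined pattern; quotemeta guards against regex metacharacters.
my $alternation = join '|', map { quotemeta($_) } @words;
my $combined    = qr/(?:$alternation)/;

# Variant 1: a single substitution pass using the alternation.
sub replace_combined {
    my ($string) = @_;
    $string =~ s/$combined/X/g;
    return $string;
}

# Variant 2: one substitution pass per word.
sub replace_loop {
    my ($string) = @_;
    for my $word (@words) {
        $string =~ s/\Q$word\E/X/g;
    }
    return $string;
}

# Run each variant 1000 times and print the timings.
timethese(1000, {
    combined => sub { replace_combined($text) },
    loop     => sub { replace_loop($text) },
});
```

Which variant wins may depend on the Perl version and on the patterns, so treat the numbers as a measurement on your own machine rather than a general result.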

As I have neither the time nor the knowledge and skill to optimise the Perl regex compiler from scratch, what can I do? Programming such foreach loops gives me headaches; it's so 'awk'ward.

I need to get those regexes fast, as nowadays the strings I'm working on tend to get larger (e.g. XML files). Any ideas?

Jolly

Here's the test program for those of you who don't think it's true:

#!/bin/perl 
use strict; 
use Digest::MD5 qw(md5 md5_hex md5_base64); 
use Time::HiRes qw(time ); 
#use re 'debug'  ; 


foreach my $regexcount (1,5,10) 
{ 
        foreach my $regexlength (2,5,10,20) 
        { 
                my @items       = map{ createRandomTextWithLength($regexlength); } (1..$regexcount); 
                my $regexstr    = join('|',@items); 
                my $regex               = qr /(?:$regexstr)/; 


                foreach my $stringlength (100,1000,10000,100000) 
                { 
                        print localtime()." Stringlength: $stringlength Number of Regexes:$regexcount Length of each Regex:$regexlength\n"; 


                        my $teststring = createRandomTextWithLength($stringlength); 
                        my $timer; 
                        { 
                                my $test=$teststring; 
                                $timer =time; 
                                $test =~ s/$regex/foobar/g; 
                                printf("ElapsedTime:%5.4f  %20s %20s\n",time-$timer,md5_hex($test),$regex); 
                        } 


                        { 
                                my $test=$teststring; 
                                $timer =time; 
                                foreach my $oneregex (@items) 
                                { 
                                        $test =~ s/$oneregex/foobar/g; 
                                } 
                                printf("ElapsedTime:%5.4f  %20s %20s\n",time-$timer,md5_hex($test),' for loop over '.join(',',@items)); 
                        } 
                        print "\n"; 
                } 
        } 


} 

sub createRandomTextWithLength($) 
{ 
        my($count) = (@_); 
        my $string; 
        for (1.. $count) 
        { 
                $string.=chr(ord('a')+rand(20)); 
        } 
        return $string; 
}

In reply to Regex combining /(foo|bar)/ slower than using foreach (/foo/,/bar/) ??? by JollyJinx
