Fellow believers, I've got a problem with the Perl regex compiler. It seems that the compilation of combined regexes (or alternation, whatever you call it) is not optimized.

Using a /(foo|bar)/ regex on strings is slower than using a foreach loop that does the matches one after another. I've written a test program and looked at the Perl source to find out why. Now I know: it seems that the DFA doesn't get optimised for the alternation.

As I have neither the time nor the knowledge and skill to optimise the Perl regex compiler from scratch, what can I do? Programming such foreach loops gives me headaches; it's so 'awk'ward.

I need to get those regexes fast, as nowadays the strings I'm working on tend to get larger (e.g. XML files). Any ideas?

Jolly

Here's the test program for those of you who don't think it's true:

#!/bin/perl
use strict;
use Digest::MD5 qw(md5 md5_hex md5_base64);
use Time::HiRes qw(time);
#use re 'debug';

foreach my $regexcount (1,5,10)
{
        foreach my $regexlength (2,5,10,20)
        {
                # Build $regexcount random literal patterns and one combined alternation
                my @items    = map { createRandomTextWithLength($regexlength); } (1..$regexcount);
                my $regexstr = join('|',@items);
                my $regex    = qr/(?:$regexstr)/;

                foreach my $stringlength (100,1000,10000,100000)
                {
                        print localtime()." Stringlength: $stringlength Number of Regexes:$regexcount Length of each Regex:$regexlength\n";
                        my $teststring = createRandomTextWithLength($stringlength);
                        my $timer;

                        # Variant 1: a single substitution with the combined regex
                        {
                                my $test = $teststring;
                                $timer = time;
                                $test =~ s/$regex/foobar/g;
                                printf("ElapsedTime:%5.4f  %20s %20s\n",time-$timer,md5_hex($test),$regex);
                        }

                        # Variant 2: one substitution per pattern, applied one after another
                        {
                                my $test = $teststring;
                                $timer = time;
                                foreach my $oneregex (@items)
                                {
                                        $test =~ s/$oneregex/foobar/g;
                                }
                                printf("ElapsedTime:%5.4f  %20s %20s\n",time-$timer,md5_hex($test),' for loop over '.join(',',@items));
                        }
                        print "\n";
                }
        }
}

# Returns a string of $count random characters in the range 'a'..'t'
sub createRandomTextWithLength($)
{
        my($count) = (@_);
        my $string;
        for (1..$count)
        {
                $string .= chr(ord('a')+rand(20));
        }
        return $string;
}
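The same comparison can also be run with the core Benchmark module, which takes care of the repetition and timing. What follows is only a minimal sketch, not the test program above: the sub names replace_combined and replace_loop, the fixed word list, the replacement text 'X' and the sample string are all assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical fixed literals; the test program above uses random ones.
my @items = qw(foo bar baz);

# One combined alternation, with every literal escaped.
my $combined = join '|', map { quotemeta($_) } @items;
my $regex    = qr/(?:$combined)/;

# Variant 1: a single pass with the combined regex.
sub replace_combined {
    my ($text) = @_;
    $text =~ s/$regex/X/g;
    return $text;
}

# Variant 2: one pass per literal, applied one after another.
sub replace_loop {
    my ($text) = @_;
    for my $item (@items) {
        $text =~ s/\Q$item\E/X/g;
    }
    return $text;
}

# Illustrative input: 10_000 three-letter words glued together.
my @words      = qw(foo bar baz qux);
my $teststring = join '', map { $words[ $_ % 4 ] } 1 .. 10_000;

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    combined => sub { replace_combined($teststring) },
    loop     => sub { replace_loop($teststring) },
});
```

Note that the two variants are only interchangeable when the patterns cannot overlap and the replacement text cannot itself be matched by a later pattern; comparing the results, as the MD5 sums in the test program do, is a sensible check.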

In reply to Regex combining /(foo|bar)/ slower than using foreach (/foo/,/bar/) ???   by JollyJinx
