In reply to everyone saying that the pathological cases are only a theoretical problem, and are easily avoided, I ran into a pathological case in real life, and found Russ's article via google for perl slow regex.

I have a bunch of PDFs and some metadata I want to verify. Basically I want to check that the file actually contains the title listed in the metadata table, and stuff like that, to make sure the right files have the right names. Punctuation often varies, so I did this:

for loop over records from metadata table { slurp $docid.txt (from pdftotext) into $contents; $s = $title; $s =~ s/\W+/.+?/g; # gaps/punctuation between words -> wildcards if($contents =~ /$s/sig){ print "pos = ", pos $contents, ",$s|"; pos $contents = 0; }else{ print "$textfile didn't contain $s\n"; } }

So I'm matching a regex that looks like Recogntion.+?of.+?Credit.+?History.+?for.+?New.+?Immigrants against maybe 100kB of text. That regex takes more than 1/2 hour to not match, on a 2.4GHz C2D (AMD64 Ubuntu, perl v5.8.8). (matches are fast, though.)

I'm just using /g here because I wanted to know where the title matched. Matching against the whole file contents is overkill for the title, but I'm planning to check for other metadata, which could appear at the end of the file.

My first attempt didn't use non-greedy wildcards, and took 30s for one easy match near the beginning of the string, and much longer for others. Finding Russ's article clued me in to why it was so slow in the common matching case.

I eventually gave up on perl's regex engine for this task, and replaced the transformation to wildcards with an explicit search for each word in order.

my @s = split(/\W+/, $s); foreach $word (@s){ if($contents !~ /$word/sig){ print "$textfile didn't contain $s\n"; $nomatch = 1; last; } } print "pos = ", pos $contents, ",$s|" unless $nomatch;
This runs on 345 text files totaling 52MB in
real 0m0.150s user 0m0.044s sys 0m0.068s

The version using non-greedy wildcards is still running after 784 minutes (started last night). It's still stuck searching for Broadband.+?in.+?Ontario.+?from.+?an.+?Urban.+?and.+?Regional.+?Perspective.+?digital.+?divides.+?content.+?inflation.+?citizen.+?engagement.+?and.+?criteria.+?for.+?success in a 265k string that doesn't match. (The PDF has the title in an image, so it's not in the pdftotext output).

time tr -d '\n' < 250234.txt | egrep 'the...pattern' real 0m7.178s user 0m6.592s sys 0m0.036s time tr -d '\n' < 250234.txt | LANG=C egrep 'the...pattern' real 0m0.027s user 0m0.012s sys 0m0.008s
That 784min perl process is running with LANG=C, too. (I'd rather use a locale where \W didn't match accented vowels, since some of the documents are in French. They're a digital library collection of documents, often EN and FR versions of the same thing from Canadian govt. agencies. Anyway en_CA.utf-8 is not such a locale, so I might as well use LANG=C.)

The point of this post is to say this was a real-world problem for me, and it would have saved me several hours reading about this if perl used a non-backtracking matcher for patterns that didn't need backtracking. I still haven't thought of a regex that's fast (in perl) in the matching and non-matching cases. If a m//g loop didn't work, I could have piped tr -d '\n' | egrep, and it would have been fast enough.

I'm glad to hear that future versions of perl might have a non-backtracking regex engine. Happy hacking.

-- Peter Cordes, peter@cordes.ca


In reply to Re: Perl regexp matching is slow?? by Anonymous Monk
in thread Perl regexp matching is slow?? by smahesh

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.