in reply to Perl regexp matching is slow??

In reply to everyone saying that the pathological cases are only a theoretical problem and are easily avoided: I ran into a pathological case in real life, and found Russ's article via a Google search for "perl slow regex".

I have a bunch of PDFs and some metadata I want to verify. Basically I want to check that the file actually contains the title listed in the metadata table, and stuff like that, to make sure the right files have the right names. Punctuation often varies, so I did this:

# loop over records from the metadata table; $docid, $title and $textfile
# come from each record
# slurp $docid.txt (from pdftotext) into $contents:
{
    local $/;
    open my $fh, '<', "$docid.txt" or die "$docid.txt: $!";
    $contents = <$fh>;
}

$s = $title;
$s =~ s/\W+/.+?/g;    # gaps/punctuation between words -> wildcards
if ($contents =~ /$s/sig) {
    print "pos = ", pos $contents, ",$s|";
    pos $contents = 0;
} else {
    print "$textfile didn't contain $s\n";
}

So I'm matching a regex that looks like Recogntion.+?of.+?Credit.+?History.+?for.+?New.+?Immigrants against maybe 100kB of text. That regex takes more than half an hour to fail to match, on a 2.4GHz Core 2 Duo (AMD64 Ubuntu, perl v5.8.8). (Matches are fast, though.)
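(As an aside: if you want to watch what the engine actually does with one of these patterns, perl's re pragma can trace a match. This is just a sketch on toy strings, not my data or my script:)

use strict;
use warnings;
use re 'debug';    # dump the compiled pattern and every step of the match to STDERR

my $title    = "Recognition of Credit History";
(my $pat = $title) =~ s/\W+/.+?/g;    # same title-to-pattern transformation as above
my $contents = "recognition of credit ratings and credit history checks";
print "matched, pos = ", pos $contents, "\n" if $contents =~ /$pat/sig;

On toy input the trace is readable; on the failing 100kB case it would be astronomically long, which is the problem in a nutshell.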

I'm just using /g here because I wanted to know where the title matched. Matching against the whole file contents is overkill for the title, but I'm planning to check for other metadata, which could appear at the end of the file.

My first attempt used greedy wildcards (.+ instead of .+?), and took 30s even for an easy match near the beginning of the string, and much longer for others. Finding Russ's article clued me in to why it was so slow in the common matching case.
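For reference, the greedy first attempt was built with something like this (a reconstruction using the $title and $contents from the loop above, not my exact code):

# Greedy wildcards: under /s each .+ first swallows everything to the end of
# the string, then backs off a character at a time until the next word is
# found, so even a successful match near the start does a lot of wasted scanning.
my $greedy = $title;
$greedy =~ s/\W+/.+/g;    # gaps/punctuation between words -> greedy wildcards
if ($contents =~ /$greedy/sig) {
    print "pos = ", pos $contents, ",$greedy|";
}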

I eventually gave up on perl's regex engine for this task, and replaced the transformation to wildcards with an explicit search for each word in order.

my @s = split(/\W+/, $s);
foreach $word (@s) {
    # scalar-context //g picks up from pos($contents), so each word has to be
    # found after the previous one, i.e. the words must appear in order
    if ($contents !~ /$word/sig) {
        print "$textfile didn't contain $s\n";
        $nomatch = 1;
        last;
    }
}
print "pos = ", pos $contents, ",$s|" unless $nomatch;
This runs on 345 text files totaling 52MB in
real    0m0.150s
user    0m0.044s
sys     0m0.068s

The version using non-greedy wildcards is still running after 784 minutes (started last night). It's still stuck searching for Broadband.+?in.+?Ontario.+?from.+?an.+?Urban.+?and.+?Regional.+?Perspective.+?digital.+?divides.+?content.+?inflation.+?citizen.+?engagement.+?and.+?criteria.+?for.+?success in a 265k string that doesn't match. (The PDF has the title in an image, so it's not in the pdftotext output).

For comparison, here's egrep on one of the text files, without and then with LANG=C:

time tr -d '\n' < 250234.txt | egrep 'the...pattern'
real    0m7.178s
user    0m6.592s
sys     0m0.036s

time tr -d '\n' < 250234.txt | LANG=C egrep 'the...pattern'
real    0m0.027s
user    0m0.012s
sys     0m0.008s
That 784-minute perl process is running with LANG=C, too. (I'd rather use a locale where \W doesn't match accented vowels, since some of the documents are in French; they're a digital-library collection, often EN and FR versions of the same thing from Canadian govt. agencies. Anyway, en_CA.utf-8 is not such a locale, so I might as well use LANG=C.)
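(If I'd pursued that instead, one option, sketched here from general perl knowledge rather than anything I actually ran, is to decode the text so perl applies Unicode rules, under which accented letters count as word characters:)

use Encode qw(decode);

# Once decoded, \w matches é, à, etc., so split /\W+/ and s/\W+/.../ stop
# treating accented vowels as punctuation in the French titles.
my $title_u    = decode('UTF-8', $title);
my $contents_u = decode('UTF-8', $contents);
my @words_u    = split /\W+/, $title_u;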

The point of this post is that this was a real-world problem for me, and it would have saved me several hours of reading about this if perl used a non-backtracking matcher for patterns that don't need backtracking. I still haven't thought of a regex that's fast (in perl) in both the matching and non-matching cases. If the m//g loop hadn't worked, I could have piped to tr -d '\n' | egrep, and that would have been fast enough.
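(For completeness, here's a hypothetical sketch of that fallback, reusing $textfile, $s and the word list @s from above and assuming filenames without shell-unfriendly characters; it's not something I actually ran:)

# Join the words with plain greedy .*: egrep only needs a yes/no answer per
# file, and GNU grep's DFA matcher doesn't suffer the pathological backtracking.
my $ere = join '.*', @s;
my $cmd = "tr -d '\\n' < '$textfile' | LANG=C egrep -q -- '$ere'";
if (system($cmd) != 0) {    # egrep -q exits 0 when the pattern matches
    print "$textfile didn't contain $s\n";
}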

I'm glad to hear that future versions of perl might have a non-backtracking regex engine. Happy hacking.

-- Peter Cordes, peter@cordes.ca

Re^2: Perl regexp matching is slow??
by ripper234 (Initiate) on Aug 12, 2008 at 20:28 UTC

    I just wanted to reiterate Peter's view - these pathological cases do come up in practice. My example is this regex, designed to find URLs in HTML documents (pathological in .NET).

    href=['"]([^?'">]*\??[^'" >]*)[^>]*([^>]*)</a>

    As a user, I do not want to store in my brain the knowledge of which regexes are "safe". If this knowledge is easy to describe, why not code it, and at least for this class of regexes, use the Thompson algorithm?
    Of course, as an optimization that applies only to some use cases, this would have to be prioritized, tested & integrated. But in a perfect Perl, I would want to see it included.