in reply to Perl regexp matching is slow??

In reply to everyone saying that the pathological cases are only a theoretical problem and are easily avoided: I ran into a pathological case in real life, and found Russ's article via a Google search for "perl slow regex".

I have a bunch of PDFs and some metadata I want to verify. Basically I want to check that the file actually contains the title listed in the metadata table, and stuff like that, to make sure the right files have the right names. Punctuation often varies, so I did this:

# loop over records from the metadata table; $docid, $title and $textfile
# come from each record
# slurp $docid.txt (from pdftotext) into $contents:
{
    local $/;
    open my $fh, '<', "$docid.txt" or die "$docid.txt: $!";
    $contents = <$fh>;
}

$s = $title;
$s =~ s/\W+/.+?/g;    # gaps/punctuation between words -> wildcards
if ($contents =~ /$s/sig) {
    print "pos = ", pos $contents, ",$s|";
    pos $contents = 0;
} else {
    print "$textfile didn't contain $s\n";
}

So I'm matching a regex that looks like Recogntion.+?of.+?Credit.+?History.+?for.+?New.+?Immigrants against maybe 100kB of text. That regex takes more than half an hour to fail to match, on a 2.4GHz Core 2 Duo (AMD64 Ubuntu, perl v5.8.8). (Matches are fast, though.)
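(As an aside: if you want to watch what the engine actually does with one of these patterns, perl's re pragma can trace a match. This is just a sketch on toy strings, not my data or my script:)

use strict;
use warnings;
use re 'debug';    # dump the compiled pattern and every step of the match to STDERR

my $title    = "Recognition of Credit History";
(my $pat = $title) =~ s/\W+/.+?/g;    # same title-to-pattern transformation as above
my $contents = "recognition of credit ratings and credit history checks";
print "matched, pos = ", pos $contents, "\n" if $contents =~ /$pat/sig;

On toy input the trace is readable; on the failing 100kB case it would be astronomically long, which is the problem in a nutshell.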

I'm just using /g here because I wanted to know where the title matched. Matching against the whole file contents is overkill for the title, but I'm planning to check for other metadata, which could appear at the end of the file.

My first attempt used greedy wildcards (.+ instead of .+?), and took 30s even for an easy match near the beginning of the string, and much longer for others. Finding Russ's article clued me in to why it was so slow in the common matching case.
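For reference, the greedy first attempt was built with something like this (a reconstruction using the $title and $contents from the loop above, not my exact code):

# Greedy wildcards: under /s each .+ first swallows everything to the end of
# the string, then backs off a character at a time until the next word is
# found, so even a successful match near the start does a lot of wasted scanning.
my $greedy = $title;
$greedy =~ s/\W+/.+/g;    # gaps/punctuation between words -> greedy wildcards
if ($contents =~ /$greedy/sig) {
    print "pos = ", pos $contents, ",$greedy|";
}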

I eventually gave up on perl's regex engine for this task, and replaced the transformation to wildcards with an explicit search for each word in order.

my @s = split(/\W+/, $s);
foreach $word (@s) {
    # scalar-context //g picks up from pos($contents), so each word has to be
    # found after the previous one, i.e. the words must appear in order
    if ($contents !~ /$word/sig) {
        print "$textfile didn't contain $s\n";
        $nomatch = 1;
        last;
    }
}
print "pos = ", pos $contents, ",$s|" unless $nomatch;
This runs on 345 text files totaling 52MB in
real    0m0.150s
user    0m0.044s
sys     0m0.068s

The version using non-greedy wildcards is still running after 784 minutes (started last night). It's still stuck searching for Broadband.+?in.+?Ontario.+?from.+?an.+?Urban.+?and.+?Regional.+?Perspective.+?digital.+?divides.+?content.+?inflation.+?citizen.+?engagement.+?and.+?criteria.+?for.+?success in a 265k string that doesn't match. (The PDF has the title in an image, so it's not in the pdftotext output).

For comparison, here's egrep on one of the text files, without and then with LANG=C:

time tr -d '\n' < 250234.txt | egrep 'the...pattern'
real    0m7.178s
user    0m6.592s
sys     0m0.036s

time tr -d '\n' < 250234.txt | LANG=C egrep 'the...pattern'
real    0m0.027s
user    0m0.012s
sys     0m0.008s
That 784-minute perl process is running with LANG=C, too. (I'd rather use a locale where \W doesn't match accented vowels, since some of the documents are in French; they're a digital-library collection, often EN and FR versions of the same thing from Canadian govt. agencies. Anyway, en_CA.utf-8 is not such a locale, so I might as well use LANG=C.)
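(If I'd pursued that instead, one option, sketched here from general perl knowledge rather than anything I actually ran, is to decode the text so perl applies Unicode rules, under which accented letters count as word characters:)

use Encode qw(decode);

# Once decoded, \w matches é, à, etc., so split /\W+/ and s/\W+/.../ stop
# treating accented vowels as punctuation in the French titles.
my $title_u    = decode('UTF-8', $title);
my $contents_u = decode('UTF-8', $contents);
my @words_u    = split /\W+/, $title_u;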

The point of this post is that this was a real-world problem for me, and it would have saved me several hours of reading about this if perl used a non-backtracking matcher for patterns that don't need backtracking. I still haven't thought of a regex that's fast (in perl) in both the matching and non-matching cases. If the m//g loop hadn't worked, I could have piped to tr -d '\n' | egrep, and that would have been fast enough.
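(For completeness, here's a hypothetical sketch of that fallback, reusing $textfile, $s and the word list @s from above and assuming filenames without shell-unfriendly characters; it's not something I actually ran:)

# Join the words with plain greedy .*: egrep only needs a yes/no answer per
# file, and GNU grep's DFA matcher doesn't suffer the pathological backtracking.
my $ere = join '.*', @s;
my $cmd = "tr -d '\\n' < '$textfile' | LANG=C egrep -q -- '$ere'";
if (system($cmd) != 0) {    # egrep -q exits 0 when the pattern matches
    print "$textfile didn't contain $s\n";
}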

I'm glad to hear that future versions of perl might have a non-backtracking regex engine. Happy hacking.

-- Peter Cordes, peter@cordes.ca

Re^2: Perl regexp matching is slow??
by ripper234 (Initiate) on Aug 12, 2008 at 20:28 UTC

    I just wanted to reiterate Peter's view - these pathological cases do come up in practice. My example is this regex, designed to find URLs in HTML documents (pathological in .NET).

    href=['"]([^?'">]*\??[^'" >]*)[^>]*([^>]*)</a>

    As a user, I do not want to store in my brain the knowledge of which regexes are "safe". If this knowledge is easy to describe, why not code it, and at least for this class of regexes, use the Thompson algorithm?
    Of course, as an optimization that applies only to some use cases, this would have to be prioritized, tested & integrated. But in a perfect Perl, I would want to see it included.