comment on

In reply to everyone saying that the pathological cases are only a theoretical problem, and are easily avoided, I ran into a pathological case in real life, and found Russ's article via google for perl slow regex.

I have a bunch of PDFs and some metadata I want to verify. Basically I want to check that the file actually contains the title listed in the metadata table, and stuff like that, to make sure the right files have the right names. Punctuation often varies, so I did this:

    for loop over records from metadata table {
    slurp $docid.txt (from pdftotext) into $contents;
    $s = $title;
    $s =~ s/\W+/.+?/g;  # gaps/punctuation between words -> wildcards
    if($contents =~ /$s/sig){
        print "pos = ", pos $contents, ",$s|";
        pos $contents = 0;
    }else{
        print "$textfile didn't contain $s\n";
    }
    }
[download]

So I'm matching a regex that looks like Recogntion.+?of.+?Credit.+?History.+?for.+?New.+?Immigrants against maybe 100kB of text. That regex takes more than 1/2 hour to not match, on a 2.4GHz C2D (AMD64 Ubuntu, perl v5.8.8). (matches are fast, though.)

I'm just using /g here because I wanted to know where the title matched. Matching against the whole file contents is overkill for the title, but I'm planning to check for other metadata, which could appear at the end of the file.

My first attempt didn't use non-greedy wildcards, and took 30s for one easy match near the beginning of the string, and much longer for others. Finding Russ's article clued me in to why it was so slow in the common matching case.

I eventually gave up on perl's regex engine for this task, and replaced the transformation to wildcards with an explicit search for each word in order.

    my @s = split(/\W+/, $s);
    foreach $word (@s){
        if($contents !~ /$word/sig){
        print "$textfile didn't contain $s\n";
        $nomatch = 1;
        last;
        }
    }
    print "pos = ", pos $contents, ",$s|" unless $nomatch;
[download]

This runs on 345 text files totaling 52MB in

real    0m0.150s
user    0m0.044s
sys    0m0.068s
[download]

The version using non-greedy wildcards is still running after 784 minutes (started last night). It's still stuck searching for Broadband.+?in.+?Ontario.+?from.+?an.+?Urban.+?and.+?Regional.+?Perspective.+?digital.+?divides.+?content.+?inflation.+?citizen.+?engagement.+?and.+?criteria.+?for.+?success in a 265k string that doesn't match. (The PDF has the title in an image, so it's not in the pdftotext output).

time tr -d '\n' < 250234.txt | egrep 'the...pattern'
real    0m7.178s
user    0m6.592s
sys    0m0.036s

time tr -d '\n' < 250234.txt | LANG=C egrep 'the...pattern'
real    0m0.027s
user    0m0.012s
sys    0m0.008s
[download]

That 784min perl process is running with LANG=C, too. (I'd rather use a locale where \W didn't match accented vowels, since some of the documents are in French. They're a digital library collection of documents, often EN and FR versions of the same thing from Canadian govt. agencies. Anyway en_CA.utf-8 is not such a locale, so I might as well use LANG=C.)

The point of this post is to say this was a real-world problem for me, and it would have saved me several hours reading about this if perl used a non-backtracking matcher for patterns that didn't need backtracking. I still haven't thought of a regex that's fast (in perl) in the matching and non-matching cases. If a m//g loop didn't work, I could have piped tr -d '\n' | egrep, and it would have been fast enough.

I'm glad to hear that future versions of perl might have a non-backtracking regex engine. Happy hacking.

-- Peter Cordes, peter@cordes.ca

In reply to Re: Perl regexp matching is slow?? by Anonymous Monk
in thread Perl regexp matching is slow?? by smahesh

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.