how to speed up that dynamic regex?

rsFalse has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I tried to solve problem of nested parentheses. I have read in J.Friedl's book (Mastering Regular Expressions II ed.) about dynamic regexes - (??{code}).

The problem is so: You are given a sequence of (opening|closing) parentheses. What is the longest subsequence which is valid nested parentheses <for example: (), ()(), (()), ((())())>, and how much subsequences of that length occurs in the sequnce?

Here is the code how I am solving that problem:

@lines = <>;

$n=1000;
$long = '';
srand;
# generating one random sequence of length $n:
while ($n--){
    $long .= qw<( )>[int (rand()*2) ]
    }
    
$m;
$m = qr/\((?:|(??{$m})++)*\)/; # main thing

for (@lines, $long){
    undef %occ;
    map { ++$occ{(length)} } /$m+/g;
    $max = 0;
    $max < $_ and $max = $_ for keys %occ;
    print $max," ",$occ{$max} || 1,$/;
    }
[download]

# Example input:
)((())))(()())
))(
()(())()
((((()(((
(()())()(())()()())())()((()(()(())()()())((()(())()(()()()()))()(())(
+)(((()())()(()((())()(())(()))

# Example output:
6 2
0 1
8 1
2 1
28 1
662 1
[download]

You can test it in: http://ideone.com/X8BUwy

Program works in different time, although $n is set to 1000. Sometimes it works 0.0 sec, sometimes >5 sec.

Any suggestions how to speed up regex? And how can someone determine when (with what sequences?) program will run slow?

Comment on how to speed up that dynamic regex? Select or Download Code

Replies are listed 'Best First'.

Re: how to speed up that dynamic regex?
by Athanasius (Archbishop) on Nov 04, 2014 at 13:42 UTC

Hello rsFalse,

First, the code you have does not appear to be producing correct results. For example, you have this line of input:

()(())()
[download]

which produces:

8 1
[download]

but by inspection we can see that the longest subsequence of validly-nested parentheses is (()) which is only 4 characters long.

Second, as Eily says you can get a significant speedup by using (?PARNO). For example, adapting the code in Can-I-use-Perl-regular-expressions-to-match-balanced-text of perlfaq6, I came up with this:

use strict;
use warnings;
use List::Util 'max';

srand;
my @lines = <DATA>;
my $n     = 1000;
my $long  = '';
   $long .= qw{( )}[ int(rand() * 2) ] for 1 .. $n;
my $regex =
qr{
    (                       # start of capture group 1
        \(                  # match an opening parenthesis
            (?:
                [^()]++     # one or more non-parentheses, non backtra
+cking
                |           # OR
                (?1)        # recurse to capture group 1
            )*
        \)                  # match a closing parenthesis
    )                       # end of capture group 1
}x;

for my $string (@lines, $long)
{
    my $max    = 0;
    my $occ    = 1;
    my @groups = $string =~ m/$regex/g;

    if (@groups)
    {
        @groups = sort { length $a <=> length $b } @groups;
        $max = length $groups[-1];

        for my $i (reverse 0 .. $#groups - 1)
        {
            if (length $groups[$i] == length $groups[-1])
            {
                ++$occ;
            }
            else
            {
                last;
            }
        }
    }

    print "$max $occ\n";
}

__DATA__
)((())))(()())
))(
()(())()
((((()(((
(()())()(())()()())())()((()(()(())()()())((()(())()(()()()()))()(())(
+)(((()())()(()((())()(())(()))
[download]

Typical output:

23:28 >perl 1071_SoPW.pl
6 2
0 1
4 1
2 1
20 1
266 1

23:28 >
[download]

For a long string this is much faster than using (??{ code }).

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

[reply]
[d/l]
[select]

Re^2: how to speed up that dynamic regex?

by AnomalousMonk (Archbishop) on Nov 05, 2014 at 01:59 UTC

my $regex = qr{ ( # start of capture group 1 ... (?1) # recurse to capture group 1 ... ) # end of capture group 1 }x;
[download]

A neat feature of the (?PARNO) extended pattern (available with Perl versions 5.10+) is that the numbering of capture groups can be made relative instead of absolute. This shines brightest when defining qr// regex objects, which are designed to be interpolated into other qr// m// s/// expressions. The logic of group recursion can be encapsulated and made independent of whatever other expressions go into the final regex.

In Athanasius's code above, if even one more capturing group sneaks into the m// ahead of $regex in the extraction expression
my @groups = $string =~ m/$regex/g;
as in
my @groups = $string =~ m/(x?)$regex/g;
capture group numbering is thrown off and its function is destroyed. If the absolute (?1) group recursion in
my $regex = qr{ (... (?1) ...) }x;
is made relative with (?-1) then any number of extra preceding capture groups will make no difference to its function:
my @groups = $string =~ m/(x?)(x?)(x?)$regex/g;

[reply]
[d/l]
[select]

Re^3: how to speed up that dynamic regex?

by rsFalse (Chaplain) on Nov 07, 2014 at 10:44 UTC

my $regex =
qr{
                           # start of full pattern
        \(                  # match an opening parenthesis
            (?:
                0        # everytime false, because of such DATA
                |           # OR
                (?0)        # recurse to capture full pattern
            )*
        \)                  # match a closing parenthesis
                           # end of  full pattern
}x;
[download]

[reply]
[d/l]

Re^4: how to speed up that dynamic regex?

by rsFalse (Chaplain) on Nov 07, 2014 at 11:28 UTC

Re^2: how to speed up that dynamic regex?

by rsFalse (Chaplain) on Nov 04, 2014 at 23:45 UTC

Such verbose! :) Thanks! I'll read about (?PARNO).

[reply]

Re: how to speed up that dynamic regex?
by Eily (Monsignor) on Nov 04, 2014 at 12:20 UTC

You have this problem (the nested parenthesis problem I mean) solved the way you did in perlre. And just below it:

See also (?PARNO) for a different, more efficient way to accomplish the same task.

[reply]