PetaMem has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I came around this old code of mine:

# Get the n-th or innermost (if n = -1) argument of # property. Empty string if property isn't parametrized sub get_proparg { my $propstr = shift; # get the property string my $level = shift || 0; # get the level we want to extract my $back = $propstr; # Initialize Return value (for case lev +el=0) my $cnt; # initialize counter if($level == -1) { # special case, get the innermost argum +ent $propstr =~ /\(([^\(\)]+)\)+/; $propstr = $1; } else { # get whatever argument $level indicate +s for($cnt = 0;$cnt<$level; $cnt++) { $propstr =~ /\((.+)\)/; $propstr = $1; } } return $propstr; }
What it basically does is to remove a properties outer peeling like an onion: ;-)
"hello(what(is(this(all(about)))))" 0
get_proparg returned: >>hello(what(is(this(all(about)))))<<
"hello(what(is(this(all(about)))))" 1
get_proparg returned: >>what(is(this(all(about))))<<
"hello(what(is(this(all(about)))))" 2
get_proparg returned: >>is(this(all(about)))<<
"hello(what(is(this(all(about)))))" 3
get_proparg returned: >>this(all(about))<<
"hello(what(is(this(all(about)))))" 4
get_proparg returned: >>all(about)<<
"hello(what(is(this(all(about)))))" 5
get_proparg returned: >>about<<
"hello(what(is(this(all(about)))))" -1
get_proparg returned: >>about<<
As this code happens to be time critical I came around and tried to refactor it. The biggest culprit seems to be the iterative for loop which is of course linear with the number of levels we have to descend. We may surely assume that the parentheses are balanced. But unfortunatedly the mightyness of Regexps doesn't cover "balanced parentheses matching".

Anyone a suggestion how to get rid of the loop?

Bye
 PetaMem
    All Perl:   MT, NLP, NLU

Replies are listed 'Best First'.
Re: Peeling the Peelings
by l2kashe (Deacon) on Jul 01, 2003 at 19:32 UTC
    Here is my contribution

    sub get_inner { my $str = shift; my $inner = ( split(/\(/, $str) )[-1]; $inner =~ s/\).+$//; $inner; }
    Im sure it could be golfed, but *shrug* HTH
    Update: or the reverse would be
    sub get_inner { my $str = shift; my $inner = ( split(/\)/, $str) )[0]; $inner =~ m/([^\(]+)$/; $1; }


    MMMMM... Chocolaty Perl Goodness.....
Re: Peeling the Peelings
by Zaxo (Archbishop) on Jul 01, 2003 at 19:12 UTC

    I think &Text::Balanced::extract_multiple does what you want.

    After Compline,
    Zaxo

      Yes, but two orders of magnitude slower than the solution I wanted to speed up. ;-)

      Bye
       PetaMem
          All Perl:   MT, NLP, NLU

Re: Peeling the Peelings
by gjb (Vicar) on Jul 01, 2003 at 19:26 UTC

    The following should do what you want if the input is in the format you specify, i.e. nested function with single arguments.

    my $example1 = 'this(is(a(test)))'; foreach (0..4) { print "level $_: '", peel($example1, $_), "'\n"; } print "level -1: '", peel($example1, -1), "'\n"; sub peel { my ($str, $level) = @_; if ($level < 0) { $str =~ /([^\(\)]+)\)+\Z/; return $1; } my $opening = '(?:[^\(]+\(){' . $level . '}'; my $closing = '\){' . $level . '}'; $str =~ /\A$opening(.+?)$closing\Z/; return $1; }

    Hope this helps, -gjb-

      Rate Peel Get_Proparg Peel 1359/s -- -76% Get_Proparg 5780/s 325% --
      Looks like thats a wee bit slower.

      -Waswas

        It's easy to optimize a bit, although it's still 20% slower than Aristotle's version :-(

        my %cache = ('-1' => qr/([^\(\)]+)\)+\Z/); sub peel { my $str = shift; my $level = shift || 0; if (!exists $cache{$level}) { my $opening = '[^\(]+\(' x $level; my $closing = '\)' x $level; $cache{$level} = qr/\A$opening(.+?)$closing\Z/; } $str =~ $cache{$level}; return $1; }
        The benchmark results are:
        Rate Fatvamp gjb bobn Aristotle Fatvamp 21155/s -- -14% -17% -28% gjb 24631/s 16% -- -3% -16% bobn 25523/s 21% 4% -- -13% Aristotle 29497/s 39% 20% 16% --

        Best regards, -gjb-

Re: Peeling the Peelings
by PetaMem (Priest) on Jul 02, 2003 at 05:11 UTC
    Thanks for all your comments.

    After reading the posts - especially the benchmarks, it seems that the code is not as bad as I thought. However just by looking at it once again and looking at the "80% speedup" for the level 0 case, I came with 2 improvements:

    sub get_proparg { my $propstr = shift; # get the property string my $level = shift || return $propstr; # get the level we want to +extract my $cnt; # initialize counter if($level == -1) { # special case, get the innermost argum +ent $propstr =~ /\(([^\(\)]+)\)+/; return $1; } else { # get whatever argument $level indicate +s for($cnt = 0;$cnt<$level; $cnt++) { $propstr =~ /\((.+)\)/; $propstr = $1; } return $propstr; } }
    • No need for $back - it wasn't used anyway ;-)
    • Immediate return if level = 0; That's straight :-)
    • No unnecessary temporary $var = $var assignments
    Should perform slightly better now, esp. for the level 0 case (which unfortunatedly is rare).

    Bye
     PetaMem
        All Perl:   MT, NLP, NLU

      Sorry, dude, I still 0wn y0u! ;-) Results:
      Benchmark: timing 25000 iterations of mine-elegant, mine-less-elegant, + original, original++... mine-elegant: 8 wallclock secs ( 8.20 usr + 0.00 sys = 8.20 CPU) @ +3048.78/s (n=25000) mine-less-elegant: 6 wallclock secs ( 5.64 usr + 0.01 sys = 5.65 CP +U) @ 4424.78/s (n=25000) original: 9 wallclock secs ( 9.29 usr + 0.00 sys = 9.29 CPU) @ 26 +91.07/s (n=25000) original++: 9 wallclock secs ( 9.33 usr + 0.00 sys = 9.33 CPU) @ 26 +79.53/s (n=25000) Benchmark: timing 25000 iterations of mine-elegant, mine-less-elegant, + original, original++... mine-elegant: 9 wallclock secs ( 8.67 usr + 0.00 sys = 8.67 CPU) @ +2883.51/s (n=25000) mine-less-elegant: 8 wallclock secs ( 6.11 usr + 0.05 sys = 6.16 CP +U) @ 4058.44/s (n=25000) original: 9 wallclock secs ( 8.67 usr + 0.01 sys = 8.68 CPU) @ 28 +80.18/s (n=25000) original++: 9 wallclock secs ( 8.72 usr + 0.01 sys = 8.73 CPU) @ 28 +63.69/s (n=25000) Benchmark: timing 25000 iterations of mine-elegant, mine-less-elegant, + original, original++... mine-elegant: 9 wallclock secs ( 8.40 usr + 0.00 sys = 8.40 CPU) @ +2976.19/s (n=25000) mine-less-elegant: 6 wallclock secs ( 5.22 usr + 0.02 sys = 5.24 CP +U) @ 4770.99/s (n=25000) original: 9 wallclock secs ( 8.33 usr + 0.00 sys = 8.33 CPU) @ 30 +01.20/s (n=25000) original++: 8 wallclock secs ( 7.88 usr + 0.02 sys = 7.90 CPU) @ 31 +64.56/s (n=25000)


      --Bob Niederman, http://bob-n.com
        Hehe - you actually do. ++

        But not that much anymore: ;-)

        sub get_proparg_new { my $propstr = shift; # get the property string my $level = shift || return $propstr; # get the level we want to e +xtract if($level == -1) { # special case, get the innermost argume +nt $propstr =~ /\(([^()]+)\)+/; return $1; } else { # get whatever argument $level indicates while ($level--) { $propstr =~ /\((.+)\)/; $propstr = $1; } } return $propstr; }
        Benchmark: timing 25000 iterations of mine-elegant, mine-less-elegant, original, original++...
        mine-elegant:  5 wallclock secs ( 4.61 usr +  0.00 sys =  4.61 CPU) @ 5422.99/s (n=25000)
        mine-less-elegant:  2 wallclock secs ( 2.16 usr +  0.00 sys =  2.16 CPU) @ 11574.07/s (n=25000)
          original:  2 wallclock secs ( 2.55 usr +  0.00 sys =  2.55 CPU) @ 9803.92/s (n=25000)
        original++:  3 wallclock secs ( 2.32 usr +  0.00 sys =  2.32 CPU) @ 10775.86/s (n=25000)
        
        I guess I'll trade that performance loss against readability. But thanks for the Ideas I could borrow and reuse.

        Bye
         PetaMem
            All Perl:   MT, NLP, NLU

        Bow down. ;)
        Aristotle => sub { my $str = shift; my $level = shift || return $str; $str =~ /.*\((.*?)\)/ and return $1 if $level == -1; local $_ = substr $str, 0, -$level; /\G.*?\(/gc while $level--; return substr $_, pos; },
        ok 1 - by Aristotle for case -1 ok 2 - by Aristotle for case 0 ok 3 - by Aristotle for case 1 ok 4 - by Aristotle for case 2 ok 5 - by Aristotle for case 3 ok 6 - by Aristotle for case 4 ok 7 - by Aristotle for case 5 ok 8 - by bobn for case -1 ok 9 - by bobn for case 0 ok 10 - by bobn for case 1 ok 11 - by bobn for case 2 ok 12 - by bobn for case 3 ok 13 - by bobn for case 4 ok 14 - by bobn for case 5 ok 15 - by Fatvamp for case -1 ok 16 - by Fatvamp for case 0 ok 17 - by Fatvamp for case 1 ok 18 - by Fatvamp for case 2 ok 19 - by Fatvamp for case 3 ok 20 - by Fatvamp for case 4 ok 21 - by Fatvamp for case 5
        Benchmark: running Aristotle, Fatvamp, bobn for at least 2 CPU seconds +... Aristotle: 2 wallclock secs ( 2.12 usr + 0.01 sys = 2.13 CPU) @ 25 +239.44/s (n=53760) Fatvamp: 2 wallclock secs ( 2.10 usr + 0.00 sys = 2.10 CPU) @ 15 +751.90/s (n=33079) bobn: 2 wallclock secs ( 2.05 usr + 0.00 sys = 2.05 CPU) @ 22 +247.80/s (n=45608) Rate Fatvamp bobn Aristotle Fatvamp 15752/s -- -29% -38% bobn 22248/s 41% -- -12% Aristotle 25239/s 60% 13% -- 1..21

        Update

        Here's another take with cached pregenerated regexes to remove the loop. It performs exactly the same as my simpler version, to the point that they sometimes achieve the exact same iterations/sec.
        Aristotle2 => sub { my $str = shift; my $level = shift || return $str; $str =~ /.*\((.*?)\)/ and return $1 if $level == -1; local $_ = substr $str, 0, -$level; my $rx = $rx_hash{$level} ||= do { my $rx = join '', '\A', '(?>.*?\()' x $level; qr/$rx/; }; m/$rx/g && return substr $_, pos; },
        It needs my %rx_hash; at the top of the script somewhere of course.

        Makeshifts last the longest.

Re: Peeling the Peelings
by BrowserUk (Patriarch) on Jul 01, 2003 at 21:04 UTC

    This avoids loops (external to the regex at least). It might compare favorably if you remove the balance check code.

    sub peeln { local $_ = shift; my $n = shift; my($o, $c) = (tr[(][], tr[)][]); warn 'Unbalanced parens' unless $o == $c; $n = $c if $n == -1 or $n > $c; m[^ (?: .*? \( .*?){$n} (.+) (?: .*? \) .*?){$n} $]x; $1; } my @tests = ( 'hello(what(is(this(all(about)))))', 'hello(what(is(this(all(about))))))', 'hello((what(is(this(all(about)))))', 'A(B(c) D(e) f(g(h(i))))', ); for my $test ( @tests ) { print "$test : $_ : ", peeln( $test, $_ ) for -1 .. 5; } __END__ P:\>270595 hello(what(is(this(all(about))))) : -1 : about hello(what(is(this(all(about))))) : 0 : hello(what(is(this(all(about)) +))) hello(what(is(this(all(about))))) : 1 : what(is(this(all(about)))) hello(what(is(this(all(about))))) : 2 : is(this(all(about))) hello(what(is(this(all(about))))) : 3 : this(all(about)) hello(what(is(this(all(about))))) : 4 : all(about) hello(what(is(this(all(about))))) : 5 : about Unbalanced parens at P:\270595.pl8 line 28. Use of uninitialized value in print at P:\270595.pl8 line 42. hello(what(is(this(all(about)))))) : -1 : Unbalanced parens at P:\270595.pl8 line 28. hello(what(is(this(all(about)))))) : 0 : hello(what(is(this(all(about) +))))) Unbalanced parens at P:\270595.pl8 line 28. hello(what(is(this(all(about)))))) : 1 : what(is(this(all(about))))) Unbalanced parens at P:\270595.pl8 line 28. hello(what(is(this(all(about)))))) : 2 : is(this(all(about)))) Unbalanced parens at P:\270595.pl8 line 28. hello(what(is(this(all(about)))))) : 3 : this(all(about))) Unbalanced parens at P:\270595.pl8 line 28. hello(what(is(this(all(about)))))) : 4 : all(about)) Unbalanced parens at P:\270595.pl8 line 28. hello(what(is(this(all(about)))))) : 5 : about) Unbalanced parens at P:\270595.pl8 line 28. hello((what(is(this(all(about))))) : -1 : all(about Unbalanced parens at P:\270595.pl8 line 28. hello((what(is(this(all(about))))) : 0 : hello((what(is(this(all(about +))))) Unbalanced parens at P:\270595.pl8 line 28. hello((what(is(this(all(about))))) : 1 : (what(is(this(all(about)))) Unbalanced parens at P:\270595.pl8 line 28. hello((what(is(this(all(about))))) : 2 : what(is(this(all(about))) Unbalanced parens at P:\270595.pl8 line 28. hello((what(is(this(all(about))))) : 3 : is(this(all(about)) Unbalanced parens at P:\270595.pl8 line 28. hello((what(is(this(all(about))))) : 4 : this(all(about) Unbalanced parens at P:\270595.pl8 line 28. hello((what(is(this(all(about))))) : 5 : all(about Use of uninitialized value in print at P:\270595.pl8 line 42. A(B(c) D(e) f(g(h(i)))) : -1 : A(B(c) D(e) f(g(h(i)))) : 0 : A(B(c) D(e) f(g(h(i)))) A(B(c) D(e) f(g(h(i)))) : 1 : B(c) D(e) f(g(h(i))) A(B(c) D(e) f(g(h(i)))) : 2 : c) D(e) f(g(h(i)) A(B(c) D(e) f(g(h(i)))) : 3 : e) f(g(h(i) A(B(c) D(e) f(g(h(i)))) : 4 : g(h(i Use of uninitialized value in print at P:\270595.pl8 line 42. A(B(c) D(e) f(g(h(i)))) : 5 :

    It doesn't handle (?term?) nesting like this a(b(c) d(e)), but it was unclear to me what the return would be in these circumstances?


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


      Actually, your cosde is a lot slower, once again, go figure: Results:
      Benchmark: timing 15000 iterations of his, mine, mine-less-lgnt, other +... his: 5 wallclock secs ( 4.94 usr + 0.00 sys = 4.94 CPU) @ 30 +36.44/s (n=15000) mine: 5 wallclock secs ( 4.94 usr + 0.01 sys = 4.95 CPU) @ 30 +30.30/s (n=15000) mine-less-lgnt: 3 wallclock secs ( 3.43 usr + 0.00 sys = 3.43 CPU) +@ 4373.18/s (n=15000) other: 22 wallclock secs (20.89 usr + 0.03 sys = 20.92 CPU) @ 71 +7.02/s (n=15000)


      --Bob Niederman, http://bob-n.com

        UpdateThis version is competely wrong!! It is quick because it does nothing at all. I benchmarked this, but verified the output of a completely different, correct, but much slower piece of code.

        A subtle variation on my last version acheives a worthwhile speedup. It's about 50% quicker than my previous best and over twice as fast as the original.

        I wish there was a way to put a big red cross through the code as well.

        ## !!! DO NOT USE !!! TOTALLY BOGUS CODE. !!! sub peel2 { my( $s, $n ) = @_; my($start, $stop, $p, $q) = (length $s, 0, 0); ($start, $stop) = ($p, $q) while $p = 1+index( $s, ')', $stop ) < 0 and $q = rindex( $s, '(', $start ) < 0 and $n--; substr $s, $start, $stop - $start; } Rate his mine-less-lgnt buk2 + buk3 his 569/s -- -35% -45% + -62% mine-less-lgnt 875/s 54% -- -15% + -42% buk2 1026/s 80% 17% -- + -32% buk3 1502/s 164% 72% 46% + --

        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


        As my original attempt was so crap, I thought I'd take another crack. Abandoning any attempt at validation and avoiding the regex engine completely, I came up with this which seems to be about 15% better than the best so far.

        sub peel { my( $s, $n ) = @_; my( $start, $stop ) = ( 0, length $s ); ($start,$stop) = ( 1 + index( $s, '(', $start ), rindex( $s, ')', $stop -1 ) # reformatting_for_posting error corrected # Was ) while $n-- > 0 and index( $s, '(', $start +1 ) > 0; ) while $n-- and index( $s, '(', $start +1 ) > 0; substr $s, $start, $stop - $start; } Rate mine-less-lgnt buk2 mine-less-lgnt 918/s -- -15% buk2 1075/s 17% --

        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


        Oh well, worth a try:)


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


Re: Peeling the Peelings
by waswas-fng (Curate) on Jul 01, 2003 at 20:33 UTC
    How many loops is a normal run doing on your data? is it tightly lopped? you could see what qring the regex does to the time -- on some tight regex loops I have noticed 40 - 50% speed ups using a precompiled regex.

    -Waswas

    Ignore this i timed it and qr the regex is slower in this case (must be the () in there...).
Re: Peeling the Peelings
by bobn (Chaplain) on Jul 01, 2003 at 21:43 UTC
    I came up with this:
    sub prop3 { my $str = shift; my $level = shift || 0; return $str unless $level; # 3rd update my @str = split(/[()]/, $str); splice @str, 0, $level; my $out = join('(', @str); $out .= ')' x ( @str - 1 ); return $out; }
    But it's actually about 10% slower than the original. Go figure.


    update: more testing - it really is slower. Very aggravating to me, since it looks like good clean code to me.


    update again: Also came up with this, which is somewhat quicker, but uglier:
    sub getpropstr { local $_ = shift; my $level = shift || 0; if ( $level == -1 ) { /([^()]+)\)+$/; return $1 } while ( $level-- > 0) { chop; s/^[^(]+\(//; } return $_; }
    3rd update: line commented w/ '3rd update' in my first routine makes it closer to the same speed as the OP's original.

    --Bob Niederman, http://bob-n.com
Re: Peeling the Peelings
by yosefm (Friar) on Jul 01, 2003 at 19:11 UTC
    /([^\(]*\(){$level}(.*)\)?/ should get you the level you want ($level is used as a quantifier, inside curlies).
      There's a little quirk with this:
      "hello(what(is(this(all(about)))))" 4
      get_proparg returned: "all(about)"
      Fastpath: "all(about)))))"
      
      "Fastpath" is your solution. It also doesn't behave very well if I $level is a higher number than there already are nested levels - unfortunatedly you never know before how many there are.

      Bye
       PetaMem
          All Perl:   MT, NLP, NLU

Re: Peeling the Peelings
by traveler (Parson) on Jul 01, 2003 at 23:10 UTC
    This one appears faster for the n > -1 case. Compared to get_proparg it is only about 7% worse for the -1 case and is about 80% faster for the 0 case:
    sub arg2 { my($str,$lvl)= @_; my $str1 = $str; return $str if !$lvl; # skip up to and including nth ( paren, then strip n ) parens from + end $str =~ /([^(]*\(){$lvl}(.+)(\)){$lvl}/; return $2 unless $lvl == -1; $str =~ /\(([^\(\)]+)\)+/; return $1; }
    The timing results where lvl=2 are:
    Rate get_proparg arg2 get_proparg 137195/s -- -31% arg2 199115/s 45% --
    HTH, --traveler
      But for the original sample of data, it is hideous. Results:
      Benchmark: timing 15000 iterations of mine-elegant, mine-less-elegant, + original, yours-arg2... mine-elegant: 7 wallclock secs ( 5.19 usr + 0.00 sys = 5.19 CPU) @ +2890.17/s (n=15000) mine-less-elegant: 4 wallclock secs ( 3.19 usr + 0.01 sys = 3.20 CP +U) @ 4687.50/s (n=15000) original: 6 wallclock secs ( 5.06 usr + 0.02 sys = 5.08 CPU) @ 29 +52.76/s (n=15000) yours-arg2: 32 wallclock secs (25.06 usr + 0.08 sys = 25.14 CPU) @ 59 +6.66/s (n=15000)
      The lesson I'm taking away from this: simple regexes can be very fast, but time increasses rapidly with regex complexity.

      --Bob Niederman, http://bob-n.com
        They don't need to. You just have to tell the regex engine exactly what you want. The regex he used has no anchors - why? There's also a bunch of useless capturing parens, and in fact one paren pair that's completely superfluous. All of that is not what we wanted. I've got no Perl here, so I'll have to test this later, but I'm pretty positive that the following works as specified, and very certain that it'll perform tons better.
        /\A (?> (?> [^(]* \( ) {$lvl} ) ( .+ ) \) {$lvl} \z/x;

        Makeshifts last the longest.

        You know, someone ought to write a book about that (time increasing with regex complexity). :-)
        Phooey. The lessons here are: if it looks good and seems to test well, it still might not be good -- test more; somewhat counterintuitively, {} match counts are slower than loops; some optimizations (e.g. those from Aristotle's post) are not as much more effecient as they seem; and as tilly and bobn point out a correlary of the second lesson is that processing a string through multiple single REs is often better than one complex one.

        --traveler

Re: Peeling the Peelings
by bobn (Chaplain) on Jul 02, 2003 at 16:14 UTC
    update2: evdb came up w./ a test harness that works around the problem noted in update1

    update1: the 'evdb' destroys the data, so the benchmarks as run are not valid.

    <strikethrough>All my previous posts about speed are suspect. My current test numbers have suddenly become inconsistent with my previous numbers, even for relative assessments. I don't get it, but right now it looks like the OP's best is better than my best by a little (and of course the 'evdb' solution leaves us all in the dust).</strikethrough>

    --Bob Niederman, http://bob-n.com