PerlPhi has asked for the wisdom of the Perl Monks concerning the following question:

hi,,,im a beginner on perl.actually it is my first server-side scripting language to deal with...for the reason that it is difficult for beginners.they say it is more on symbols unlike php with so many pre-defined routines, right?...but then i like to deal with the hard ones like assembly...

presently im reading the "Llama book".i started learning perl since april 1 this year.perl is great, and i think my summer is not a waste and im having a good time with it...except for one thing that i want to clarify about regular expression if you permit me.

my delima goes this way: there's a string like this "fred and barney went bowling last night", and if i want to match the words "fred" and "barney" my pattern goes this way (/fred.+barney/), right? but this pattern will take some time because i am using the greedy quantifier which is (.+), right? so to reduce the time it take to match like that one, i will use the non-greedy quantifier which is (.+?)...now my pattern will now look like this (/fred.+?barney/).

based on the book and from my own understanding the logic of the pattern (/fred.+barney/) goes this way:

  1. the fred word from the pattern will match to the string.
  2. the (.+) from the pattern will match the character next to the word fred, then the character 'a','n','d',' ', until it reaches 't' at the end of the string.
  3. then the word barney will be matched and if can't (.+) will match and that will be repeated until it reaches the character 'y', then the word barney will match again and because the word barney was present on the past characters that was tried to match (" and barney" <-- the characters that were matched lately), the word barney will now finally match, then the search for the pattern will now exit.

was my logic on pattern match using the greedy quantifier (.+) correct? if not, then what would be the correct way to understand perl's regular expression engine on that certain pattern?

my second question is about non-greedy quantifier (.+?)...based on the book and from my own understanding the logic of the pattern (/fred.+?barney/) goes this way:

  1. the fred word from the pattern will match to the string.
  2. the (.+?) from the pattern will match the character next to the word fred.
  3. the word barney will try to match and if can't (.+?) will match and that will be repeated until it reaches the character 'y', then the word barney will match again and because the word barney was present on the past characters that was tried to match (" and barney" <-- the characters that were matched lately), the word barney will now finally match, then the search for the pattern will now exit.

was it correct about my logic on non-greedy (.+?) quantifier? if not, then what would be the correct way to understand perl's regular expression engine on that certain pattern?

oww that's it...those are my two problems that needs answering...thanks in advance guys...keep deep and dark!

From: PerlPhi

  • Comment on About Greedy and Non-Greedy Regular Expressions

Replies are listed 'Best First'.
Re: About Greedy and Non-Greedy Regular Expressions
by marto (Cardinal) on Apr 25, 2007 at 12:40 UTC
Re: About Greedy and Non-Greedy Regular Expressions
by RMGir (Prior) on Apr 25, 2007 at 13:40 UTC
    PerlPhi, welcome to the Monastery.

    Your understanding is mostly correct. Under the hood, perl's RE engine does a lot of things to optimize how .+ and .+? work - you can see some of that if you compare

    perl -Mre=debug -e'"Fred and Barney went bowling last night. Barney h +ad soup."=~/Fred.+?Barney/;'
    and
    perl -Mre=debug -e'"Fred and Barney went bowling last night. Barney h +ad soup."=~/Fred.+Barney/;'
    But your "mental model" appears correct, if I'm following your long paragraph. :)

    As marto pointed out, please check out the formatting rules. You can edit your own post, and add <p> in between paragraphs - that will help people understand your question.


    Mike
Re: About Greedy and Non-Greedy Regular Expressions
by ww (Archbishop) on Apr 25, 2007 at 14:05 UTC
    Whew! That's hard to read. But -- hoping I've read it correctly,
    #!C:/perl/bin -w use strict; # regextiming.pl use vars qw( $string $ms $ms2 ); $string = "fred and barney went bowling last night"; { $ms = Win32::GetTickCount(); if ( $string =~ /fred.+barney/) { } else { print "No match\n\n"; } print "This match used ",Win32::GetTickCount() - $ms, " ticks on a w2k + box\n\n"; } # alt --OP expects this to be quicker { $ms2 = Win32::GetTickCount(); if ( $string =~ /fred.+?barney/) { } else { print "No match via second refex\n\n"; exit(); } print "The non-greedy match used ",Win32::GetTickCount() - $ms2, " tic +ks on a w2k box.\n"; }

    OUTPUT:
    F:\_wo\pl_test>perl regextiming.pl
    This match used 0 ticks on a w2k box

    The non-greedy match used 0 ticks on a w2k box

    Update:fixed spelling, grammar Prior discussion found with a casual search suggests that the identical timines times, to no_better_than one millisecond, is are or was were a limitation of w32. Opinion was divided as to whether Time::HiRes could do better.

    So...

    # regextiming2.pl #!C:/perl/bin -w use strict; use Time::HiRes 'time'; # regextiming2.pl use vars qw( $string $start $start2 $end $end2 ); $string = "fred and barney went bowling last night"; { $start = time(); if ( $string =~ /fred.+barney/) { } else { print "No match\n\n"; } $end = time(); print "This match used ",$end - $start, " as measured by Time::HiRes o +n a w2k box\n\n"; } { $start2 = time(); if ( $string =~ /fred.+barney/) { } else { print "No match\n\n"; } $end2 = time(); print "The second match used ",$end2 - $start2, " as measured by Time: +:HiRes on a w2k box\n\n"; }

    OUTPUT2:

    F:\_wo\pl_test>perl regextiming2.pl
    This match used 1.00135803222656e-005 as measured by Time::HiRes on a w2k box

    The second match used 6.91413879394531e-006 as measured by Time::HiRes on a w2k box

    QED ... I hope.

      A test with longer strings shows that the non-greedy regex can be about 25% faster (with perl 5.8.7). My test case regex is designed to match starting near the beginning of the string and ending in the middle.
      $ perl -MBenchmark=cmpthese -ne'chomp; $allwords.="/$_"; END{ cmpthese(-3, { "Greedy"=>sub { $allwords=~m{/abbreviate.+/initial/}; }, "Nongreedy"=>sub { $allwords=~m{/abbreviate.+?/initial/}; } } ); }' /usr/dict/words
      The result is:
                 Rate    Greedy Nongreedy
      Greedy    439/s        --      -22%
      Nongreedy 563/s       28%        --
      
      Of course, to repeat this test you'll need /usr/dict/words, and it has to contain abbreviate and initial.

      These results are with perl 5.8.7 - I'd be curious to see what happens with 5.8.10, since RE improvements could happen any time.


      Mike