snowsky has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

Assume I have a variable $dino that has the value "1000". Will $dino match the pattern "\d*"? If so, why does my program give me an empty string after the match?

My program is below:

my $sentence = "I fear that i will be extinct after 1000 years";
if ($sentence =~ /(\d*)/) {
    print "That said '$1' years.\n";
}

result: That said '' years.

Need your help!!

Thanks,

Cindy

Replies are listed 'Best First'.
Re: question about the star(*) quantifier
by Eimi Metamorphoumai (Deacon) on Dec 21, 2006 at 17:32 UTC
    The problem is that /(\d*)/ allows zero or more digits, so it will match every string. Perl will first attempt to match starting at the beginning of the string, and match as many digits as possible. So it looks at the beginning, sees that it matches zero digits, and succeeds. The simple solution would be to use /(\d+)/ (which will force it to go until it finds at least one digit). Depending on your data and exact situation, there may be other approaches that would work better for you.
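    As a minimal sketch of that fix, applied to the sentence from the question: switching to \d+ makes the zero-length match unacceptable, so the engine keeps scanning until it finds a digit. The second pattern is one possible "other approach" and assumes the number always sits between the words "after" and "years". The variable names $plus_match and $context_match are made up for illustration.

```perl
use strict;
use warnings;

my $sentence = "I fear that i will be extinct after 1000 years";

# \d+ needs at least one digit, so the empty match at offset 0 is
# rejected and the engine moves right until it reaches "1000".
my ($plus_match) = $sentence =~ /(\d+)/;
print "That said '$plus_match' years.\n";

# Hypothetical alternative: keep \d*, but pin it to its context so
# the match cannot start at the beginning of the string.
my ($context_match) = $sentence =~ /after (\d*) years/;
print "With context: '$context_match'\n";
```

    Both prints show '1000'. The context version is only safe if the surrounding words are fixed; otherwise \d+ is the simpler choice.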
      Here's an illustration of what Eimi Metamorphoumai described.
      sub test {
          my ($sentence, $re) = @_;
          print("Sentence: $sentence\n");
          print("Regexp: $re\n");
          if ($sentence =~ /($re)/) {
              printf("Matched %d characters (%s) at pos %d\n", length($1), $1, $-[1]);
          } else {
              print("No match\n");
          }
          print("\n");
      }

      test("1234",    qr/\d*/);  # Matched 4 characters (1234) at pos 0
      test("abc1234", qr/\d*/);  # Matched 0 characters () at pos 0
      test("abc",     qr/\d*/);  # Matched 0 characters () at pos 0
      test("1234",    qr/\d+/);  # Matched 4 characters (1234) at pos 0
      test("abc1234", qr/\d+/);  # Matched 4 characters (1234) at pos 3 (*1)
      test("abc",     qr/\d+/);  # No match (*2)

      *1 - Since it failed at positions 0, 1 and 2.
      *2 - Since it failed at all positions.

      Update: Combined the code and the output to condense the node and improve readability.

      You would see similar behavior if you try this:
      my $sentence = "I fear that i will be extinct after 1000 or 2000 years";
      if ($sentence =~ /(\d+)/) {
          print "That said '$1' years.\n";
      }
      Since Perl is stopping at the first successful match, you'll only get the first number. Since the empty string is a successful match for \d*, that's what you're getting :)
        Untrue! Please try it and I think you'll see that 1000 is indeed matched. Perl's + quantifier is greedy, meaning it will match as much as it can. Perl also takes the first match it can find, which is why the original \d* pattern stopped at the start of the string. (Or perhaps you meant "1000" by "the first number". I took it to mean "1", which some regex engines might yield.)
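        A small sketch of the greedy behaviour described here, with the non-greedy form \d+? added for contrast (that form is not used elsewhere in this thread; it is included only to show what "1" would take). The variable names $greedy and $lazy are made up for illustration.

```perl
use strict;
use warnings;

my $sentence = "I fear that i will be extinct after 1000 or 2000 years";

# Greedy \d+: start at the first position where a digit is found,
# then take as many digits as possible.
my ($greedy) = $sentence =~ /(\d+)/;

# Non-greedy \d+?: same starting position, but take as few digits
# as the pattern allows (here, exactly one).
my ($lazy) = $sentence =~ /(\d+?)/;

print "greedy: '$greedy', non-greedy: '$lazy'\n";
```

        This prints greedy: '1000', non-greedy: '1'. In both cases the match starts at the leftmost digit; only the length differs.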

        -sam

Re: question about the star(*) quantifier
by samtregar (Abbot) on Dec 21, 2006 at 17:39 UTC
    To elaborate on the previous answer, Perl's regex engine finds the left-most match, stopping as soon as it gets a match. In this case that is the zero-length string at the beginning of your text. Other regex engines are different - some prefer the longest possible match, which would have yielded your expected "1000".

    A good book about regexes is Mastering Regular Expressions, which answers this question in much more detail and many others. I highly recommend it for all Perl programmers.

    -sam

      Hi Sam,
      Perl's regex engine finds the left-most match

      I would have phrased it like this: like most engines, Perl's looks by default for the longest leftmost match.

      % stephan@labaule (/home/stephan) %
      % echo "okay stephan do test it!" |
      > perl -lne 'm!(steph.?)!; print "seen: [$1]\n"'
      seen: [stepha]

      Actually, POSIX mandates the longest possible match for alternations, since all the alternatives start at the same place. The owl book calls that a POSIX NFA.

      % stephan@labaule (/home/stephan) %
      % echo "okay stephan do test it!" |
      > perl -lne 'm!(steph|steph.?)!; print "seen: [$1]\n"'
      seen: [steph]

      So yes, Perl's is not a POSIX NFA (for various efficiency reasons).
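      The same comparison as a standalone script, as a rough sketch: the two patterns contain the same alternatives and differ only in their order. The variable names $short_first and $long_first are made up for illustration.

```perl
use strict;
use warnings;

my $text = "okay stephan do test it!";

# Shorter alternative listed first: Perl accepts it and stops,
# even though the second alternative could match one more character.
my ($short_first) = $text =~ /(steph|steph.?)/;

# Same alternatives with the longer one listed first.
my ($long_first) = $text =~ /(steph.?|steph)/;

print "short first: [$short_first]\n";
print "long first:  [$long_first]\n";
```

      This prints [steph] and then [stepha]. A POSIX-style engine would be expected to report the longer match for both orderings.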

      Actually, not many engines follow POSIX on this: mostly the system libraries on Un*x (ugly, because too ancient) and ksh's. Even the Hackerlab library does (on second thought that seems normal, as it was supposed to be a drop-in replacement for the C lib on Un*x, and so for POSIX). Tcl's is a hybrid, so I am not sure. And every engine since Perl 5's birth has essentially copied Perl's.

      regards --stephan
      by the way ksh notation is interesting (zsh can masquerade as ksh for this if you don't have a ksh to try)
        I first wrote "longest left-most" and then revised it to just "left-most" when I remembered that Perl stops at the first matching alternation, longest or not. Maybe it's useful to think of it as "longest" anyway, but I prefer "greedy".

        -sam

      Thanks for all your quick replies, but I still have a question.

      The first match of the pattern /(\d*)/ should be a number, right, instead of the zero-length string? In my case, it should find the character '1' first and assign the value '1' to $1.

      Please advise. :)

        Nope. \d* means "zero or more digits", not "one or more digits". It is perfectly valid for \d* to match the empty string, which is "zero digits". If you want to be sure to match a number, use \d+, which means "one or more digits".

        -sam

        It's already been said above, but I'll restate it:

        \d* means explicitly: "Match 0 or more digits"

        That matches any sequence of characters that is "0 or more digits".

        At the very beginning of your string (leftmost portion), a string matches that - an empty string '' right before "I fear that...".

        Update: I guess samtregar beat me to it, but our message is the same :)



        --chargrill
        s**lil*; $*=join'',sort split q**; s;.*;grr; &&s+(.(.)).+$2$1+; $; = qq-$_-;s,.*,ahc,;$,.=chop for split q,,,reverse;print for($,,$;,$*,$/)
Re: question about the star(*) quantifier
by SFLEX (Chaplain) on Dec 21, 2006 at 17:38 UTC