in reply to Re: What is the output for this ??
in thread What is the output for this ??

My program is:

    $_ = "I have 2 numbers: 53147";
    if (/(.*?)(\d+)/) {
        print "Beginning is <$1>,number is <$2>.\n";
    }

Output will be: $1= I have $2=2. Can you please explain?

Re^3: What is the output for this ??
by Marshall (Canon) on Dec 08, 2010 at 08:07 UTC
    Put <code> and </code> tags around your code when posting; it makes it much easier to read. When you show the output of a print statement, show exactly what was printed, not what $1 or $2 contained (i.e. not "$1= I have").

    Do it like this:

        $_ = "I have 2 numbers: 53147";
        if (/(.*?)(\d+)/) {
            print "Beginning is <$1>,number is <$2>.\n";
        }
        # prints: Beginning is <I have >,number is <2>.
        # Don't tell me that $1 = "I have ".
        # Just execute the print statement and show the output.

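    First, it is almost always a bad idea to assign to $_ explicitly: it is a global variable that many constructs (foreach, map, grep and so on) set implicitly, so your value is easy to clobber. I recommend against doing that. Better is: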

        use warnings;
        use strict;

        my $string = "I have 2 numbers: 53147";
        if ($string =~ /(.*?)(\d+)/) {
            print "Beginning is <$1>,number is <$2>.\n";
        }
        # prints: Beginning is <I have >,number is <2>.

    What you have here is called a "regular expression", or "regex": m/(.*?)(\d+)/ (the m, for "match", is implicit). The (.*?) term matches the shortest possible run of any characters other than newline (possibly even zero characters) that still allows the next regex term to match, if it is possible to do so.

    So basically, "(.*?)" means all characters up to but not including the first digit seen - the shortest string that doesn't include the first digit - note: this does include the space before the first digit seen. "(\d+)" means now that we have seen a digit, get me all digits that are sequential. This is how you get "I have " and then "2" for $1 and $2 respectively.

    You should experiment when faced with a regex like this. Change the string to be say: "I have 6718 numbers: 53147" and see what that prints. It will print: Beginning is <I have >,number is <6718>. "2" has now become "6718", just like the previous paragraph would lead you to believe would happen.
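
    The experiment above can be run over several strings at once. This is only a sketch: the helper name lazy_split and the last two test strings are made up for illustration and are not part of the original question. The third string shows the "zero characters" case for (.*?).

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns the two
# captures of the lazy pattern, or an empty list when nothing matches.
sub lazy_split {
    my ($string) = @_;
    return $string =~ /(.*?)(\d+)/ ? ($1, $2) : ();
}

for my $string ("I have 2 numbers: 53147",
                "I have 6718 numbers: 53147",
                "42 is the answer",
                "no digits here") {
    my @parts = lazy_split($string);
    if (@parts) {
        print "Beginning is <$parts[0]>,number is <$parts[1]>.\n";
    }
    else {
        print "No match for <$string>.\n";
    }
}
# prints:
# Beginning is <I have >,number is <2>.
# Beginning is <I have >,number is <6718>.
# Beginning is <>,number is <42>.
# No match for <no digits here>.
```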

    Now, let's experiment more. That ? in the first capture term matters a lot! The ? "minimizes" the length of the match. Let's say that we leave the ? character out:

        my $string = "I have 2 numbers: 53147";
        if ($string =~ /(.*)(\d+)/) {
            print "Beginning is <$1>,number is <$2>.\n";
        }
        # prints: Beginning is <I have 2 numbers: 5314>,number is <7>.

    That (.*) means: give me the longest possible string while still allowing (\d+) to match. The engine first lets (.*) swallow the whole string, then gives back one character at a time until (\d+) can match. Giving back the final "7" is enough, because a single digit satisfies "one or more digits". So (.*) keeps everything in front of the "7", and (\d+) matches just that single digit!
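
    To see the two behaviours side by side, here is a small sketch that runs the lazy and the greedy pattern against the same string. The third pattern, /(.*\D)(\d+)/, is an extra variation that is not in the original question: it forces the greedy prefix to end in a non-digit, so (\d+) receives the whole last number.

```perl
use strict;
use warnings;

my $string = "I have 2 numbers: 53147";

# Lazy: (.*?) takes as little as possible, so (\d+) starts at the first digit.
my ($lazy_head, $lazy_num) = $string =~ /(.*?)(\d+)/;

# Greedy: (.*) takes as much as possible, leaving (\d+) only the last digit.
my ($greedy_head, $greedy_num) = $string =~ /(.*)(\d+)/;

# Greedy prefix that must end in a non-digit (variation added for illustration).
my ($last_head, $last_num) = $string =~ /(.*\D)(\d+)/;

print "lazy:   <$lazy_head> <$lazy_num>\n";
print "greedy: <$greedy_head> <$greedy_num>\n";
print "last:   <$last_head> <$last_num>\n";
# prints:
# lazy:   <I have > <2>
# greedy: <I have 2 numbers: 5314> <7>
# last:   <I have 2 numbers: > <53147>
```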

    Let's say that you knew that there were at least two numbers (sequences of digits) in this string.

        my $string = "I have 325 numbers: 98765 12324";
        if ($string =~ /(\d+)\D+(\d+)/) {
            print "Beginning is <$1>,number is <$2>.\n";
        }
        # prints: Beginning is <325>,number is <98765>.

    The regex says: capture the first sequence of digits, ignore a sequence of one or more non-digits and then capture the next sequence of digits.
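
    If you do not know in advance how many numbers there are, one common approach is to use the /g modifier in list context, which returns every match of the capture group. This is a sketch of that idea rather than something from the original question; the variable name @numbers is just for illustration.

```perl
use strict;
use warnings;

my $string = "I have 325 numbers: 98765 12324";

# In list context, a /g match returns every captured substring.
my @numbers = $string =~ /(\d+)/g;

print "Found ", scalar(@numbers), " numbers: @numbers\n";
# prints: Found 3 numbers: 325 98765 12324
```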

    This whole business of regexes can become VERY complicated. The classic book on this is: Mastering Regular Expressions by Jeffrey Friedl. Fortunately, the vast majority of regexes don't require anywhere near the knowledge required to understand Friedl's book!

    In Perl:

        \d   a digit [0-9]
        \D   a non-digit
        \w   a word character [a-zA-Z0-9_]
        \W   a non-word character
        \s   a whitespace character [ \t\f\r\n]
        \S   a non-whitespace character

    is normally all you need to know along with some simple rules about minimal and maximal matches.
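
    A quick way to get a feel for these classes is to try them on a small string. The sample line below is made up for illustration and is not from the original question.

```perl
use strict;
use warnings;

# Made-up sample line containing a tab, a space and a trailing newline.
my $line = "id_7:\t42 apples\n";

my @digits = $line =~ /\d/g;      # each single digit
my @words  = $line =~ /\w+/g;     # runs of word characters
my @fields = split /\s+/, $line;  # pieces separated by runs of whitespace

print "digits: @digits\n";
print "words:  @words\n";
print "fields: ", join("|", @fields), "\n";
# prints:
# digits: 7 4 2
# words:  id_7 42 apples
# fields: id_7:|42|apples
```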
      Thanks a lot. It was a wonderful explanation. Thank you once again for your kind help.
Re^3: What is the output for this ??
by NetWallah (Canon) on Dec 08, 2010 at 06:58 UTC
    The output of your program is
    Beginning is <I have >,number is <2>.
    The ".*?" matches non-greedily (because of the ?), so the next "\d+" matches the first digit(s) it encounters, which is "2".

    This means that the first expression (.*?) gets everything before the "2".

         Syntactic sugar causes cancer of the semicolon.        --Alan Perlis