Thai Heng has asked for the wisdom of the Perl Monks concerning the following question:

use Data::Dumper;    # provides Dumper(), used below

my $text = <<'END';
Name: Alice Allison Age: 23
Occupation: Spy
Name: Bob Barkely Age: 45
Occupation: Fry Cook
Name: Carol Carson Age: 44
Occupation: Manager
Name: Prince Age: 53
Occupation: World Class Musician
END

my %age_for;
while ( $text =~ m<Name:\s+([[:alpha:] ]+?)\s+Age:\s+(\d+)>g ) {
    $age_for{$1} = $2;
}
print Dumper(\%age_for);

The [[:alpha:] ] matches alphabetic characters plus a space character, so I think "\s+Age:" should match one or more spaces followed by "Age:". But there is only one space between the name and the age.

Why does the code still produce the following output?

$VAR1 = {
          'Bob Barkely' => '45',
          'Alice Allison' => '23',
          'Carol Carson' => '44',
          'Prince' => '53'
        };

This question comes from the book "Beginning Perl", page 234.

Replies are listed 'Best First'.
Re: a regex pattern how to understand
by kcott (Archbishop) on Jul 09, 2015 at 02:12 UTC

    G'day Thai Heng,

    Perhaps your wording, "alphabetic characters plus a space", is a clue to what you're not understanding. I probably would have said "alphabetic characters or a space".

    Some documentation that may help you:

    There are also some tools that can help you to understand regex patterns. I see you've already been shown "use re 'debug';": you may have found the output somewhat esoteric. Here's a couple more that may be more suitable.

    Damian Conway's Regexp::Debugger is a particular favourite of mine. It provides a dynamic explanation of each step that the regex engine performs. Its output shows: the position in the string and the part of the regex trying to match at this position; what's been matched so far; what values $1, $2, etc. currently hold; and so on. The simplest usage is to just add use Regexp::Debugger; near the top of your code; run your code; and use 's' to step through each operation. See the documentation for more commands and other information.

    YAPE::Regex::Explain is easy to use. It provides a static explanation of the regex you give it. While you're learning, and using regexes such as you have here, this should be fine; when you start looking at newer, more advanced regex constructs, this module won't help you (see its LIMITATIONS section for more details).

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use YAPE::Regex::Explain;

    my $re = qr{Name:\s+([[:alpha:] ]+?)\s+Age:\s+(\d+)};
    print YAPE::Regex::Explain::->new($re)->explain;

    Output:

    The regular expression:

    (?-imsx:Name:\s+([[:alpha:] ]+?)\s+Age:\s+(\d+))

    matches as follows:

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    (?-imsx:                 group, but do not capture (case-sensitive)
                             (with ^ and $ matching normally) (with . not
                             matching \n) (matching whitespace and #
                             normally):
    ----------------------------------------------------------------------
      Name:                    'Name:'
    ----------------------------------------------------------------------
      \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                               more times (matching the most amount
                               possible))
    ----------------------------------------------------------------------
      (                        group and capture to \1:
    ----------------------------------------------------------------------
        [[:alpha:] ]+?           any character of: letters, ' ' (1 or
                                 more times (matching the least amount
                                 possible))
    ----------------------------------------------------------------------
      )                        end of \1
    ----------------------------------------------------------------------
      \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                               more times (matching the most amount
                               possible))
    ----------------------------------------------------------------------
      Age:                     'Age:'
    ----------------------------------------------------------------------
      \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                               more times (matching the most amount
                               possible))
    ----------------------------------------------------------------------
      (                        group and capture to \2:
    ----------------------------------------------------------------------
        \d+                      digits (0-9) (1 or more times (matching
                                 the most amount possible))
    ----------------------------------------------------------------------
      )                        end of \2
    ----------------------------------------------------------------------
    )                        end of grouping
    ----------------------------------------------------------------------

    -- Ken

Re: a regex pattern how to understand
by Anonymous Monk on Jul 08, 2015 at 22:14 UTC
    The [[:alpha:] ] matches alphabetic characters plus a space character

    Correct,

    so I think "\s+Age:" should match one or more space and "Age:"

    correct,

    There is only one space between name and age.

    and correct. And the code you show produces the output you show.

    If you want to understand what the regex is doing, you can look at the gory details of how Perl handles it by adding use re 'debug'; at the top of your code. Or, have a look at https://regex101.com/r/pP4dN2/1

    Otherwise, I don't understand what exactly you are asking?
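    For anyone who wants to try the use re 'debug' suggestion above on a small scale, here is a minimal sketch. It assumes a one-record test string (my own shortening of the OP's data) and wraps the match in a bare block, since the pragma is lexically scoped; the trace goes to STDERR, so only the final line appears on STDOUT.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Shortened test data (an assumption; the OP's text has four records).
my $text = "Name: Prince Age: 53\n";

my ($name, $age);
{
    # Lexically scoped: only regexes compiled inside this block are traced.
    use re 'debug';
    ($name, $age) = $text =~ m<Name:\s+([[:alpha:] ]+?)\s+Age:\s+(\d+)>;
}

print "name=$name age=$age\n";
```

    Running it as perl script.pl 2>trace.txt keeps the trace in a file (script.pl and trace.txt are placeholder names) so it can be read at leisure.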

      The space is matched by the earlier [[:alpha:] ], so it can't be matched by the later \s+Age:. So I think the text can't match the regex pattern, because there is only one space between the name and the age.
        The space is matched by the earlier [[:alpha:] ], so it can't be matched by the later \s+Age:. So I think the text can't match the regex pattern, because there is only one space between the name and the age.

        Hmm, if perl matches it, why do you think that is?

        If you turn on use re 'debug' ; what do you see?

        I see

         149 <Princ> <e Age: 53>   |  21:  CLOSE1(23)
         149 <Princ> <e Age: 53>   |  23:  PLUS(25)
                                           SPACE can match 0 times out of 2147483647...
                                           failed...
                                         ANYOF[ A-Za-z][{unicode}+utf8::XPosixAlpha 00AA 00B5 00BA 00C0-00D6 00D8-00F6 00F8-02C1] can match 1 times out of 1...
         150 <rince> < Age: 53%n>  |  21:  CLOSE1(23)
         150 <rince> < Age: 53%n>  |  23:  PLUS(25)
                                           SPACE can match 1 times out of 2147483647...
         151 <ince > <Age: 53%nO>  |  25:  EXACT <Age:>(27)
         155 < Age:> < 53%nOccup>  |  27:  PLUS(29)
                                           SPACE can match 1 times out of 2147483647...

        +? means match the least amount possible. Here, the least amount that still lets the whole pattern match doesn't include the final space before "Age:", because the next part of the pattern (\s+) needs that space.
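        To see the "least amount possible" behaviour in isolation, here is a small sketch using a single record (my own test string, not the OP's full text). The same lazy class is matched once with nothing after it and once followed by \s+Age:.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# One record only (an assumption made to keep the example short).
my $text = 'Name: Alice Allison Age: 23';

# With nothing after it, the lazy class is satisfied by a single character.
my ($alone) = $text =~ m<Name:\s+([[:alpha:] ]+?)>;

# Followed by \s+Age:, it has to keep growing until \s+Age: can match.
my ($anchored) = $text =~ m<Name:\s+([[:alpha:] ]+?)\s+Age:>;

print "alone:    '$alone'\n";       # 'A'
print "anchored: '$anchored'\n";    # 'Alice Allison'
```

        Note that the second capture does contain a space (the one inside the name); it only leaves out the space that \s+ needs.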

        As the other anon already said, the +? modifier makes the expression non-greedy, meaning it doesn't consume all possible characters. Even if the ? is removed, the regular expression still works due to Backtracking:

        For a regular expression to match, the entire regular expression must match, not just part of it. So if the beginning of a pattern containing a quantifier succeeds in a way that causes later parts in the pattern to fail, the matching engine backs up and recalculates the beginning part--that's why it's called backtracking.
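        A quick way to check the backtracking claim is to run the lazy and the greedy form side by side. This is only a sketch on a one-record string of my own; the greedy class first grabs "Alice Allison Age" and then gives characters back until \s+Age: fits.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# One record only (an assumption made to keep the example short).
my $text = 'Name: Alice Allison Age: 23';

# Lazy: grows one character at a time until the rest of the pattern matches.
my ($lazy) = $text =~ m<Name:\s+([[:alpha:] ]+?)\s+Age:>;

# Greedy: takes as much as it can, then backtracks until the rest matches.
my ($greedy) = $text =~ m<Name:\s+([[:alpha:] ]+)\s+Age:>;

print "lazy:   '$lazy'\n";
print "greedy: '$greedy'\n";
print $lazy eq $greedy ? "same capture\n" : "different capture\n";
```

        Both forms end up capturing 'Alice Allison'; they differ only in the route the engine takes to get there.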
Re: a regex pattern how to understand
by hexcoder (Curate) on Jul 09, 2015 at 08:21 UTC
    Hello,

    try this instead

    while ( $text =~ m<Name:\s+([[:alpha:] ]+?)\s*Age:\s+(\d+)>g ) {

    I changed the \s+Age part to \s*Age, since the space before 'Age' is already consumed by the previous construct, and so it is here not available anymore.

      the space before 'Age' is already consumed by the previous construct, and so it is here not available anymore.

      Sorry, but that's just plain incorrect. The code provided by the OP works just fine.
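      For what it's worth, the two patterns can be compared directly. The sketch below uses a two-record subset of the OP's data (the line layout is my assumption) and collects the captures from the \s+ version and the \s* version into separate hashes.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Two records taken from the OP's data; the line layout is an assumption.
my $text = <<'END';
Name: Alice Allison Age: 23
Occupation: Spy
Name: Bob Barkely Age: 45
Occupation: Fry Cook
END

my (%plus, %star);

# The OP's original pattern, with \s+ before "Age:".
while ( $text =~ m<Name:\s+([[:alpha:] ]+?)\s+Age:\s+(\d+)>g ) {
    $plus{$1} = $2;
}

# The suggested variant, with \s* before "Age:".
while ( $text =~ m<Name:\s+([[:alpha:] ]+?)\s*Age:\s+(\d+)>g ) {
    $star{$1} = $2;
}

# Flatten each hash into a sorted string so the two can be compared.
my $plus_str = join ', ', map { "$_=$plus{$_}" } sort keys %plus;
my $star_str = join ', ', map { "$_=$star{$_}" } sort keys %star;

print "with \\s+: $plus_str\n";
print "with \\s*: $star_str\n";
```

      On this data both lines list the same name and age pairs. The two patterns would only differ on input with no whitespace at all before "Age:", which \s* accepts and \s+ rejects.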