a regex pattern how to understand

Thai Heng has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: a regex pattern how to understand by kcott (Archbishop) on Jul 09, 2015 at 02:12 UTC
G'day Thai Heng,

Perhaps your wording, "alphabetic characters *plus* a space", is a clue to what you're not understanding. I probably would have said "alphabetic characters *or* a space".

Some documentation that may help you:

- perlrecharclass - Perl Regular Expression Character Classes
- perlrecharclass: Bracketed Character Classes
- perlrecharclass: POSIX Character Classes

There are also some tools that can help you to understand regex patterns. I see you've already been shown "`use re 'debug';`": you may have found the output somewhat esoteric. Here are a couple more that may be more suitable.

Damian Conway's Regexp::Debugger is a particular favourite of mine. It provides a dynamic explanation of each step that the regex engine performs. Its output shows: the position in the string and the part of the regex trying to match at this position; what's been matched so far; what values `$1`, `$2`, etc. currently hold; and so on. The simplest usage is to just add `use Regexp::Debugger;` near the top of your code; run your code; and use '`s`' to step through each operation. See the documentation for more commands and other information.

YAPE::Regex::Explain is easy to use. It provides a static explanation of the regex you give it. While you're learning, and using regexes such as you have here, this should be fine; when you start looking at newer, more advanced regex constructs, this module won't help you (see its LIMITATIONS section for more details).

    #!/usr/bin/env perl

    use strict;
    use warnings;

    use YAPE::Regex::Explain;

    my $re = qr{Name:\s+([[:alpha:] ]+?)\s+Age:\s+(\d+)};

    print YAPE::Regex::Explain::->new($re)->explain;

Output:

    The regular expression:

    (?-imsx:Name:\s+([[:alpha:] ]+?)\s+Age:\s+(\d+))

    matches as follows:

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    (?-imsx:                 group, but do not capture (case-sensitive)
                             (with ^ and $ matching normally) (with . not
                             matching \n) (matching whitespace and #
                             normally):
    ----------------------------------------------------------------------
      Name:                    'Name:'
    ----------------------------------------------------------------------
      \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                               more times (matching the most amount
                               possible))
    ----------------------------------------------------------------------
      (                        group and capture to \1:
    ----------------------------------------------------------------------
        [[:alpha:] ]+?           any character of: letters, ' ' (1 or
                                 more times (matching the least amount
                                 possible))
    ----------------------------------------------------------------------
      )                        end of \1
    ----------------------------------------------------------------------
      \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                               more times (matching the most amount
                               possible))
    ----------------------------------------------------------------------
      Age:                     'Age:'
    ----------------------------------------------------------------------
      \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                               more times (matching the most amount
                               possible))
    ----------------------------------------------------------------------
      (                        group and capture to \2:
    ----------------------------------------------------------------------
        \d+                      digits (0-9) (1 or more times (matching
                                 the most amount possible))
    ----------------------------------------------------------------------
      )                        end of \2
    ----------------------------------------------------------------------
    )                        end of grouping
    ----------------------------------------------------------------------

-- Ken
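The "plus" versus "or" distinction above can be checked directly. The following is a minimal sketch, not part of the original reply: the variable names and the sample string 'John Smith: 42' are made up for illustration. It shows that `[[:alpha:] ]` matches exactly one character that is either a letter or a space, and that adding a quantifier allows letters and spaces in any order.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# A bracketed class matches exactly ONE character from its set.
# [[:alpha:] ] is the set "any letter, or a literal space".
my $class = qr{[[:alpha:] ]};

# Test single characters against the class, anchored to the whole string.
my %is_member;
for my $char ( 'a', 'Z', ' ', '5', ':' ) {
    $is_member{$char} = $char =~ /\A$class\z/ ? 1 : 0;
    printf "'%s' %s in the class\n", $char, $is_member{$char} ? 'is' : 'is not';
}

# With a quantifier, the class is applied once per character, so letters
# and spaces may be mixed freely (hypothetical sample string).
my ($run) = 'John Smith: 42' =~ /\A($class+)/;
printf "Longest run from the start: '%s'\n", $run;
```

The run stops at the colon because `:` is neither a letter nor a space.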
Re: a regex pattern how to understand by Anonymous Monk on Jul 08, 2015 at 22:14 UTC
> The `[[:alpha:] ]` matches alphabetic characters plus a space character

Correct.

> so I think "\s+Age:" should match one or more space and "Age:"

Correct.

> There is only one space between name and age.

And correct. And the code you show produces the output you show.

If you want to understand what the regex is doing, you can look at the gory details of how Perl handles it by adding `use re 'debug';` at the top of your code. Or, have a look at `https://regex101.com/r/pP4dN2/1`

Otherwise, I don't understand what exactly you are asking?
Re^2: a regex pattern how to understand by Thai Heng (Beadle) on Jul 08, 2015 at 22:54 UTC
The space is matched by the earlier `[[:alpha:] ]`, so it can't be matched again by the later `\s+Age:`. So I think the text can't match the regex pattern, because there is only one space between the name and the age.
Re^3: a regex pattern how to understand by Anonymous Monk on Jul 08, 2015 at 23:11 UTC
> The space is matched by the earlier `[[:alpha:] ]`, so it can't be matched again by the later `\s+Age:`. So I think the text can't match the regex pattern, because there is only one space between the name and the age.

Hmm, if perl matches it, why do you think that is? If you turn on `use re 'debug';` what do you see? I see

    149 <Princ> <e Age: 53>      |  21:  CLOSE1(23)
    149 <Princ> <e Age: 53>      |  23:  PLUS(25)
                                         SPACE can match 0 times out of 2147483647...
                                         failed...
                                         ANYOF[ A-Za-z][{unicode}+utf8::XPosixAlpha 00AA 00B5 00BA 00C0-00D6 00D8-00F6 00F8-02C1] can match 1 times out of 1...
    150 <rince> < Age: 53%n>     |  21:  CLOSE1(23)
    150 <rince> < Age: 53%n>     |  23:  PLUS(25)
                                         SPACE can match 1 times out of 2147483647...
    151 <ince > <Age: 53%nO>     |  25:  EXACT <Age:>(27)
    155 < Age:> < 53%nOccup>     |  27:  PLUS(29)
                                         SPACE can match 1 times out of 2147483647...

`+?` means match the least amount possible. The least amount doesn't include the space, because the next part of the pattern wants that space.
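The trace above can be imitated by hand. What follows is a rough simulation under stated assumptions, not how the regex engine is implemented: the string 'Prince Age: 53' is taken from the fragments visible in the trace, and the loop and its variable names are illustrative only. It grows the name one character at a time, as `+?` does, and after each step asks whether the rest of the pattern (`\s+Age:`) can match at that point.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Tail of the subject string, as seen in the debug trace.
my $tail = 'Prince Age: 53';

my $matched_length;
for my $length ( 1 .. length($tail) ) {
    my $candidate = substr( $tail, 0, $length );   # what ([[:alpha:] ]+?) would hold
    my $rest      = substr( $tail, $length );      # what is left for \s+Age:

    # Stop if the class itself can no longer cover the candidate.
    last unless $candidate =~ /\A[[:alpha:] ]+\z/;

    if ( $rest =~ /\A\s+Age:/ ) {
        $matched_length = $length;
        printf "'%s': the rest matches, so the capture is '%s'\n", $candidate, $candidate;
        last;
    }
    printf "'%s': \\s+Age: fails here, take one more character\n", $candidate;
}
```

The loop stops as soon as the remainder matches, which is why the single space is still available for `\s+`.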
Re^3: a regex pattern how to understand by Anonymous Monk on Jul 08, 2015 at 23:26 UTC
As the other anon already said, the `+?` quantifier is non-greedy: it matches as few characters as possible while still allowing the rest of the pattern to match. Even if the `?` is removed, the regular expression still works due to backtracking:

> For a regular expression to match, the entire regular expression must match, not just part of it. So if the beginning of a pattern containing a quantifier succeeds in a way that causes later parts in the pattern to fail, the matching engine backs up and recalculates the beginning part--that's why it's called backtracking.
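To see both points side by side, here is a small sketch comparing the lazy and greedy forms of the pattern. The sample strings and variable names are assumptions made for illustration; the second string adds extra spaces before "Age:" to show where the two quantifiers differ.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $lazy   = qr{Name:\s+([[:alpha:] ]+?)\s+Age:\s+(\d+)};
my $greedy = qr{Name:\s+([[:alpha:] ]+)\s+Age:\s+(\d+)};

# Hypothetical inputs: one space before "Age:", then three spaces.
my @samples = ( 'Name: Prince Age: 53', 'Name: Prince   Age: 53' );

my %captured;
for my $text (@samples) {
    # In list context a match returns its captures; keep the first one.
    my ($lazy_name)   = $text =~ $lazy;
    my ($greedy_name) = $text =~ $greedy;
    $captured{$text} = { lazy => $lazy_name, greedy => $greedy_name };

    printf "text:   '%s'\n  lazy:   '%s'\n  greedy: '%s'\n",
        $text, $lazy_name, $greedy_name;
}
```

Both forms match both strings. With a single space they capture the same name. With three spaces the greedy form gives back only as much as it must, so its capture keeps two trailing spaces, while the lazy form stops at 'Prince' and leaves all the whitespace to `\s+`.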
Re: a regex pattern how to understand by hexcoder (Curate) on Jul 09, 2015 at 08:21 UTC
Hello, try this instead:

`while ( $text =~ m<Name:\s+([[:alpha:] ]+?)\sAge:\s+(\d+)>g ) {`

I changed the `\s+Age` part to `\sAge`, since the space before 'Age' is already consumed by the previous construct, and so it is no longer available here.
Re^2: a regex pattern how to understand by Anonymous Monk on Jul 09, 2015 at 08:48 UTC
> the space before 'Age' is already consumed by the previous construct, and so it is here not available anymore.

Sorry, but that's just plain incorrect. The code provided by the OP works just fine.
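This is easy to check by running both loops on the same input. The sketch below is illustrative only: the two-record sample text (including the name "Ann Lee") and the array names are hypothetical, since the OP's original data is not shown in this thread. It collects the captures from the original `\s+Age:` pattern and from the `\sAge:` variant.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical two-record input with a single space before each "Age:".
my $text = "Name: Prince Age: 53\nName: Ann Lee Age: 7\n";

# The OP's pattern, with \s+ before "Age:".
my @original;
while ( $text =~ m<Name:\s+([[:alpha:] ]+?)\s+Age:\s+(\d+)>g ) {
    push @original, "$1/$2";
}

# The suggested variant, with a single \s before "Age:".
my @modified;
while ( $text =~ m<Name:\s+([[:alpha:] ]+?)\sAge:\s+(\d+)>g ) {
    push @modified, "$1/$2";
}

print "original: @original\n";
print "modified: @modified\n";
```

On this input both loops collect the same records, so the change to `\sAge` is not needed for the match to succeed.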