Actually, the most interesting behaviour of /PAT/g shows up not in list context but in scalar context. There, the next (!) regexp can continue where the previous one left off, by using the \G anchor. Conceptually it works like the ^ anchor, which anchors at the beginning of the string, except that \G anchors at the current value of pos() for this string, i.e. the position where the previous successful match ended. Also check out the /c modifier: by default a failed /g match resets pos() (to undef, so the next match starts over from the beginning of the string), and /c prevents that reset. So typically, such a lexer could look like this:
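$_ = 'And the man said: "Let there be music!"';
while (1) {
    # Skip whitespace; stop once nothing but whitespace (or nothing at all) is left.
    /\G\s*(?=\S)/gc or last;
    if (/\G(\w+)/gc) {
        print "Found a word: $1\n";
    } elsif (/\G(['"])/gc) {
        print "Found a quote: $1\n";
    } elsif (/\G([.,;:!?])/gc) {
        print "Found punctuation: $1\n";
    } else {
        # Thanks to /c, pos() still points at the unrecognized text.
        # Look at it via substr: a second zero-length \G match at the same
        # position would be refused if the whitespace skip matched nothing.
        my ($unknown) = substr($_, pos()) =~ /^(\S+)/;
        die sprintf "Don't know what to do with what I found next: %s (position %d)",
            $unknown, pos();
    }
}
print "Parsing completed successfully.\n";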
Result:
Found a word: And
Found a word: the
Found a word: man
Found a word: said
Found punctuation: :
Found a quote: "
Found a word: Let
Found a word: there
Found a word: be
Found a word: music
Found punctuation: !
Found a quote: "
Parsing completed successfully.
Try inserting something unrecognizable into your string, like an "=" character, for example.
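To see that failure branch in action without killing the whole script, here is a rough sketch that wraps the same loop in a sub and catches the die with eval. The sub name tokenize and the "word:"/"quote:"/"punct:" labels are made up for this illustration; they are not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the lexer loop: returns the tokens it found,
# or dies at the first chunk it cannot classify.
sub tokenize {
    local $_ = shift;
    my @tokens;
    while (1) {
        /\G\s*(?=\S)/gc or last;            # skip whitespace; stop at end of input
        if    (/\G(\w+)/gc)      { push @tokens, "word:$1" }
        elsif (/\G(['"])/gc)     { push @tokens, "quote:$1" }
        elsif (/\G([.,;:!?])/gc) { push @tokens, "punct:$1" }
        else {
            # /c kept pos() where the failed matches started
            my ($unknown) = substr($_, pos()) =~ /^(\S+)/;
            die sprintf "Don't know what to do with what I found next: %s (position %d)\n",
                $unknown, pos();
        }
    }
    return @tokens;
}

my @ok = tokenize('x, y');
print "@ok\n";                  # word:x punct:, word:y

my @bad = eval { tokenize('x = 1') };
my $err = $@;
print "Error: $err" if $err;    # ... found next: = (position 2)
```

The position in the message is a 0-based character offset into the string, so it can be fed straight back into substr when reporting errors.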
In addition, I'd like to point out that there is at least one Lex module on CPAN: Parse::Lex. However, I am not familiar with how well it works.
Wow. I was almost hesitant to post this SoPW at first, but now I'm glad I did, and more. Even though it wasn't the original intent of the post, I've sped the code up by about 40% (thanks to some back-and-forth with BrowserUk), and that makes me happy. More importantly, I've learned several new techniques.
Let me summarize the new things I learned, just for my own sake. :)
- @hash{ somelist } = ();
- m//g in list context
- m//g in scalar context
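For the record, a tiny sketch of those three idioms side by side. The sample string and the variable names are invented for this example and have nothing to do with the original code.

```perl
use strict;
use warnings;

my $text = 'red green red blue';

# m//g in list context: returns every match at once.
my @words = $text =~ /(\w+)/g;

# Hash slice assigned the empty list: every word becomes a key with an
# undef value, which is a cheap way to build a set of unique items.
my %seen;
@seen{@words} = ();
my @unique = sort keys %seen;

# m//g in scalar context: one match per iteration, with pos()
# remembering where the previous match ended.
my @ends;
while ($text =~ /(\w+)/g) {
    push @ends, pos($text);
}

print "words:  @words\n";     # red green red blue
print "unique: @unique\n";    # blue green red
print "ends:   @ends\n";      # 3 9 13 18
```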
Man, this is some good stuff. Many thanks all around.