Actually, the most interesting behaviour of /PAT/g shows up not in list context but in scalar context. There, the next (!) regexp can continue where the previous one left off, by using the \G anchor. Conceptually, \G works like the ^ anchor, which anchors at the beginning of the string, except that it anchors at the current value of pos() for this string, which is the end of the previous successful match. Also check out the /c modifier: by default a failed match resets pos() to the start of the string, and /c prevents that. So typically, such a lexer could look like this:
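$_ = 'And the man said: "Let there be music!"';
while (1) {
    # Skip whitespace; stop once nothing but whitespace is left.
    /\G\s*(?=\S)/gc or last;
    if (/\G(\w+)/gc) {
        print "Found a word: $1\n";
    } elsif (/\G(['"])/gc) {
        print "Found a quote: $1\n";
    } elsif (/\G([.,;:!?])/gc) {
        print "Found punctuation: $1\n";
    } else {
        # No /g here: \G still anchors at pos(), and pos() is left alone.
        # (A zero-length //g match right after another zero-length match
        # at the same position would be refused.)
        /\G(\S+)/;
        die sprintf "Don't know what to do with what I found next: %s (position %d)", $1, pos;
    }
}
print "Parsing completed successfully.\n";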
Result:
Found a word: And
Found a word: the
Found a word: man
Found a word: said
Found punctuation: :
Found a quote: "
Found a word: Let
Found a word: there
Found a word: be
Found a word: music
Found punctuation: !
Found a quote: "
Parsing completed successfully.
Try inserting something unrecognizable into your string, like an "=" character, for example.
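To make that experiment a bit easier to repeat, here's a rough sketch of the same loop wrapped in a sub that collects tokens instead of printing them. The name tokenize, the [type => text] pairs and the two test strings are my own inventions for illustration; only the regexes come from the lexer above.

```perl
use strict;
use warnings;

# Hypothetical helper: same loop as the lexer above, but it returns
# a list of [type => text] pairs instead of printing.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    for ($text) {    # alias $_ to $text so the bare matches and pos work on it
        while (1) {
            /\G\s*(?=\S)/gc or last;    # skip whitespace; stop at end of input
            if    (/\G(\w+)/gc)      { push @tokens, [ word  => $1 ] }
            elsif (/\G(['"])/gc)     { push @tokens, [ quote => $1 ] }
            elsif (/\G([.,;:!?])/gc) { push @tokens, [ punct => $1 ] }
            else {
                /\G(\S+)/;              # no /g: peek at the offender, keep pos()
                die sprintf "Don't know what to do with what I found next: %s (position %d)\n", $1, pos;
            }
        }
    }
    return @tokens;
}

my @ok = tokenize('x: "y"');

# The "=" is not covered by any branch, so the lexer dies; catch that.
my $error = '';
eval { tokenize('a = b'); 1 } or $error = $@;

print scalar(@ok), " tokens\n";
print "Error: $error";
```

The second call should complain about the "=" at position 2, which is where pos() was left after the whitespace was skipped.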
In addition, I'd like to point out that there is at least one Lex module on CPAN: Parse::Lex. However, I am not familiar with how well it works.
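Coming back to the /c modifier for a moment: if you want to see for yourself what it does to pos(), a tiny experiment along these lines should do. The string and the variable names are made up for the demo.

```perl
use strict;
use warnings;

my $str = 'abc 123';

# A successful scalar-context //g match leaves pos() just past the match.
my $matched    = $str =~ /\G[a-z]+/g;
my $after_word = pos($str);

# Without /c, a failed match resets pos() (it becomes undef again).
my $failed_plain = $str =~ /\G[a-z]+/g;    # fails: a space comes next
my $without_c    = pos($str);

# Put the position back by hand, then fail again, this time with /c.
pos($str) = $after_word;
my $failed_c = $str =~ /\G[a-z]+/gc;       # fails, but pos() is kept
my $with_c   = pos($str);

printf "after word: %d, without /c: %s, with /c: %d\n",
    $after_word, defined $without_c ? $without_c : 'undef', $with_c;
```

That kept position is what lets the lexer try one alternative after another at the same spot.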