
Thanks for pointing out an alternate way to do this. I was not previously aware of the match-with-/g-in-list-context behavior, but it looks very useful.
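For anyone else who missed it, here is a minimal sketch of what I understand the list-context behavior to be. The sample strings and variable names are my own, not taken from the code discussed in this thread.

```perl
use strict;
use warnings;

my $text = 'And the man said: "Let there be music!"';

# In list context, m//g returns every match at once.
# With no capture groups, each element is the whole matched substring.
my @words = $text =~ /\w+/g;
print "@words\n";              # And the man said Let there be music

# With capture groups, the captures of all matches come back as one
# flat list, which is handy for filling a hash with key/value pairs.
my $config = 'a=1 b=2';        # made-up sample input
my %pair   = $config =~ /(\w+)=(\w+)/g;
print "$pair{a} $pair{b}\n";   # 1 2
```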

Note: BrowserUk did show an example of this in our thread above, but I didn't really notice it until now. I guess it got lost in the noise (of my brain). :)

Re: Re: Re: tokenize plain text messages
by bart (Canon) on May 11, 2003 at 00:38 UTC
    Actually, the most interesting behaviour of /PAT/g shows up not in list context but in scalar context. There, the next (!) regexp can continue where the previous one left off, using the \G anchor. Conceptually, \G works like the ^ anchor, which anchors at the beginning of the string, except that it anchors at the current value of pos() for this string, which is the end of the previous successful match. Also check out the /c modifier, which keeps a failed match from resetting pos(); by default a failed match resets it, so the next match starts over from the beginning of the string. So typically, such a lexer could look like this:
    $_ = 'And the man said: "Let there be music!"';
    while(1) {
        /\G\s*(?=\S)/gc or last;
        if(/\G(\w+)/gc) {
            print "Found a word: $1\n";
        } elsif(/\G(['"])/gc) {
            print "Found a quote: $1\n";
        } elsif(/\G([.,;:!?])/gc) {
            print "Found punctuation: $1\n";
        } else {
            /\G(?=(\S+))/gc;
            die sprintf "Don't know what to do with what I found next: %s (position %d)", $1, pos;
        }
    }
    print "Parsing completed successfully.\n";
    Result:
    Found a word: And
    Found a word: the
    Found a word: man
    Found a word: said
    Found punctuation: :
    Found a quote: "
    Found a word: Let
    Found a word: there
    Found a word: be
    Found a word: music
    Found punctuation: !
    Found a quote: "
    Parsing completed successfully.
    
    Try inserting something unrecognizable into your string, like an "=" character, for example.
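To make that experiment easy to repeat, here is a rough sketch of the same idea wrapped in a sub. The name lex_string and the [type, text] token pairs are hypothetical, chosen only for illustration. It is a variation on the loop above rather than the same code: whitespace is skipped with \G\s+ and the offending text is picked out with substr.

```perl
use strict;
use warnings;

# Hypothetical helper: tokenize $str with \G and /gc, returning
# [type, text] pairs and dying on the first unrecognized character.
sub lex_string {
    my ($str) = @_;
    my @tokens;
    pos($str) = 0;                        # start scanning at the beginning
    while (1) {
        $str =~ /\G\s+/gc;                # skip whitespace; /c keeps pos() on failure
        last if pos($str) >= length $str; # nothing left to scan
        if    ($str =~ /\G(\w+)/gc)      { push @tokens, [ word  => $1 ] }
        elsif ($str =~ /\G(['"])/gc)     { push @tokens, [ quote => $1 ] }
        elsif ($str =~ /\G([.,;:!?])/gc) { push @tokens, [ punct => $1 ] }
        else {
            my $at = pos($str);           # still where the failed matches began
            my ($junk) = substr($str, $at) =~ /^(\S+)/;
            die sprintf "Don't know what to do with what I found next: %s (position %d)\n",
                $junk, $at;
        }
    }
    return @tokens;
}

my @ok = lex_string('Let there be music!');
print scalar(@ok), " tokens\n";           # 5 tokens

# The "=" is not covered by any branch, so the lexer dies at its position.
my $err = eval { lex_string('x = 1'); 1 } ? '' : $@;
print $err;   # Don't know what to do with what I found next: = (position 2)
```

Because every branch uses /c, a failed alternative leaves pos() untouched, which is what lets the next elsif try again at the same spot.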

    In addition, I'd like to point out that there is at least one Lex module on CPAN: Parse::Lex. However, I am not familiar with how well it works.

      Wow. I was almost hesitant to post this SoPW at first, but now I'm more than glad I did. Even though it wasn't the original intent of the post, I've sped the code up by about 40% (thanks to some back-and-forth with BrowserUk), and that makes me happy. More importantly, I've learned several new techniques.

      Let me summarize the new things I learned, just for my own sake. :)

      • @hash{ somelist } = ();
      • m//g in list context
      • m//g in scalar context
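As a note to self, here is a small sketch that combines the hash-slice idiom with m//g in scalar context. The stop-word list and sample text are invented for the example and are not from my actual code.

```perl
use strict;
use warnings;

my @stop = qw(the a an);       # made-up list of words to skip
my %is_stop;
@is_stop{@stop} = ();          # hash slice: creates the keys, values stay undef

my $text = 'the man heard a song';
my @kept;

# m//g in scalar context: each iteration resumes at pos($text),
# i.e. just after the previous match.
while ($text =~ /(\w+)/g) {
    next if exists $is_stop{$1};          # exists works even though values are undef
    push @kept, sprintf '%s@%d', $1, pos($text);
}
print "@kept\n";               # man@7 heard@13 song@20
```

The slice assignment only needs the keys to exist, so testing with exists (rather than truth) is the part that matters.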

      Man, this is some good stuff. Many thanks all around.