in reply to Re: Re: tokenize plain text messages
in thread tokenize plain text messages
    $_ = 'And the man said: "Let there be music!"';
    while (1) {
        /\G\s*(?=\S)/gc or last;
        if (/\G(\w+)/gc) {
            print "Found a word: $1\n";
        } elsif (/\G(['"])/gc) {
            print "Found a quote: $1\n";
        } elsif (/\G([.,;:!?])/gc) {
            print "Found punctuation: $1\n";
        } else {
            /\G(?=(\S+))/gc;
            die sprintf "Don't know what to do with what I found next: %s (position %d)", $1, pos;
        }
    }
    print "Parsing completed successfully.\n";

Result:

    Found a word: And
    Found a word: the
    Found a word: man
    Found a word: said
    Found punctuation: :
    Found a quote: "
    Found a word: Let
    Found a word: there
    Found a word: be
    Found a word: music
    Found punctuation: !
    Found a quote: "
    Parsing completed successfully.

Try inserting something unrecognizable into your string, like an "=" character, for example.
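To see the failure path without editing the script by hand, here is a minimal sketch that wraps the same `\G` / `/gc` loop in a sub. The name `tokenize` and the `[type, text]` pairs it returns are my own choices for illustration, not part of the code above. It also reads the offending chunk with `substr` instead of a second `/gc` match, just so the error report does not touch `pos`.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the post above): returns a list of
# [type, text] pairs, or dies on input that no pattern recognizes.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    while (1) {
        # Skip whitespace; stop once only whitespace (or nothing) is left.
        $text =~ /\G\s*(?=\S)/gc or last;
        if ($text =~ /\G(\w+)/gc) {
            push @tokens, [ word => $1 ];
        } elsif ($text =~ /\G(['"])/gc) {
            push @tokens, [ quote => $1 ];
        } elsif ($text =~ /\G([.,;:!?])/gc) {
            push @tokens, [ punctuation => $1 ];
        } else {
            # /c kept pos() where the failed matches started, so report from there.
            my $pos = pos($text) // 0;
            my ($chunk) = substr($text, $pos) =~ /^(\S+)/;
            die sprintf "Don't know what to do with what I found next: %s (position %d)\n",
                $chunk, $pos;
        }
    }
    return @tokens;
}

# The "=" is not covered by any of the three patterns, so this dies.
my @tokens = eval { tokenize('x = 1') };
print $@ ? "Error: $@" : "Parsed " . scalar(@tokens) . " tokens\n";
```

With the string `'x = 1'` this should print `Error: Don't know what to do with what I found next: = (position 2)`. Catching the `die` with `eval` lets the caller decide whether an unknown character is fatal.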
In addition, I'd like to point out that there is at least one Lex module on CPAN: Parse::Lex. However, I am not familiar with how well it works.
Re: Re: Re: Re: tokenize plain text messages
by revdiablo (Prior) on May 11, 2003 at 09:02 UTC