in reply to tokenize plain text messages
If you don't find the Regexp::Common::Revdiablo :) module, this might give you a starting place. It's not tested much beyond what you see, and I think it could be simplified.
    $s = 'This, is, an, example. Keep $2.50, 1,500, and 192.168.1.1.';
    $re_revdiablo = qr[(?:[^\w\'\$!,.-]+|(?:(?<=\D)[.,])|(?:[.,](?=\D|$)))+];
    print join ' | ', split $re_revdiablo, $s;

Output:

    This | is | an | example | Keep | $2.50 | 1,500 | and | 192.168.1.1
I tried to use the /x modifier to break up the density of the regex, but that doesn't seem to work with split?
Update: I'm talking crap. /x does work with split provided you don't put spaces between the \ and the character it is escaping. D'oh!
    $re_revdiablo = qr[
        (?:                         # group, no capture
            [^\w\'\$!,.-]           # on anything not in your list
          | (?: (?<= \D ) [.,] )    # or . or , if preceded by a non numeric
          | (?: [.,] (?= \D | $)    # or . or , if followed by a non numeric or EOL
            )
        )+                          # 1 or more
    ]x;
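For anyone who wants to check the update for themselves, here is a minimal sketch that splits the same string with both the compact and the /x forms and compares the results. The names `$re_compact`, `$re_spaced` and `$re_escaped_space` are mine, not from the post, and the last pattern is only there to illustrate the "space after the backslash" trap as I understand it.

```perl
use strict;
use warnings;

my $s = 'This, is, an, example. Keep $2.50, 1,500, and 192.168.1.1.';

# Compact form of the separator pattern.
my $re_compact = qr[(?:[^\w\'\$!,.-]+|(?:(?<=\D)[.,])|(?:[.,](?=\D|$)))+];

# Same alternatives spread out under /x; unescaped whitespace is ignored.
my $re_spaced = qr[
    (?:
        [^\w\'\$!,.-]
      | (?<= \D ) [.,]
      | [.,] (?= \D | $)
    )+
]x;

my @compact = split $re_compact, $s;
my @spaced  = split $re_spaced,  $s;

print join( ' | ', @compact ), "\n";
print "both forms give the same tokens\n"
    if join( '|', @compact ) eq join( '|', @spaced );

# The trap: under /x, "\ " is an escaped (literal) space, so "\ d"
# means "a space followed by the letter d", not "a digit".
my $re_escaped_space = qr/\ d/x;
print "'a d' matches\n"        if 'a d' =~ $re_escaped_space;
print "'5' does not match\n"   if '5' !~ $re_escaped_space;
```

Keeping the backslash glued to the character it escapes is what lets the spaced-out version behave like the compact one.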