in reply to Re: tokenize plain text messages
in thread tokenize plain text messages
That's very nice; glad to see a way I can Do It All (tm) with one regex. The only problem is speed: my _w_strings sub benchmarks about 53% faster (put the other way, the one-regex version runs about 35% slower). (Also, each one gives very slightly different results with a more complicated test message. Ugh.)
Here's how I used your regex in a subroutine:
    sub tokenize_msg_w_oneregex {
        my ($msg) = @_;

        my $re = qr{(?:[^\w\'\$!,.-]+|(?:(?<=\D)[.,])|(?:[.,](?=\D|$)))+};

        my %words = map {$_=>1} split $re, $msg;
        return keys %words;
    }
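To make the splitting rules concrete, here is a small sketch of the sub applied to a made-up message. The sample string is my own assumption, not the test message from this thread. It shows that a period or comma between digits survives (so `$3.50` stays whole), while punctuation next to a non-digit is treated as a separator and duplicate words collapse.

```perl
use strict;
use warnings;

# Same sub as in the post, repeated here so the example runs on its own.
sub tokenize_msg_w_oneregex {
    my ($msg) = @_;

    my $re = qr{(?:[^\w\'\$!,.-]+|(?:(?<=\D)[.,])|(?:[.,](?=\D|$)))+};

    # The hash keys de-duplicate the tokens.
    my %words = map { $_ => 1 } split $re, $msg;
    return keys %words;
}

# Hypothetical sample message (single-quoted so $3 is not interpolated).
my $sample = 'Hello, world. Hello again: price is $3.50, ok!';

# keys() returns tokens in no particular order, so sort before printing.
my @tokens = sort( tokenize_msg_w_oneregex($sample) );
print join( ' | ', @tokens ), "\n";
# prints: $3.50 | Hello | again | is | ok! | price | world
```

Note that `!` is in the "keep" character class, so `ok!` comes back with its exclamation mark attached. Also, a message that begins with a separator would yield one empty-string token, since split keeps a leading empty field.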
And here's the cmpthese output:
    Benchmark: timing 10000 iterations of Lists, One Regex, Strings...
         Lists:  4 wallclock secs ( 4.15 usr +  0.00 sys =  4.15 CPU) @ 2409.64/s (n=10000)
     One Regex:  4 wallclock secs ( 3.56 usr +  0.00 sys =  3.56 CPU) @ 2808.99/s (n=10000)
       Strings:  2 wallclock secs ( 2.33 usr +  0.00 sys =  2.33 CPU) @ 4291.85/s (n=10000)
                Rate     Lists One Regex   Strings
    Lists     2410/s        --      -14%      -44%
    One Regex 2809/s       17%        --      -35%
    Strings   4292/s       78%       53%        --
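For anyone wanting to reproduce a comparison like this, a minimal harness might look like the sketch below. It uses `cmpthese` from the core Benchmark module. The Lists and Strings subs are not shown in this post, so `tokenize_msg_placeholder` is a purely hypothetical stand-in (a plain whitespace split), and the sample message is also an assumption; the numbers it produces will not match the ones above.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample message, repeated to give the tokenizers some work.
my $msg = 'Hello, world. Hello again: price is $3.50, ok! ' x 20;

# Compile the regex once, outside the timed code.
my $re = qr{(?:[^\w\'\$!,.-]+|(?:(?<=\D)[.,])|(?:[.,](?=\D|$)))+};

sub tokenize_msg_w_oneregex {
    my ($text) = @_;
    my %words = map { $_ => 1 } split $re, $text;
    return keys %words;
}

# HYPOTHETICAL stand-in for the other tokenizers in the thread:
# it only splits on whitespace and is here just to have something to compare.
sub tokenize_msg_placeholder {
    my ($text) = @_;
    my %words = map { $_ => 1 } split ' ', $text;
    return keys %words;
}

# Run each sub 10000 times and print the timing lines plus the rate table.
my $results = cmpthese( 10000, {
    'One Regex'   => sub { my @t = tokenize_msg_w_oneregex($msg) },
    'Placeholder' => sub { my @t = tokenize_msg_placeholder($msg) },
} );
```

When reading the rate table, each cell says how much faster the row is than the column. With the figures above, Strings against One Regex is 4292/2809 - 1, roughly 53%, while One Regex against Strings is 2809/4292 - 1, roughly -35%.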