That's very nice... glad to see a way I can Do It All (tm) with one regex. The only problem is that it benchmarks about 35% slower than my _w_strings sub (put the other way, the _w_strings version is about 53% faster). Also, each one gives very slightly different results with a more complicated test message. Ugh.
Here's how I used your regex in a subroutine:
sub tokenize_msg_w_oneregex {
    my ($msg) = @_;
    my $re = qr{(?:[^\w\'\$!,.-]+|(?:(?<=\D)[.,])|(?:[.,](?=\D|$)))+};
    my %words = map {$_=>1} split $re, $msg;
    return keys %words;
}
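To make the regex's behaviour concrete, here is a small usage sketch. The sample message is made up for illustration and is not the one used in the benchmark. The separator pattern only treats a `.` or `,` as a delimiter when it is not sitting between two digits, so a number like `$1,200.50` should survive as a single token.

```perl
use strict;
use warnings;

# Same subroutine as in the post, repeated so this snippet runs on its own.
sub tokenize_msg_w_oneregex {
    my ($msg) = @_;
    my $re = qr{(?:[^\w\'\$!,.-]+|(?:(?<=\D)[.,])|(?:[.,](?=\D|$)))+};
    my %words = map { $_ => 1 } split $re, $msg;
    return keys %words;
}

# Hypothetical sample message (single-quoted so the $ is not interpolated).
my $msg = 'Hello, world. The price is $1,200.50 today, hello again!';

# keys returns the tokens in no particular order, so sort for stable output.
# (Assign first: "sort tokenize_msg_w_oneregex(...)" would be parsed as
# sort with a comparison sub.)
my @tokens = tokenize_msg_w_oneregex($msg);
@tokens = sort @tokens;

print join('|', @tokens), "\n";
# prints: $1,200.50|Hello|The|again!|hello|is|price|today|world
```

Note that `!` is in the set of characters the regex keeps, so `again!` stays intact, and that tokens are case-sensitive, so `Hello` and `hello` are counted separately.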
And here's the cmpthese output:
Benchmark: timing 10000 iterations of Lists, One Regex, Strings...
     Lists:  4 wallclock secs ( 4.15 usr +  0.00 sys =  4.15 CPU) @ 2409.64/s (n=10000)
 One Regex:  4 wallclock secs ( 3.56 usr +  0.00 sys =  3.56 CPU) @ 2808.99/s (n=10000)
   Strings:  2 wallclock secs ( 2.33 usr +  0.00 sys =  2.33 CPU) @ 4291.85/s (n=10000)
            Rate     Lists One Regex   Strings
Lists     2410/s        --      -14%      -44%
One Regex 2809/s       17%        --      -35%
Strings   4292/s       78%       53%        --
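For anyone who wants to reproduce this kind of comparison, here is a rough sketch of how two tokenizers can be checked against each other and then timed with `cmpthese`. The `tokenize_msg_standin` sub is a hypothetical placeholder that just splits on whitespace; it is not the _w_strings or _w_lists code, which isn't shown in this post. The sample message and the `only_in_first` helper are likewise assumptions made for illustration.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# The one-regex tokenizer from this post.
sub tokenize_msg_w_oneregex {
    my ($msg) = @_;
    my $re = qr{(?:[^\w\'\$!,.-]+|(?:(?<=\D)[.,])|(?:[.,](?=\D|$)))+};
    my %words = map { $_ => 1 } split $re, $msg;
    return keys %words;
}

# HYPOTHETICAL stand-in for a second tokenizer: plain whitespace split.
sub tokenize_msg_standin {
    my ($msg) = @_;
    my %words = map { $_ => 1 } split ' ', $msg;
    return keys %words;
}

# Tokens present in the first list but missing from the second, sorted.
sub only_in_first {
    my ($first, $second) = @_;
    my %seen = map { $_ => 1 } @$second;
    my @diff = grep { !$seen{$_} } @$first;
    return sort @diff;
}

my $msg = 'Hello, world. The price is $1,200.50 today!';   # made-up sample

my @regex_tokens   = tokenize_msg_w_oneregex($msg);
my @standin_tokens = tokenize_msg_standin($msg);

# Show where the two tokenizers disagree before trusting the timings.
my @only_regex   = only_in_first(\@regex_tokens,   \@standin_tokens);
my @only_standin = only_in_first(\@standin_tokens, \@regex_tokens);
print "only one-regex: @only_regex\n";
print "only stand-in:  @only_standin\n";

# A negative count means "run each for at least that many CPU seconds".
cmpthese(-1, {
    'One Regex' => sub { my @t = tokenize_msg_w_oneregex($msg) },
    'Stand-in'  => sub { my @t = tokenize_msg_standin($msg) },
});
```

Diffing the token lists first makes it easier to see whether a speed difference comes with a behaviour difference; the timings themselves will of course depend on the machine and the test message.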
In reply to Re: Re: tokenize plain text messages by revdiablo, in thread tokenize plain text messages by revdiablo