A couple of things I noticed once I played with the expanded testcase: your strings version is capturing single spaces somewhere, and both your strings version and my regex are letting single $'s through as well.
I have a m/(..)/g version which avoids these two problems and is quicker than my split attempt, but still not as quick as your strings version in its current form.
our $RE_WORDS = qr[ (?: \$? \d+ (?:[.,] \d+ )*)+ | [\w\'!-]+ ]xo;

sub tokenize_msg_w_m {
    my %words;
    @words{ shift =~ m[($RE_WORDS)]og } = ();
    return keys %words;
}
Results
D:\Perl\test>257026
Regex: $ | want | 192.168.1.1 | $57 | confusing | yword | hopefully | combinations | LITTEL!!!!L | a | of | bit | is | This | to | will | text | this | 1,500 | Keep | out | BITH!!!!! | it | example | work | unhapp | $2.50 | little | MORE | I | some | thing | with | and | an
Strings: $ | | want | 192.168.1.1 | $57 | confusing | yword | hopefully | combinations | LITTEL!!!!L | a | of | bit | is | This | to | will | text | this | 1,500 | Keep | out | BITH!!!!! | it | example | work | unhapp | $2.50 | little | MORE | I | some | thing | with | and | an
Match: want | 192.168.1.1 | $57 | confusing | yword | hopefully | combinations | LITTEL!!!!L | a | of | bit | is | This | to | will | text | this | 1,500 | Keep | out | BITH!!!!! | it | example | work | unhapp | $2.50 | little | MORE | I | some | thing | with | and | an

          Rate   regex   match strings
regex    526/s      --    -13%    -26%
match    607/s     15%      --    -15%
strings  710/s     35%     17%      --
The other question that crossed my mind was, what happens if the text contains "Do 33% of people watch T.V.?" ?
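One quick way to check, at least for the match version, is to feed it that sentence. This is only a sketch: it reuses $RE_WORDS and tokenize_msg_w_m as posted, and the explicit $msg variable and the sorted output are additions of mine for readability. The strings version isn't reproduced in this post, so it isn't covered here.

```perl
use strict;
use warnings;

# Same pattern as posted: numbers (optionally $-prefixed, with . or , groups)
# or runs of word characters, apostrophes, bangs and hyphens.
our $RE_WORDS = qr[ (?: \$? \d+ (?:[.,] \d+ )*)+ | [\w\'!-]+ ]xo;

sub tokenize_msg_w_m {
    my ($msg) = @_;    # explicit variable instead of a bare shift (my change)
    my %words;
    # Hash slice keyed on every match, which dedups the tokens
    @words{ $msg =~ m[($RE_WORDS)]og } = ();
    return keys %words;
}

my @tokens = tokenize_msg_w_m('Do 33% of people watch T.V.?');

# Hash key order is arbitrary, so sort to get repeatable output
print join( ' | ', sort @tokens ), "\n";
# prints: 33 | Do | T | V | of | people | watch
```

So with this pattern the "%" is silently dropped, and "T.V." comes out as separate "T" and "V" tokens, because "." is only allowed between runs of digits. Whether that matters depends on what the tokens are used for.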
In reply to Re: Re: Re: Re: Re: tokenize plain text messages
by BrowserUk
in thread tokenize plain text messages
by revdiablo