in reply to Re: Re: tokenize plain text messages
in thread tokenize plain text messages
Try this version of your ....oneregex sub. (Phew! Long names:).
    D:\Perl\test>257026
    192.168.1.1 | $2.50 | Keep | example | and | is | This | 1,500 | an

              Rate   lists strings   regex
    lists    757/s      --    -56%    -61%
    strings 1736/s    129%      --    -10%
    regex   1930/s    155%     11%      --
As the regex never changes, there is no need to recompile it every time you call the sub, so I made it a constant. I would usually put the use constant .. inside the sub where it is used, to keep it tidy, but I got beaten up for that because it implies the constant is lexically scoped, which it isn't. It doesn't fool me, but it's your choice.
    use constant RE_WORDS => qr[(?:[^\w\'\$!,.-]|(?:(?<=\D)[.,])|(?:[.,](?=\D|$)))+];

    sub tokenize_msg_w_oneregex {
        my %words;
        @words{ split RE_WORDS, shift } = ();
        return keys %words;
    }
To be fair, a large part of the savings comes from avoiding the map and from not initialising every hash element to 1 when you will never use the value. Doing the split inside a hash slice avoids this. You could easily feed this saving back into your other subs, which would probably make your strings sub quickest again, but I thought I'd leave that AAEFTR :)
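The dedup-through-hash-keys trick is language-agnostic: split on a separator pattern, then use the tokens themselves as keys so no per-token value is ever built. A rough sketch of the same idea in Python (not the author's code; the separator class is simplified and omits the Perl version's lookaround handling of '.' and ',' next to digits):

```python
import re

# Simplified separator: any run of characters outside the "word-ish" set.
# The Perl original additionally treats '.' and ',' as separators only
# when they are not flanked by digits (via lookarounds).
SEP = re.compile(r"[^\w'$!,.\-]+")

def tokenize_msg(text):
    # A set plays the role of the Perl hash slice: keys only, no values.
    return {tok for tok in SEP.split(text) if tok}
```

Calling `tokenize_msg("This is an example and is an example")` returns each unique token once, just as the `keys %words` of the hash-slice version does.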
Replies are listed 'Best First'.

Re: Re: Re: Re: tokenize plain text messages
by revdiablo (Prior) on May 10, 2003 at 04:13 UTC
by BrowserUk (Patriarch) on May 10, 2003 at 06:03 UTC
by revdiablo (Prior) on May 10, 2003 at 06:51 UTC