Try this version of your ....oneregex sub. (Phew! Long names:).

D:\Perl\test>257026
192.168.1.1 | $2.50 | Keep | example | and | is | This | 1,500 | an
          Rate   lists strings   regex
lists    757/s      --    -56%    -61%
strings 1736/s    129%      --    -10%
regex   1930/s    155%     11%      --

As the regex never changes, there is no need to recompile it every time you call the sub, so I made it a constant. I would usually put the use constant line inside the sub where it is used, to keep things tidy, but I got beaten up for that because it implies that the constant is lexically scoped, which it isn't. That doesn't fool me, but it is your choice.
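For anyone unsure about the scoping point, here is a minimal sketch. The names GREETING, first_sub and second_sub are made up for the illustration. Because use constant installs an ordinary package-level sub at compile time, a constant declared inside one sub is still visible in any code compiled after it in the same package:

```perl
use strict;
use warnings;

sub first_sub {
    # Declared here only for tidiness; it is NOT private to this sub.
    use constant GREETING => 'hello';
    return GREETING;
}

sub second_sub {
    # Still resolves, because GREETING is really the package sub main::GREETING.
    return GREETING;
}

print first_sub(),  "\n";    # hello
print second_sub(), "\n";    # hello
print main->can('GREETING') ? "package-level sub\n" : "not found\n";
```

So the placement is purely cosmetic: it documents where the constant is used, nothing more.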

use constant RE_WORDS => qr[(?:[^\w\'\$!,.-]|(?:(?<=\D)[.,])|(?:[.,](?=\D|$)))+];

sub tokenize_msg_w_oneregex {
    my %words;
    @words{ split RE_WORDS, shift } = ();
    return keys %words;
}
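A quick way to try the sub on its own. The sample sentence is my guess at the sort of input that gives the token list in the output above, not necessarily the original test data, and the tokens are sorted only because keys returns them in no particular order:

```perl
use strict;
use warnings;

use constant RE_WORDS => qr[(?:[^\w\'\$!,.-]|(?:(?<=\D)[.,])|(?:[.,](?=\D|$)))+];

sub tokenize_msg_w_oneregex {
    my %words;
    # Hash slice: every token becomes a key, duplicates collapse, values stay undef.
    @words{ split RE_WORDS, shift } = ();
    return keys %words;
}

# Assumed sample input; single quotes so that $2 is not interpolated.
my $msg    = 'This is an example. Keep $2.50, 1,500, and 192.168.1.1.';
my @tokens = sort( tokenize_msg_w_oneregex($msg) );

print join( ' | ', @tokens ), "\n";
# $2.50 | 1,500 | 192.168.1.1 | Keep | This | an | and | example | is
```

Note that dots and commas between digits survive, so $2.50, 1,500 and the IP address come through as single tokens, while sentence punctuation is treated as a separator.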

To be fair, a large part of the savings comes from avoiding the map and from not initialising every hash element to 1 when you will never use the value. Doing the split inside a hash slice avoids both. You can easily feed this saving back into your other subs, which would probably make your strings sub the quickest again. But I thought I'd leave that AAEFTR:)
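A small side-by-side of the two ways of de-duplicating, in case the difference isn't obvious. I'm assuming the map version looked something like map { $_ => 1 }; the token list and variable names here are invented for the example:

```perl
use strict;
use warnings;

my @tokens = qw(an example an is example This);

# map version: builds a temporary (key, 1) list and stores a value per key.
my %seen_map = map { $_ => 1 } @tokens;

# Hash slice version: creates the keys directly and leaves the values undef.
my %seen_slice;
@seen_slice{@tokens} = ();

my @from_map   = sort keys %seen_map;
my @from_slice = sort keys %seen_slice;

print "@from_map\n";      # This an example is
print "@from_slice\n";    # This an example is
```

Both give the same set of keys. The one thing to watch with the slice is that the values are undef, so test membership with exists $seen_slice{$word} rather than by truth of the value.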


In reply to Re: Re: Re: tokenize plain text messages by BrowserUk
in thread tokenize plain text messages by revdiablo
