Perl versions > 5.9.2 have a trie optimization within the regex engine.

That is, /(aaa|aab|aca)/ is internally optimized to something like (a(a(a|b)|ca)).

So if you organize your $lexicon so that the supplementary dictionary data is listed after each target word, delimited with something like "\0", you can search quite efficiently:

$patterns = join "|", map quotemeta, @patterns;   # quotemeta keeps metacharacters in the words literal
@matches = ($lexicon =~ /\0\0($patterns)\0([^\0]+)/g);

(untested)
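A minimal, self-contained sketch of that idea follows. The record layout ("\0\0" before each word, "\0" between the word and its data), the sample words, and the %entries and %found names are assumptions made for illustration, not the layout of the module mentioned below.

```perl
use strict;
use warnings;

# Hypothetical lexicon: each record is "\0\0" . word . "\0" . data
my %entries = (
    apple  => 'fruit',
    apply  => 'verb',
    banana => 'fruit',
    cherry => 'fruit',
);
my $lexicon = join '', map { "\0\0$_\0$entries{$_}" } sort keys %entries;

# Words to look up; 'durian' is deliberately absent from the lexicon
my @patterns = ('apple', 'cherry', 'durian');

# quotemeta keeps any regex metacharacters in the words literal
my $patterns = join '|', map { quotemeta($_) } @patterns;

# In list context with /g the match returns all captures as a flat
# list (word, data, word, data, ...), which can be assigned to a hash
my %found = $lexicon =~ /\0\0($patterns)\0([^\0]+)/g;

print "$_ => $found{$_}\n" for sort keys %found;
```

Because the word must be followed by "\0", a lookup for "apple" cannot accidentally hit a longer entry that merely starts with it.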

I successfully wrote a module parsing DB-dumps very efficiently like this.

Unfortunately the rights belong to my last employer, so you need to reinvent the wheel...:(

UPDATE

After rereading your post, I have the impression that it's your lexicon which is static, while the "tweets" always change.

In this case you have to swap the logic: build one long regex from the phrases in your lexicon, just once, and match it against all tweets.

Take care to sort the phrases by length, longest first, because the first matching alternative wins. This way you don't need to embed the dictionary data; just do a hash lookup with the matched word groups.
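Here is a rough sketch of the swapped logic. The sample phrases, the tweets, the \b word-boundary anchors, and the /i flag are my own assumptions for illustration; adjust them to your data.

```perl
use strict;
use warnings;

# Hypothetical static lexicon: phrase => accompanying data
my %lexicon = (
    'new york'      => 'city',
    'new york city' => 'city (full name)',
    'perl'          => 'language',
);

# Longest phrases first, so "new york city" is tried before "new york"
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) or $a cmp $b } keys %lexicon;

# Compile once; \b keeps "perl" from matching inside "superlative"
my $re = qr/\b($alternation)\b/i;

my @tweets = (
    'Learning Perl in New York City',
    'new york is big',
);

my @hits;
for my $tweet (@tweets) {
    # Scalar-context /g walks through every match in the tweet
    while ($tweet =~ /$re/g) {
        my $phrase = lc $1;    # lexicon keys are lowercase
        push @hits, "$phrase => $lexicon{$phrase}";
    }
}
print "$_\n" for @hits;
```

The regex is compiled once with qr// and reused for every tweet, and the accompanying data comes from a plain hash lookup instead of being embedded in the searched string.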

Cheers Rolf

( addicted to the Perl Programming Language)


In reply to Re: Efficient matching with accompanying data by LanX
in thread Efficient matching with accompanying data by Endless
