in reply to Efficient matching with accompanying data
That is, /(aaa|aab|aca)/ is internally optimized to (a(a(a|b)|ca))
So if you organize your $lexicon so that the supplementary dictionary data is listed after each target word and delimited with something like "\0", you can search quite efficiently:
$patterns = join "|", @patterns;
@matches = ( $lexicon =~ /\0\0($patterns)\0([^\0]+)/g );
(untested)
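To make the record layout concrete, here is a minimal sketch of the idea above. It assumes each record is serialized as "\0\0", the word, "\0", then the data; the sample %dict, the search words and the %found hash are made up for illustration, and quotemeta is added on the assumption that the search terms are literal words rather than regexes.

```perl
use strict;
use warnings;

# Hypothetical sample data: target word => supplementary dictionary data.
my %dict = (
    aaa => 'data for aaa',
    aab => 'data for aab',
    aca => 'data for aca',
    abc => 'data for abc',
);

# Serialize once: every record is "\0\0" . word . "\0" . data.
my $lexicon = join '', map { "\0\0$_\0$dict{$_}" } sort keys %dict;

# Words to look up (assumed to be literal strings, hence quotemeta).
my @patterns = qw(aab aca zzz);
my $patterns = join '|', map { quotemeta($_) } @patterns;

# In list context with /g the match returns all captures as a flat list
# (word, data, word, data, ...), which can be assigned straight to a hash.
my %found = $lexicon =~ /\0\0($patterns)\0([^\0]+)/g;

printf "%s => %s\n", $_, $found{$_} for sort keys %found;
```

Words that are not in the lexicon (here "zzz") simply produce no pair. The data part must not be empty or contain "\0", since [^\0]+ is what ends each record.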
I successfully wrote a module parsing DB-dumps very efficiently like this.
Unfortunately the rights belong to my last employer, so you need to reinvent the wheel...:(
After rereading your post, I have the impression that it's your lexicon which is static while the "tweets" always change.
In this case you have to swap the logic: just once, produce a long regex out of the phrases in your lexicon and match it against all tweets.
Take care to sort the phrases by length, longest first, because the first match will rule. This way you don't need to embed the dictionary data; just do a hash lookup with the matching word groups.
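A rough sketch of the swapped logic could look like this. The %lexicon contents, the sample @tweets, the @hits list and the use of \b word boundaries are all assumptions for illustration, not taken from the original question.

```perl
use strict;
use warnings;

# Hypothetical static lexicon: phrase => accompanying data.
my %lexicon = (
    'new york'      => 'city',
    'new york city' => 'city (long form)',
    'york'          => 'city in England',
    'perl'          => 'programming language',
);

# Build the regex once. Longest phrases first, because the alternation
# takes the first alternative that matches at a given position.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) || $a cmp $b } keys %lexicon;
my $regex = qr/\b($alternation)\b/;

# Hypothetical changing input.
my @tweets = (
    'i love new york city and perl',
    'york is lovely',
);

# Collect [phrase, data] pairs; the data comes from a plain hash lookup.
my @hits;
for my $tweet (@tweets) {
    while ( $tweet =~ /$regex/g ) {
        push @hits, [ $1, $lexicon{$1} ];
    }
}

printf "%s => %s\n", @$_ for @hits;
```

Without the length sort, "new york" could win over "new york city" depending on its position in the alternation. The match here is case-sensitive; normalizing case on both sides would be a separate decision.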
Cheers Rolf
( addicted to the Perl Programming Language)
Re^2: Efficient matching with accompanying data
by LanX (Saint) on Jul 11, 2013 at 01:43 UTC