Re: Efficient matching with accompanying data

Perl versions > 5.9.2 have a trie optimization within the regex engine.

That is, /(aaa|aab|aca)/ is internally optimized to (a(a(a|b)|ca)).

So if you organize your $lexicon so that the supplementary dictionary data is listed after each target word and delimited with something like "\0", you can search quite efficiently:

 
$patterns = join "|",@patterns;

@matches = ($lexicon =~ /\0\0($patterns)\0([^\0]+)/g );

(untested)
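A minimal sketch of how such a lexicon string could be built and queried. The record layout ("\0\0" before each word, "\0" before its data) is inferred from the regex above, and the sample words, their data and the names %entries and %found are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical entries: word => accompanying data (made up for this sketch).
my %entries = (day => 'noun', night => 'noun', knight => 'noun or verb');

# Assumed record layout: "\0\0" . word . "\0" . data, all in one long string.
my $lexicon = join '', map { "\0\0$_\0$entries{$_}" } sort keys %entries;

# Words to look up; quotemeta guards against regex metacharacters.
my @patterns = ('night', 'day');
my $patterns = join '|', map { quotemeta($_) } @patterns;

# With two capture groups, //g in list context returns a flat
# (word, data, word, data, ...) list, which loads directly into a hash.
my %found = ( $lexicon =~ /\0\0($patterns)\0([^\0]+)/g );

print "$_ => $found{$_}\n" for sort keys %found;
```

Because every word is anchored between "\0\0" and "\0", looking up "night" does not pick up the record for "knight".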

I successfully wrote a module parsing DB-dumps very efficiently like this.

Unfortunately the rights belong to my last employer, so you need to reinvent the wheel...:(

UPDATE

After rereading your post I have the impression that it's your lexicon which is static, while the "tweets" always change.

In this case you have to swap the logic: just once, build one long regex out of the phrases in your lexicon and match it against all tweets.

Take care to sort the phrases by length, longest first, because the first alternative that matches wins. This way you don't need to embed the dictionary data; just do a hash lookup with the matched word groups.

Cheers Rolf

( addicted to the Perl Programming Language)


Replies are listed 'Best First'.
Re^2: Efficient matching with accompanying data
by LanX (Saint) on Jul 11, 2013 at 01:43 UTC

proof of concept

  DB<137> %lexicon=("day"=>1,"night"=>2,"knight"=>3)
 => ("day", 1, "night", 2, "knight", 3)

  DB<138> $pattern = join "|",sort {length($b)<=>length( $a) } keys %lexicon
 => "knight|night|day"

  DB<139> $tweet= "today I will knight a guy I met last night"
 => "today I will knight a guy I met last night"

  DB<140> @matches =  ( $tweet =~ /($pattern)/g )
 => ("day", "knight", "night")

  DB<141>  @lexicon{@matches}
 => (1, 3, 2)

If you need word boundaries, try map { "\\b$_\\b" } between join and sort:

  DB<146> $pattern = join "|", map { "\\b$_\\b" } sort {length($b)<=>length( $a) } keys %lexicon
 => "\\bknight\\b|\\bnight\\b|\\bday\\b"

  DB<147> @matches =  ( $tweet =~ /($pattern)/g )
 => ("knight", "night")
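The debugger session above could be collected into a standalone script along these lines. This is only a sketch: the second tweet is made up, quotemeta and qr// are additions not in the session, and a single \b around the whole group stands in for the per-word \b, which assumes every phrase begins and ends with a word character.

```perl
use strict;
use warnings;

my %lexicon = (day => 1, night => 2, knight => 3);

# Longest phrases first; quotemeta escapes any regex metacharacters.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } keys %lexicon;

# Compile the regex once and reuse it for every tweet.
my $re = qr/\b($alternation)\b/;

my @tweets = (
    'today I will knight a guy I met last night',
    'a day and a night',    # made-up second tweet
);

for my $tweet (@tweets) {
    my @matches = $tweet =~ /$re/g;    # list of matched phrases
    print join(', ', map { "$_=$lexicon{$_}" } @matches), "\n";
}
```

The "day" inside "today" is not reported here, since the word boundary requirement excludes it.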

Cheers Rolf

( addicted to the Perl Programming Language)
