PetaMem has asked for the wisdom of the Perl Monks concerning the following question:
While taking my own stab at NGrams (other solution), some profiling work led me to the following code. It generates NGrams from a given text in order to compare them against already existing NGram distributions, the precomputed so-called "Language Models":

    for my $word (@large_wordlist) {
        $word = '_' . $word . '_';        # mark word boundaries
        my $len  = length($word);         # characters left from the current position
        my $flen = $len;                  # full (padded) word length
        for (my $i = 0; $i < $flen; $i++) {
            $$ngram{substr($word, $i, 5)}++ if $len > 4;
            $$ngram{substr($word, $i, 4)}++ if $len > 3;
            $$ngram{substr($word, $i, 3)}++ if $len > 2;
            $$ngram{substr($word, $i, 2)}++ if $len > 1;
            $$ngram{substr($word, $i, 1)}++;
            $len--;
        }
    }
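For example, the word "cat" becomes "_cat_" and produces the unigrams _, c, a, t, _ (the boundary marker is counted twice), the bigrams _c, ca, at, t_, the trigrams _ca, cat, at_, the 4-grams _cat and cat_, and the single 5-gram _cat_.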
The problem is that when a huge batch of incoming text files has to be processed, the code above quickly becomes a bottleneck. Though I have the *feeling* that there is something inelegant, and therefore wrong, with this code (many repetitions), I cannot see how (if at all) to make it faster.
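For what it's worth, one restructuring I have been toying with replaces the unrolled conditionals with an inner loop over the NGram lengths. This is only a minimal, unbenchmarked sketch (it also leaves @large_wordlist unmodified, unlike the code above):

    for my $word (@large_wordlist) {
        my $padded = '_' . $word . '_';
        my $flen   = length $padded;
        for my $i (0 .. $flen - 1) {
            # longest NGram that still fits at this position, capped at 5
            my $max = $flen - $i;
            $max = 5 if $max > 5;
            $$ngram{ substr($padded, $i, $_) }++ for 1 .. $max;
        }
    }

Whether the inner loop's overhead eats the savings from the dropped length tests is exactly what I cannot judge, hence the question.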
Any hints?
Bye
PetaMem
All Perl: MT, NLP, NLU
Replies are listed 'Best First'.
Re: Fasten up NGram generation code?
by Abigail-II (Bishop) on Jan 08, 2004 at 18:13 UTC

Re: Fasten up NGram generation code?
by ysth (Canon) on Jan 08, 2004 at 19:44 UTC
    by Abigail-II (Bishop) on Jan 09, 2004 at 09:18 UTC
    by ysth (Canon) on Jan 09, 2004 at 16:41 UTC
    by Abigail-II (Bishop) on Jan 09, 2004 at 20:43 UTC

Re: Fasten up NGram generation code?
by Abigail-II (Bishop) on Jan 09, 2004 at 09:28 UTC