While trying to take my own stab at NGrams (as an alternative solution), I stumbled across the following code after some profiling work:
This code generates NGrams from a given text in order to compare them with already existing NGram distributions - precomputed so-called "Language Models".

    for $word (@large_wordlist) {
        $word = '_'.$word.'_';
        my $len  = length($word);
        my $flen = $len;
        my $i;
        for ($i = 0; $i < $flen; $i++) {
            $$ngram{substr($word,$i,5)}++ if $len > 4;
            $$ngram{substr($word,$i,4)}++ if $len > 3;
            $$ngram{substr($word,$i,3)}++ if $len > 2;
            $$ngram{substr($word,$i,2)}++ if $len > 1;
            $$ngram{substr($word,$i,1)}++;
            $len--;
        }
    }
The problem is that when you have to process a large batch of incoming text files, the code above quickly becomes a bottleneck. Although I have the *feeling* that something about this code is inelegant and therefore wrong (lots of repetition), I cannot see how (if at all) to make it faster.
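For comparison, here is a minimal sketch of one possible restructuring, untested and not benchmarked: it replaces the five hard-coded increments with an inner loop over the gram length, and accumulates into a plain local hash before merging into the shared hashref. Only @large_wordlist and the $ngram hashref are taken from the snippet above; %counts, $padded and the merge step are assumptions for illustration, and whether the local hash actually helps would have to be measured.

    # Sketch only: same 1- to 5-gram counts as the original loop, assuming
    # @large_wordlist and a hashref $ngram as in the snippet above.
    my %counts;
    for my $word (@large_wordlist) {
        my $padded = '_' . $word . '_';
        my $len    = length($padded);
        for my $i (0 .. $len - 1) {
            my $max = $len - $i;       # characters remaining from position $i
            $max = 5 if $max > 5;      # only collect grams up to length 5
            $counts{ substr($padded, $i, $_) }++ for 1 .. $max;
        }
    }
    # Merge the local counts back into the shared language-model hash.
    $$ngram{$_} += $counts{$_} for keys %counts;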
Any hints?
Bye
PetaMem All Perl: MT, NLP, NLU