PetaMem has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

While taking my own stab at NGrams (other solution), I stumbled across the following code after some profiling work:

for $word (@large_wordlist) {
    $word = '_'.$word.'_';
    my $len  = length($word);
    my $flen = $len;
    my $i;
    for ($i = 0; $i < $flen; $i++) {
        $$ngram{substr($word,$i,5)}++ if $len > 4;
        $$ngram{substr($word,$i,4)}++ if $len > 3;
        $$ngram{substr($word,$i,3)}++ if $len > 2;
        $$ngram{substr($word,$i,2)}++ if $len > 1;
        $$ngram{substr($word,$i,1)}++;
        $len--;
    }
}
This code generates NGrams from a given text in order to compare them with existing NGram distributions - precomputed, so-called "Language Models".

The problem is that if you have to process a huge batch of incoming text files, the code above quickly becomes a bottleneck. Though I have the *feeling* that there is something inelegant, and thus wrong, about this code (many repetitions), I cannot see how (if at all) to make it faster.
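
To make the repetition concrete: the five increment lines could be folded into an inner loop over the n-gram lengths. A minimal sketch (it should produce the same counts as the code above, assuming $ngram holds a hash reference as before; I have not benchmarked it):

for my $w (@large_wordlist) {
    my $word = "_${w}_";
    my $len  = length $word;
    for my $i (0 .. $len - 1) {
        # never read past the end of the word, so a truncated
        # substring is not counted a second time
        my $max = $len - $i < 5 ? $len - $i : 5;
        $$ngram{ substr($word, $i, $_) }++ for 1 .. $max;
    }
}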

Any hints?

Bye
 PetaMem
    All Perl:   MT, NLP, NLU

Replies are listed 'Best First'.
Re: Fasten up NGram generation code?
by Abigail-II (Bishop) on Jan 08, 2004 at 18:13 UTC
    While this won't be faster, you can replace the entire loop with the following one-liner:
    "_${_}_" =~ /(.{1,5})(?{$$ngram {$^N} ++})(?!)/ for @large_wordlist;

    Abigail

Re: Fasten up NGram generation code?
by ysth (Canon) on Jan 08, 2004 at 19:44 UTC
    Don't know about faster, but I think this is more elegant:
    ++$$ngram{$_} for grep defined, "_${word}_" =~ /
        (?=(((((.).).).).)) |
        (?=((((.).).).))    |
        (?=(((.).).))       |
        (?=((.).))          |
        (?=(.))
    /gx;
    Update: Perhaps I had better say, dog slow, but pretty. Thanks for the benchmark, Abigail. This version adds .'s to avoid MINMATCH and is only slightly faster.
    $ngram_y2 = {};
    for my $word (@large_wordlist) {
        ++ $$ngram_y2 {$_} for grep defined, "_${word}_" =~ /
            (?=(((((.).).).).)) . |
            (?=((((.).).).))    . |
            (?=(((.).).))       . |
            (?=((.).))          . |
            (?=(.))             .
        /gx;
    }
      More elegant? I do have to disagree with that. Your regex is awkward to modify for ngrams of different sizes. What if you want to count all substrings up to a length of 10? Or what if you want to count all substrings?

      Abigail

        To clarify, I meant more elegant than the OP's code, not than your regex.
Re: Fasten up NGram generation code?
by Abigail-II (Bishop) on Jan 09, 2004 at 09:28 UTC
    Here is a benchmark:
    #!/usr/bin/perl

    use strict;
    use warnings;

    use Benchmark qw /timethese cmpthese/;

    my $file = "/usr/share/dict/words";
    chomp (our @large_wordlist = `cat $file`);

    our ($ngram_f, $ngram_a, $ngram_y);

    cmpthese -60 => {
        fatvamp => '$ngram_f = {};
                    for my $w (@large_wordlist) {
                        $word = "_" . $w . "_";
                        my $len  = length($word);
                        my $flen = $len;
                        for (my $i = 0; $i < $flen; $i ++) {
                            $$ngram_f {substr ($word, $i, 5)} ++ if $len > 4;
                            $$ngram_f {substr ($word, $i, 4)} ++ if $len > 3;
                            $$ngram_f {substr ($word, $i, 3)} ++ if $len > 2;
                            $$ngram_f {substr ($word, $i, 2)} ++ if $len > 1;
                            $$ngram_f {substr ($word, $i, 1)} ++;
                            $len --;
                        }
                    }',
        abigail => '$ngram_a = {};
                    "_${_}_" =~ /(.{1,5})(?{$$ngram_a {$^N} ++})(?!)/
                        for @large_wordlist;',
        ysth    => '$ngram_y = {};
                    for my $word (@large_wordlist) {
                        ++ $$ngram_y {$_} for grep defined, "_${word}_" =~
                            /(?=(((((.).).).).))|(?=((((.).).).))|
                             (?=(((.).).))|(?=((.).))|(?=(.))/gx;
                    }',
    };

    my @keys_f = sort keys %$ngram_f;
    my @keys_a = sort keys %$ngram_a;
    my @keys_y = sort keys %$ngram_y;

    die "Wrong keys\n" unless "@keys_f" eq "@keys_a" &&
                              "@keys_f" eq "@keys_y";

    my @values_f = @$ngram_f {@keys_f};
    my @values_a = @$ngram_a {@keys_a};
    my @values_y = @$ngram_y {@keys_y};

    die "Wrong values\n" unless "@values_f" eq "@values_a" &&
                                "@values_f" eq "@values_y";

    __END__
              s/iter    ysth abigail fatvamp
    ysth        7.08      --    -48%    -62%
    abigail     3.70     92%      --    -28%
    fatvamp     2.68    164%     38%      --

    Abigail