PetaMem has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

While taking my own stab at NGrams (other solution), I stumbled across the following code after some profiling work:

for $word (@large_wordlist) {
    $word = '_'.$word.'_';
    my $len  = length($word);
    my $flen = $len;
    my $i;
    for ($i = 0; $i < $flen; $i++) {
        $$ngram{substr($word,$i,5)}++ if $len > 4;
        $$ngram{substr($word,$i,4)}++ if $len > 3;
        $$ngram{substr($word,$i,3)}++ if $len > 2;
        $$ngram{substr($word,$i,2)}++ if $len > 1;
        $$ngram{substr($word,$i,1)}++;
        $len--;
    }
}
This code generates NGrams from a given text in order to compare them with existing NGram distributions - precomputed, so-called "Language Models".

The problem is that if you have to process a huge batch of incoming text files, the code above quickly becomes a bottleneck. Though I have the *feeling* that there is something inelegant, and thus wrong, about this code (many repetitions), I cannot see how (if at all) to make it faster.
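
To make the repetition concrete: the five increment lines could be folded into an inner loop over the n-gram lengths. A minimal sketch (it should produce the same counts as the code above, assuming $ngram holds a hash reference as before; I have not benchmarked it):

for my $w (@large_wordlist) {
    my $word = "_${w}_";
    my $len  = length $word;
    for my $i (0 .. $len - 1) {
        # never read past the end of the word, so a truncated
        # substring is not counted a second time
        my $max = $len - $i < 5 ? $len - $i : 5;
        $$ngram{ substr($word, $i, $_) }++ for 1 .. $max;
    }
}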

Any hints?

Bye
 PetaMem
    All Perl:   MT, NLP, NLU

Replies are listed 'Best First'.
Re: Fasten up NGram generation code?
by Abigail-II (Bishop) on Jan 08, 2004 at 18:13 UTC
    While this won't be faster, you can replace the entire loop with the following one-liner:
    "_${_}_" =~ /(.{1,5})(?{$$ngram {$^N} ++})(?!)/ for @large_wordlist;

    Abigail

Re: Fasten up NGram generation code?
by ysth (Canon) on Jan 08, 2004 at 19:44 UTC
    Don't know about faster, but I think this is more elegant:
    ++$$ngram{$_} for grep defined, "_${word}_" =~ /
        (?=(((((.).).).).)) |
        (?=((((.).).).))    |
        (?=(((.).).))       |
        (?=((.).))          |
        (?=(.))
    /gx;
    Update: Perhaps I had better say, dog slow, but pretty. Thanks for the benchmark, Abigail. This version adds .'s to avoid MINMATCH and is only slightly faster.
    $ngram_y2 = {};
    for my $word (@large_wordlist) {
        ++ $$ngram_y2 {$_} for grep defined, "_${word}_" =~ /
            (?=(((((.).).).).)) . |
            (?=((((.).).).))    . |
            (?=(((.).).))       . |
            (?=((.).))          . |
            (?=(.))             .
        /gx;
    }
      More elegant? I do have to disagree with that. Your regex is awkward to modify for ngrams of different sizes. What if you want to count all substrings up to a length of 10? Or what if you want to count all substrings?

      Abigail

        To clarify, I meant more elegant than the OP's code, not than your regex.
Re: Fasten up NGram generation code?
by Abigail-II (Bishop) on Jan 09, 2004 at 09:28 UTC
    Here is a benchmark:
    #!/usr/bin/perl

    use strict;
    use warnings;

    use Benchmark qw /timethese cmpthese/;

    my $file = "/usr/share/dict/words";
    chomp (our @large_wordlist = `cat $file`);

    our ($ngram_f, $ngram_a, $ngram_y);

    cmpthese -60 => {
        fatvamp => '$ngram_f = {};
                    for my $w (@large_wordlist) {
                        $word = "_" . $w . "_";
                        my $len  = length($word);
                        my $flen = $len;
                        for (my $i = 0; $i < $flen; $i ++) {
                            $$ngram_f {substr ($word, $i, 5)} ++ if $len > 4;
                            $$ngram_f {substr ($word, $i, 4)} ++ if $len > 3;
                            $$ngram_f {substr ($word, $i, 3)} ++ if $len > 2;
                            $$ngram_f {substr ($word, $i, 2)} ++ if $len > 1;
                            $$ngram_f {substr ($word, $i, 1)} ++;
                            $len --;
                        }
                    }',
        abigail => '$ngram_a = {};
                    "_${_}_" =~ /(.{1,5})(?{$$ngram_a {$^N} ++})(?!)/
                        for @large_wordlist;',
        ysth    => '$ngram_y = {};
                    for my $word (@large_wordlist) {
                        ++ $$ngram_y {$_} for grep defined, "_${word}_" =~
                            /(?=(((((.).).).).))|(?=((((.).).).))|
                             (?=(((.).).))|(?=((.).))|(?=(.))/gx;
                    }',
    };

    my @keys_f = sort keys %$ngram_f;
    my @keys_a = sort keys %$ngram_a;
    my @keys_y = sort keys %$ngram_y;

    die "Wrong keys\n" unless "@keys_f" eq "@keys_a" &&
                              "@keys_f" eq "@keys_y";

    my @values_f = @$ngram_f {@keys_f};
    my @values_a = @$ngram_a {@keys_a};
    my @values_y = @$ngram_y {@keys_y};

    die "Wrong values\n" unless "@values_f" eq "@values_a" &&
                                "@values_f" eq "@values_y";

    __END__
              s/iter    ysth abigail fatvamp
    ysth        7.08      --    -48%    -62%
    abigail     3.70     92%      --    -28%
    fatvamp     2.68    164%     38%      --

    Abigail