While trying to take my own stab at NGrams (as an alternative solution), I stumbled across the following code after some profiling work:
This code generates NGrams from a given text in order to compare them with already existing NGram distributions - precomputed so-called "Language Models".

    for $word (@large_wordlist) {
        $word = '_'.$word.'_';
        my $len  = length($word);
        my $flen = $len;
        my $i;
        for ($i = 0; $i < $flen; $i++) {
            $$ngram{substr($word,$i,5)}++ if $len > 4;
            $$ngram{substr($word,$i,4)}++ if $len > 3;
            $$ngram{substr($word,$i,3)}++ if $len > 2;
            $$ngram{substr($word,$i,2)}++ if $len > 1;
            $$ngram{substr($word,$i,1)}++;
            $len--;
        }
    }
The problem is that when you have to process a large batch of incoming text files, the code above quickly becomes a bottleneck. Although I have the *feeling* that something about this code is inelegant and therefore wrong (lots of repetition), I cannot see how (if at all) to make it faster.
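For comparison, here is a minimal sketch of one possible restructuring, untested and not benchmarked: it replaces the five hard-coded increments with an inner loop over the gram length, and accumulates into a plain local hash before merging into the shared hashref. Only @large_wordlist and the $ngram hashref are taken from the snippet above; %counts, $padded and the merge step are assumptions for illustration, and whether the local hash actually helps would have to be measured.

    # Sketch only: same 1- to 5-gram counts as the original loop, assuming
    # @large_wordlist and a hashref $ngram as in the snippet above.
    my %counts;
    for my $word (@large_wordlist) {
        my $padded = '_' . $word . '_';
        my $len    = length($padded);
        for my $i (0 .. $len - 1) {
            my $max = $len - $i;       # characters remaining from position $i
            $max = 5 if $max > 5;      # only collect grams up to length 5
            $counts{ substr($padded, $i, $_) }++ for 1 .. $max;
        }
    }
    # Merge the local counts back into the shared language-model hash.
    $$ngram{$_} += $counts{$_} for keys %counts;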
Any hints?
Bye
PetaMem All Perl: MT, NLP, NLU