The (?= (pattern)) overlapping capture trick may be useful here.
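To see why the lookahead matters, here is a minimal sketch (the shortened sentence and the variable names are mine, picked only for illustration). A plain capturing match consumes each bigram it finds, so the next attempt starts after it; wrapping the capture in (?= ...) makes the match zero-width, so the engine only steps forward one position and overlapping bigrams are captured too.

```perl
use strict;
use warnings;

my $sentence = 'this is the text';
my $bigram   = qr{ \b [[:alpha:]]+ \s+ [[:alpha:]]+ \b }xms;

# Plain /g matching consumes each match, so the next attempt starts after it.
my @consumed = $sentence =~ m{ ($bigram) }xmsg;

# Inside (?= ...) the overall match is zero-width, so matching resumes
# one character further on and overlapping bigrams are captured as well.
my @overlapping = $sentence =~ m{ (?= ($bigram)) }xmsg;

print 'consuming:   ', join(' | ', @consumed),    "\n";
print 'overlapping: ', join(' | ', @overlapping), "\n";
```

The first list misses 'is the' because 'is' was already swallowed by 'this is'; the second list keeps it.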

If you need to capture info like character offsets within the sentence, maybe something like:

c:\@Work\Perl\monks>perl -wMstrict -le
"my $sentence = 'this is the text to play with';
 ;;
 my $ngramWindow_MIN = 1;
 my $ngramWindow_MAX = 3;
 ;;
 my @word_ngrams;
 ;;
 for my $ngramWindow ($ngramWindow_MIN .. $ngramWindow_MAX) {
   my $m = $ngramWindow - 1;
   my $ngram = qr{ \b [[:alpha:]]+ (?: \s+ [[:alpha:]]+){$m} \b }xms;
   ;;
   while ($sentence =~ m{ (?= ($ngram)) }xmsg) {
     push @word_ngrams, [ $1, $-[1] ];
     }
   }
 ;;
 for my $ar_wng (@word_ngrams) {
   my ($word_ngram, $sentence_offset) = @$ar_wng;
   print qq{'$word_ngram' at sentence offset $sentence_offset};
   }
"
'this' at sentence offset 0
'is' at sentence offset 5
'the' at sentence offset 8
'text' at sentence offset 12
'to' at sentence offset 17
'play' at sentence offset 20
'with' at sentence offset 25
'this is' at sentence offset 0
'is the' at sentence offset 5
'the text' at sentence offset 8
'text to' at sentence offset 12
'to play' at sentence offset 17
'play with' at sentence offset 20
'this is the' at sentence offset 0
'is the text' at sentence offset 5
'the text to' at sentence offset 8
'text to play' at sentence offset 12
'to play with' at sentence offset 17

If the n-grams themselves are all you need, it should be faster to capture "naked" n-grams (no offsets, no per-match push):

c:\@Work\Perl\monks>perl -wMstrict -le
"my $sentence = 'this is the text to play with';
 ;;
 my $ngramWindow_MIN = 1;
 my $ngramWindow_MAX = 3;
 ;;
 for my $ngramWindow ($ngramWindow_MIN .. $ngramWindow_MAX) {
   print qq{$ngramWindow-word ngrams of '$sentence'};
   my $m = $ngramWindow - 1;
   my $ngram = qr{ \b [[:alpha:]]+ (?: \s+ [[:alpha:]]+){$m} \b }xms;
   ;;
   my @word_ngrams = $sentence =~ m{ (?= ($ngram)) }xmsg;
   ;;
   for my $word_ngram (@word_ngrams) {
     print qq{  '$word_ngram'};
     }
   }
"
1-word ngrams of 'this is the text to play with'
  'this'
  'is'
  'the'
  'text'
  'to'
  'play'
  'with'
2-word ngrams of 'this is the text to play with'
  'this is'
  'is the'
  'the text'
  'text to'
  'to play'
  'play with'
3-word ngrams of 'this is the text to play with'
  'this is the'
  'is the text'
  'the text to'
  'text to play'
  'to play with'

Tested under Perl version 5.8.9. (I haven't done any Benchmark-ing on any of this :)
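If someone wants to check the "faster" guess, something like the following rough sketch with the core Benchmark module might do. The sub names with_offsets and naked and the precompiled %ngram_rx hash are my own inventions for this comparison, and the numbers will depend on your Perl, your machine and (above all) your real input, so treat any result on this toy sentence with caution.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $sentence = 'this is the text to play with';
my ($min, $max) = (1, 3);

# Precompile one n-word pattern per window size (same regex as above).
my %ngram_rx;
for my $n ($min .. $max) {
    my $m = $n - 1;
    $ngram_rx{$n} = qr{ \b [[:alpha:]]+ (?: \s+ [[:alpha:]]+){$m} \b }xms;
}

# Variant 1: capture each n-gram together with its offset in the sentence.
sub with_offsets {
    my @found;
    for my $n ($min .. $max) {
        my $rx = $ngram_rx{$n};
        while ($sentence =~ m{ (?= ($rx)) }xmsg) {
            push @found, [ $1, $-[1] ];
        }
    }
    return \@found;
}

# Variant 2: capture only the "naked" n-grams in a single list-context match.
sub naked {
    my @found;
    for my $n ($min .. $max) {
        my $rx = $ngram_rx{$n};
        push @found, $sentence =~ m{ (?= ($rx)) }xmsg;
    }
    return \@found;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    with_offsets => \&with_offsets,
    naked        => \&naked,
});
```

A negative count tells cmpthese to run each sub for at least that many CPU seconds, which avoids having to guess a sensible iteration count.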

Give a man a fish: <%-{-{-{-<

In reply to Re: improving speed in ngrams algorithm by AnomalousMonk
in thread improving speed in ngrams algorithm by IB2017