The  (?= (pattern)) overlapping capture trick may be useful here.

If you need to capture info like offsets | character offsets within the sentence, maybe something like:

c:\@Work\Perl\monks>perl -wMstrict -le "my $sentence = 'this is the text to play with'; ;; my $ngramWindow_MIN = 1; my $ngramWindow_MAX = 3; ;; my @word_ngrams; ;; for my $ngramWindow ($ngramWindow_MIN .. $ngramWindow_MAX) { my $m = $ngramWindow - 1; my $ngram = qr{ \b [[:alpha:]]+ (?: \s+ [[:alpha:]]+){$m} \b }xms; ;; while ($sentence =~ m{ (?= ($ngram)) }xmsg) { push @word_ngrams, [ $1, $-[1] ]; } } ;; for my $ar_wng (@word_ngrams) { my ($word_ngram, $sentence_offset) = @$ar_wng; print qq{'$word_ngram' at sentence offset $sentence_offset}; } " 'this' at sentence offset 0 'is' at sentence offset 5 'the' at sentence offset 8 'text' at sentence offset 12 'to' at sentence offset 17 'play' at sentence offset 20 'with' at sentence offset 25 'this is' at sentence offset 0 'is the' at sentence offset 5 'the text' at sentence offset 8 'text to' at sentence offset 12 'to play' at sentence offset 17 'play with' at sentence offset 20 'this is the' at sentence offset 0 'is the text' at sentence offset 5 'the text to' at sentence offset 8 'text to play' at sentence offset 12 'to play with' at sentence offset 17

If it's all you need, it would be faster to capture "naked" n-grams:

c:\@Work\Perl\monks>perl -wMstrict -le "my $sentence = 'this is the text to play with'; ;; my $ngramWindow_MIN = 1; my $ngramWindow_MAX = 3; ;; for my $ngramWindow ($ngramWindow_MIN .. $ngramWindow_MAX) { print qq{$ngramWindow-word ngrams of '$sentence'}; my $m = $ngramWindow - 1; my $ngram = qr{ \b [[:alpha:]]+ (?: \s+ [[:alpha:]]+){$m} \b }xms; ;; my @word_ngrams = $sentence =~ m{ (?= ($ngram)) }xmsg; ;; for my $word_ngram (@word_ngrams) { print qq{ '$word_ngram'}; } } " 1-word ngrams of 'this is the text to play with' 'this' 'is' 'the' 'text' 'to' 'play' 'with' 2-word ngrams of 'this is the text to play with' 'this is' 'is the' 'the text' 'text to' 'to play' 'play with' 3-word ngrams of 'this is the text to play with' 'this is the' 'is the text' 'the text to' 'text to play' 'to play with'

Tested under Perl version 5.8.9. (I haven't done any Benchmark-ing on any of this :)


Give a man a fish:  <%-{-{-{-<


In reply to Re: improving speed in ngrams algorithm by AnomalousMonk
in thread improving speed in ngrams algorithm by IB2017

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.