Re: improving speed in ngrams algorithm (updated)

Please ignore! Misunderstood question.

My answer treats ngrams on characters not words.

A regex should be faster; this debugger demo for n=3 should give you a start.

  DB<30> $str = join "", a..l

  DB<31> @res=()

  DB<32> for my $start (0..2) { pos($str) = $start; push @res, $str =~ m/(.{3})/g }

  DB<33> x @res
0  'abc'
1  'def'
2  'ghi'
3  'jkl'
4  'bcd'
5  'efg'
6  'hij'
7  'cde'
8  'fgh'
9  'ijk'

NB:

the order is not preserved (the n-grams come out grouped by start offset)
you may want to change the regex so that it does not match whitespace or punctuation.

(I know it's possible in a single regex without looping over start by playing around with \K or similar. I'll leave it to the regex gurus like tybalt to show it ;-)

HTH! :)

Cheers Rolf
_{(addicted to the Perl Programming Language :)

Wikisyntax for the Monastery
FootballPerl is like chess, only without the dice}

update

In case you really want to include non-letters, try unpack.
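
A minimal sketch of the unpack idea for character n-grams, assuming plain ASCII input. The helper names char_ngrams_unpack and char_ngrams_ordered are made up for illustration and are not from the thread. The first one mirrors the offset loop of the regex demo, so the n-grams again come out grouped by start offset; the second uses substr instead of unpack and keeps the original order.

```perl
use strict;
use warnings;

# Hypothetical helper: character n-grams via unpack, grouped by start offset.
sub char_ngrams_unpack {
    my ($str, $n) = @_;
    my @res;
    for my $start (0 .. $n - 1) {
        last if $start >= length $str;
        # "(a$n)*" cuts the rest of the string into chunks of $n characters;
        # the last chunk may be shorter, so it is filtered out.
        push @res, grep { length($_) == $n } unpack "(a$n)*", substr($str, $start);
    }
    return @res;
}

# Hypothetical helper: same n-grams in their original order, using substr.
sub char_ngrams_ordered {
    my ($str, $n) = @_;
    return map { substr($str, $_, $n) } 0 .. length($str) - $n;
}

my $str = join "", 'a' .. 'l';
print join(" ", char_ngrams_unpack($str, 3)),  "\n";
print join(" ", char_ngrams_ordered($str, 3)), "\n";
```

Neither variant skips whitespace or punctuation; every character counts.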


Replies are listed 'Best First'.
Re^2: improving speed in ngrams algorithm (updated) by Eily (Monsignor) on Jun 11, 2019 at 12:22 UTC
"A regex should be faster"

I would already doubt that a regex is faster than accessing array elements in normal circumstances, but here you seem to have missed the fact that the n-grams are made of words rather than chars. So your regex becomes `/((\w+\s?){3})/g`, where each char of (part of) the string is checked to find spaces. In IB2017's solution this is done once by the split.

"I know it's possible in a single regex without looping over start by playing around with \K or similar"

Lookahead assertions can help:

  DB<7> say for 'perlmonks' =~ /(?=(.{3}))./g
per
erl
rlm
lmo
mon
onk
nks

But it becomes cumbersome when working with words, `/(?=((\w+\s?){3}))\w+/g`, and is probably not faster.

"In case you really want to include non-letters try unpack"

unpack would probably be among the fastest solutions for character n-grams indeed.
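
A small sketch of the lookahead technique from this reply, restricted to word characters as the NB in the parent post suggests. The helper name letter_ngrams is an assumption for illustration only. The lookahead captures n word characters without consuming them, and the trailing \w moves the match position forward by one character, so the n-grams overlap and stay in order.

```perl
use strict;
use warnings;

# Hypothetical helper: overlapping character n-grams made of word characters only.
sub letter_ngrams {
    my ($str, $n) = @_;
    # (?=(\w{$n})) captures $n word characters without consuming them;
    # the \w after it consumes a single character, so the next attempt
    # starts one position further on.
    return $str =~ /(?=(\w{$n}))\w/g;
}

print join(" ", letter_ngrams("perl monks, hi", 3)), "\n";
```

N-grams never span a space or punctuation mark here, because every character of the window has to match \w.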
Re^3: improving speed in ngrams algorithm (updated) by LanX (Saint) on Jun 11, 2019 at 12:39 UTC
Seems like I misread the sample code. I saw `split //`, not `split / /`.

That's why I added the NB part saying to exclude whitespace and punctuation (which isn't done in the OP's code).

I haven't run ° it, but the code looks broken to me if the split wasn't meant to be per character. @string holding words doesn't make sense to me!

I don't think that you can effectively process a natural language without regex.

Cheers Rolf
_{(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery
FootballPerl is like chess, only without the dice}

Update

°) I ran it on my mobile and the output shows that the OP is looking for n words in a row. Hence we both misunderstood his definition of n-gram.

START INDEX: 0 :this is
START INDEX: 1 :is the
START INDEX: 2 :the text
START INDEX: 3 :text to
START INDEX: 4 :to play
START INDEX: 5 :play with
START INDEX: 0 :this is the
START INDEX: 1 :is the text
START INDEX: 2 :the text to
START INDEX: 3 :text to play
START INDEX: 4 :to play with
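
For reference, a minimal sketch of word n-grams as the output above suggests the OP wants them. This is not IB2017's code, which is not shown in this subthread; both helper names are hypothetical. word_ngrams splits once and takes array slices. word_ngrams_regex is a variant of the lookahead pattern from the reply above; it uses a non-capturing inner group and requires whitespace between words so that a single word cannot be counted as several.

```perl
use strict;
use warnings;

# Hypothetical helper: n-grams of whitespace-separated words via split + slices.
sub word_ngrams {
    my ($text, $n) = @_;
    my @words = split ' ', $text;    # split once, up front
    return map { join(' ', @words[$_ .. $_ + $n - 1]) } 0 .. @words - $n;
}

# Hypothetical helper: the same via one lookahead regex (assumes \w+ words).
sub word_ngrams_regex {
    my ($text, $n) = @_;
    my $k = $n - 1;    # number of words following the first one
    return $text =~ /(?=(\w+(?:\s+\w+){$k}))\w+/g;
}

my $text = "this is the text to play with";
for my $n (2, 3) {
    print "$_\n" for word_ngrams($text, $n);
}
```

The two helpers agree on this sample text; they can differ on input with punctuation, since split ' ' and \w+ tokenize differently.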
