Re^2: improving speed in ngrams algorithm (updated)

A regex should be faster

I would already doubt that a regex is faster than accessing array elements under normal circumstances, but here you seem to have missed the fact that the n-grams are made of words rather than chars. So your regex becomes /((\w+\s?){3})/g, where each char of (part of) the string is checked to find spaces. In IB2017's solution this is done once, by the split.
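
To illustrate the point about splitting once: a minimal sketch of word n-grams built from array slices. This is not IB2017's actual code (which isn't shown here); the helper name word_ngrams and the sample sentence are just assumptions for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns all word n-grams of size $n found in $text.
sub word_ngrams {
    my ($text, $n) = @_;
    my @words = split ' ', $text;    # tokenize once, on whitespace
    return () if @words < $n;        # not enough words for even one n-gram
    # every n-gram is an array slice of $n consecutive words, joined back together
    return map { join ' ', @words[ $_ .. $_ + $n - 1 ] } 0 .. @words - $n;
}

print "$_\n" for word_ngrams('this is the text to play with', 3);
```

After the split, the string is never scanned again; each n-gram only costs a slice and a join.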

I know it's possible in a single regex without looping over start by playing around with \K or similar

Lookahead assertions can help:

  DB<7> say for 'perlmonks' =~ /(?=(.{3}))./g
per
erl
rlm
lmo
mon
onk
nks

But it becomes cumbersome when working with words, e.g. /(?=((\w+\s?){3}))\w+/g, and is probably not faster.
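
For what it's worth, here is a sketch of how the word-level lookahead could be wrapped up. The \b anchors are my addition: as far as I can tell, without them backtracking can cut the last word into pieces once fewer than n whole words remain. The helper name word_ngrams_regex is made up for the example, and words are assumed to be separated by whitespace only.

```perl
use strict;
use warnings;

# Hypothetical helper: word n-grams via a capturing lookahead.
sub word_ngrams_regex {
    my ($text, $n) = @_;
    my $rest = $n - 1;    # words that must follow the first one
    # \b            : only start at a word boundary
    # (?=( ... ))   : capture $n words without consuming them
    # trailing \w+  : consume one word, so the next match starts at the next word
    return $text =~ /\b(?=(\w+(?:\s+\w+){$rest}\b))\w+/g;
}

print "$_\n" for word_ngrams_regex('this is the text to play with', 3);
```

The regex engine still has to walk over every word n times inside the lookahead, which is why I wouldn't expect it to beat the split.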

In case you really want to include non-letters, try unpack

unpack would probably be among the fastest solutions for character n-grams indeed.
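
A minimal sketch of what the unpack approach could look like for character n-grams. The helper name char_ngrams is hypothetical, and I'm assuming plain ASCII input; I haven't benchmarked it.

```perl
use strict;
use warnings;

# Hypothetical helper: character n-grams of $str from a single unpack call.
sub char_ngrams {
    my ($str, $n) = @_;
    my $count = length($str) - $n + 1;        # how many n-grams fit
    return () if $n < 1 || $count < 1;
    return unpack '(a)*', $str if $n == 1;    # nothing to step back over
    my $back = $n - 1;
    # a$n reads $n chars, X$back steps back $n-1 chars, repeated $count times
    return unpack "(a${n}X${back})${count}", $str;
}

print "$_\n" for char_ngrams('perlmonks', 3);
```

The template uses "a" rather than "A" so that spaces inside an n-gram are kept instead of being stripped.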


Replies are listed 'Best First'.

Re^3: improving speed in ngrams algorithm (updated)
by LanX (Saint) on Jun 11, 2019 at 12:39 UTC

Seems like I misread the sample code. I saw split // not split / /

That's why I added the NB part saying to exclude whitespace and punctuation (which isn't done in the OP's code).

I haven't run ° it but the code looks broken to me if the split wasn't meant to be per character. @string holding words doesn't make sense to me!

I don't think that you can effectively process a natural language without regex.

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery
FootballPerl is like chess, only without the dice

Update

°) I ran it on my mobile and the output shows that the OP is looking for n words in a row. Hence we both misunderstood his definition of n-gram:

START INDEX: 0 :this is
START INDEX: 1 :is the
START INDEX: 2 :the text
START INDEX: 3 :text to
START INDEX: 4 :to play
START INDEX: 5 :play with
START INDEX: 0 :this is the
START INDEX: 1 :is the text
START INDEX: 2 :the text to
START INDEX: 3 :text to play
START INDEX: 4 :to play with