in reply to improving speed in ngrams algorithm

Worth a try, since the regexp is compiled just once...
use strict;
use warnings;
use feature 'say';

my $sentence = "this is the text to play with";
my $ngramWindow_MIN = 2;
my $ngramWindow_MAX = 3;

my $word = qr/(\b\S+)(?:\s|$)/; # "word" is $1, rather
my $ngram = join '', $word x $ngramWindow_MIN,
    qr/(?:$word)?/ x ( $ngramWindow_MAX - $ngramWindow_MIN );

my $re = qr/$ngram(?{
    say "@{^CAPTURE}"
    # or do anything with @words i.e. @{^CAPTURE}
})(*F)/;

$sentence =~ /$re/g;
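For anyone unfamiliar with the trick, here is a minimal sketch of what the (?{ ... })(*F) tail does, reduced to a single capture. The names (@seen, the "abc" input) are made up for illustration and are not part of the solution above. The package variable is a deliberate choice to stay clear of closure quirks with lexicals inside code blocks in older perls.

```perl
use strict;
use warnings;

our @seen;    # hypothetical collector, illustration only

# (*F) always fails, so after the code block runs the engine backtracks
# into \w+ (shorter match), and once that is exhausted it moves on to the
# next start position. The code block therefore runs once per candidate.
"abc" =~ /(\w+)(?{ push @seen, $1 })(*F)/;

print "$_\n" for @seen;
```

The overall match always fails, so only the side effects of the code block matter. The n-gram regexp relies on the same forced backtracking to visit every window size at every word.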

Update. I noticed that my solution uses a variable, @{^CAPTURE}, that was introduced in 5.25.7. Then I looked up the %- hash, which, going by its description, should provide similar access and has been there since 5.10. It does indeed work if combined in the obvious way with grep to filter out undefined values, but it looks to me like a bug that the length of the array associated with a named capture is not reset. %{^CAPTURE_ALL} should be identical to %-, but it appears to suffer from another, even worse, bug. Tested in 5.28.0.

use strict;
use warnings;
use Data::Dump 'dd';

my $sentence = "this is the text to play with";
my $ngramWindow_MIN = 2;
my $ngramWindow_MAX = 3;

my $word = qr/(?<word>\b\S+)(?:\s|$)/;
my $ngram = join '', $word x $ngramWindow_MIN,
    qr/(?:$word)?/ x ( $ngramWindow_MAX - $ngramWindow_MIN );

my $re = qr/$ngram(?{
    dd \@{^CAPTURE};
    dd $-{word};
    dd ${^CAPTURE_ALL}{word};
    print "\n";
})(*F)/;

$sentence =~ /$re/g;

__END__
["this", "is", "the"]
["this", "is", "the"]
"this"

["this", "is"]
["this", "is", undef]
"this"

["is", "the", "text"]
["is", "the", "text"]
"is"

["is", "the"]
["is", "the", undef]
"is"

["the", "text", "to"]
["the", "text", "to"]
"the"

["the", "text"]
["the", "text", undef]
"the"

["text", "to", "play"]
["text", "to", "play"]
"text"

["text", "to"]
["text", "to", undef]
"text"

["to", "play", "with"]
["to", "play", "with"]
"to"

["to", "play"]
["to", "play", undef]
"to"

["play", "with"]
["play", "with", undef]
"play"
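The grep workaround mentioned above can be sketched as follows. It is only a sketch: the collector @ngrams and the shortened $min/$max names are introduced here for illustration, and it assumes %- behaves as in the 5.28.0 output shown, with undef in the slots of groups that did not take part in the match.

```perl
use strict;
use warnings;

my $sentence = "this is the text to play with";
my ( $min, $max ) = ( 2, 3 );

my $word  = qr/(?<word>\b\S+)(?:\s|$)/;
my $ngram = join '', $word x $min, qr/(?:$word)?/ x ( $max - $min );

our @ngrams;    # hypothetical collector; package var to avoid closure issues

my $re = qr/$ngram(?{
    # $-{word} has one slot per (?<word>...) group; drop the unmatched ones
    push @ngrams, join ' ', grep { defined } @{ $-{word} };
})(*F)/;

$sentence =~ /$re/g;

print "$_\n" for @ngrams;
```

Filtering with grep keeps the result independent of the array length, so the trailing undef slots seen in the dump do not matter here.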

Update 2. See %{^CAPTURE}, %{^CAPTURE_ALL} and %- don't produce expected output.

Re^2: improving speed in ngrams algorithm (updated)
by choroba (Cardinal) on Jun 11, 2019 at 16:30 UTC
    Can you describe the bug using a simpler example, maybe as a new SoPW node? I'm not sure I understand - please include the expected output, too.

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]