in reply to improving speed in ngrams algorithm
use strict;
use warnings;
use feature 'say';

my $sentence = "this is the text to play with";
my $ngramWindow_MIN = 2;
my $ngramWindow_MAX = 3;

my $word = qr/(\b\S+)(?:\s|$)/; # "word" is $1, rather
my $ngram = join '',
    $word x $ngramWindow_MIN,
    qr/(?:$word)?/ x ( $ngramWindow_MAX - $ngramWindow_MIN );

my $re = qr/$ngram(?{
    say "@{^CAPTURE}"
    # or do anything with @words i.e. @{^CAPTURE}
})(*F)/;

$sentence =~ /$re/g;
Update. I noticed that my solution uses a variable, @{^CAPTURE}, that was only introduced in 5.25.7. So I looked up the %- hash which, going by its description, should provide similar access and has been there since 5.10. It does indeed work if combined in the obvious way with grep to filter out undefined values, but it looks to me like a bug that the length of the array associated with a named capture is not reset. %{^CAPTURE_ALL} should be identical to %-, but seems to suffer even more, from yet another bug. Tested in 5.28.0.
use strict;
use warnings;
use Data::Dump 'dd';

my $sentence = "this is the text to play with";
my $ngramWindow_MIN = 2;
my $ngramWindow_MAX = 3;

my $word = qr/(?<word>\b\S+)(?:\s|$)/;
my $ngram = join '',
    $word x $ngramWindow_MIN,
    qr/(?:$word)?/ x ( $ngramWindow_MAX - $ngramWindow_MIN );

my $re = qr/$ngram(?{
    dd \@{^CAPTURE};
    dd $-{word};
    dd ${^CAPTURE_ALL}{word};
    print "\n";
})(*F)/;

$sentence =~ /$re/g;

__END__
["this", "is", "the"]
["this", "is", "the"]
"this"

["this", "is"]
["this", "is", undef]
"this"

["is", "the", "text"]
["is", "the", "text"]
"is"

["is", "the"]
["is", "the", undef]
"is"

["the", "text", "to"]
["the", "text", "to"]
"the"

["the", "text"]
["the", "text", undef]
"the"

["text", "to", "play"]
["text", "to", "play"]
"text"

["text", "to"]
["text", "to", undef]
"text"

["to", "play", "with"]
["to", "play", "with"]
"to"

["to", "play"]
["to", "play", undef]
"to"

["play", "with"]
["play", "with", undef]
"play"
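For perls older than 5.25.7, the grep workaround mentioned in the update might look roughly like the sketch below. It is only a sketch: the `@ngrams` array and the shortened `$min`/`$max` names are introduced here for illustration, and it assumes `$-{word}` behaves inside the code block as in the 5.28.0 output above, with unset trailing captures showing up as `undef`. Collecting into a lexical array from a `(?{ })` block also assumes a reasonably recent perl.

```perl
use strict;
use warnings;

my $sentence = "this is the text to play with";
my ( $min, $max ) = ( 2, 3 );    # n-gram window, as in the post

my $word  = qr/(?<word>\b\S+)(?:\s|$)/;
my $ngram = join '',
    $word x $min,
    qr/(?:$word)?/ x ( $max - $min );

my @ngrams;    # hypothetical collector, not part of the original code
my $re = qr/$ngram(?{
    # $-{word} holds every capture named "word"; drop the undef
    # slots left over from optional groups that did not match
    push @ngrams, join ' ', grep { defined } @{ $-{word} };
})(*F)/;

# (*F) makes every attempt fail, so the engine backtracks through
# all window sizes at every starting position
$sentence =~ /$re/;

print "$_\n" for @ngrams;
```

Going by the `$-{word}` values listed above, filtering this way should yield the same n-grams as `@{^CAPTURE}` does, but I have only the 5.28.0 output to go on.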
Update 2. See %{^CAPTURE}, %{^CAPTURE_ALL} and %- don't produce expected output.
Re^2: improving speed in ngrams algorithm (updated)
by choroba (Cardinal) on Jun 11, 2019 at 16:30 UTC