IB2017 has asked for the wisdom of the Perl Monks concerning the following question:
Given a string of text, I need to create n-grams of predefined lengths. I came up with the following. Any suggestions on how to improve it (speed being an important factor in my process)? The sentence, i.e. the array, will typically contain 5-15 elements.
    use strict;
    use warnings;

    my $sentence = "this is the text to play with";
    my @string = split / /, $sentence;
    my $ngramWindow_MIN = 2;
    my $ngramWindow_MAX = 3;

    for ($ngramWindow_MIN .. $ngramWindow_MAX) {
        my $ngramWindow = $_;
        my $sizeString = (@string) - $ngramWindow;
        foreach (0 .. $sizeString) {
            print "START INDEX: $_ :";
            print "@string[$_..($_+$ngramWindow-1)]\n";
        }
    }
Re: improving speed in ngrams algorithm
by holli (Abbot) on Jun 11, 2019 at 10:52 UTC
holli
You can lead your users to water, but alas, you cannot drown them.
by karlgoethebier (Abbot) on Jun 11, 2019 at 19:33 UTC
“...do it in C though, even for a novice. If it's really time critical I would write it as XS.”
This statement is not good.
«The Crux of the Biscuit is the Apostrophe»
perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'Help
by LanX (Saint) on Jun 11, 2019 at 21:39 UTC
I'm sure there are already ready-to-use C programs available for this.
Cheers Rolf
Re: improving speed in ngrams algorithm (updated)
by vr (Curate) on Jun 11, 2019 at 12:34 UTC
Update. I noticed that my solution uses a variable introduced in 5.25.7. Then I looked up the %- hash which, by its description, should provide similar access and has been there since 5.10. While it does indeed work if combined in the obvious way with grep to filter undefined values, it looks to me like a bug that the length of the array associated with a named capture is not reset. %{^CAPTURE_ALL} should be identical to %-, but seems to suffer from another bug even more. Tested in 5.28.0.
Update 2. See %{^CAPTURE}, %{^CAPTURE_ALL} and %- don't produce expected output.
by choroba (Cardinal) on Jun 11, 2019 at 16:30 UTC
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: improving speed in ngrams algorithm
by Eily (Monsignor) on Jun 11, 2019 at 12:35 UTC
“(being speed an important factor in my process?)”
Are you sure? Did you actually notice that the program was too slow, or do you just think being fast might be a good thing? 5-15 elements per array is very little. So unless you process several thousand strings, micro-optimizations won't have a noticeable impact. And if you do process that many strings and see that the program is slow, there might be better places to improve it than this quite basic code.
Like I said, it won't change much, but one thing you can do is change into
Re: improving speed in ngrams algorithm
by tybalt89 (Monsignor) on Jun 12, 2019 at 09:02 UTC
Benchmarking left to someone who cares :)
Outputs (same lines, slightly different order):
Re: improving speed in ngrams algorithm
by AnomalousMonk (Archbishop) on Jun 11, 2019 at 16:03 UTC
The (?= (pattern)) overlapping capture trick may be useful here.
If you need to capture info like
If it's all you need, it would be faster to capture "naked" n-grams:
Tested under Perl version 5.8.9. (I haven't done any Benchmark-ing on any of this :)
Give a man a fish: <%-{-{-{-<
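The overlapping capture trick mentioned above can be sketched for word n-grams as follows. This is a minimal illustration, not AnomalousMonk's original code: the sub name `ngrams_lookahead`, the use of `\S+` for a "word", and the `[offset, ngram]` return shape are assumptions made for the example.

```perl
use strict;
use warnings;

my $sentence = 'this is the text to play with';

# Hypothetical helper: return [offset, ngram] pairs for every run of
# $n whitespace-separated words in $text.
sub ngrams_lookahead {
    my ($text, $n) = @_;
    my $gap = $n - 1;
    my @found;
    # The lookahead captures $n words without consuming them; the trailing
    # \S+ then consumes a single word, so the next match starts one word on.
    # (?<!\S) keeps a match from starting in the middle of a word.
    while ($text =~ /(?<!\S)(?=(\S+(?:\s+\S+){$gap}))\S+/g) {
        push @found, [ $-[1], $1 ];   # $-[1] is the start offset of capture 1
    }
    return @found;
}

for my $n (2 .. 3) {
    printf "START OFFSET: %2d : %s\n", @$_ for ngrams_lookahead($sentence, $n);
}
```

Note that the offsets here are character offsets into the string, not word indexes as in the original post, and that each n-gram keeps whatever whitespace separated its words in the input.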
Re: improving speed in ngrams algorithm
by karlgoethebier (Abbot) on Jun 11, 2019 at 13:49 UTC
See also Text::Ngrams.
Regards, Karl
«The Crux of the Biscuit is the Apostrophe»
perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'Help
Re: improving speed in ngrams algorithm
by johngg (Canon) on Jun 12, 2019 at 10:21 UTC
A solution using split, array slices and shift. No idea if it is fast or slow as I haven't run any benchmarks.
The output.
I hope this is of interest. Cheers, JohnGG
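For illustration, one way the combination of split, array slices and shift might look is sketched below. It is a reconstruction under assumptions, not johngg's posted code: the sub name `ngrams_shift` and its argument order are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: word n-grams for every window size from $min to $max.
sub ngrams_shift {
    my ($sentence, $min, $max) = @_;
    my @ngrams;
    for my $n ($min .. $max) {
        my @words = split ' ', $sentence;   # fresh copy, consumed below
        while (@words >= $n) {
            # The slice takes the first $n words of what is left...
            push @ngrams, join ' ', @words[0 .. $n - 1];
            # ...and shift slides the window forward by one word.
            shift @words;
        }
    }
    return @ngrams;
}

print "$_\n" for ngrams_shift('this is the text to play with', 2, 3);
```

The word list is split again for each window size because shift consumes it; whether that costs anything measurable on 5-15 words would need a benchmark.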
Re: improving speed in ngrams algorithm (updated)
by LanX (Saint) on Jun 11, 2019 at 11:04 UTC
Please ignore! Misunderstood question. My answer treats n-grams on characters, not words.
A regex should be faster; this demo in the debugger for n=3 should give you a start.
NB:
(I know it's possible in a single regex without looping over start by playing around with \K or similar. I'll leave it to the regex gurus like tybalt to show it ;-) HTH! :)
Cheers Rolf
Update: In case you really want to include non-letters, try unpack.
by Eily (Monsignor) on Jun 11, 2019 at 12:22 UTC
“A regex should be faster”
I would already doubt that a regex is faster than accessing array elements in normal circumstances, but here you seem to have missed the fact that the n-grams are made of words rather than chars. So your regex becomes /((\w+\s?){3})/g, where each char of (part of) the string is checked to find spaces. In IB2017's solution this is done once by the split.
“I know it's possible in a single regex without looping over start by playing around with \K or similar”
Look ahead assertions can help:
But it becomes cumbersome when working with words, /(?=((\w+\s?){3}))\w+/g, and probably not faster.
“In case you want really want to include non-letters try unpack”
unpack would probably be among the fastest solutions for character n-grams indeed.
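For character n-grams, the two ideas discussed here (a capturing lookahead and unpack) can be sketched as below. Neither sub is code from this thread; the names `char_ngrams_regex` and `char_ngrams_unpack` are hypothetical, and the unpack template with an explicit repeat count is just one possible way to get overlapping windows.

```perl
use strict;
use warnings;

# Capturing lookahead: the match itself is empty, so the regex engine
# advances one character at a time and the captured windows overlap.
# Only runs of \w characters are returned.
sub char_ngrams_regex {
    my ($text, $n) = @_;
    return $text =~ /(?=(\w{$n}))/g;
}

# unpack: read $n characters, back up $n - 1, and repeat a fixed number
# of times. Works on any characters, not only \w. Assumes $n >= 2.
sub char_ngrams_unpack {
    my ($text, $n) = @_;
    my $count = length($text) - $n + 1;
    return () if $count < 1;
    my $back = $n - 1;
    return unpack "(a$n X$back)$count", $text;
}

print join(',', char_ngrams_regex('hello', 3)), "\n";
print join(',', char_ngrams_unpack('hello', 3)), "\n";
```

The two subs agree on a single word but differ on text containing spaces or punctuation: the regex version skips any window that contains a non-word character, while the unpack version returns every window.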
by LanX (Saint) on Jun 11, 2019 at 12:39 UTC
I saw split // not split / /. That's why I added the NB part saying to exclude white space and punctuation (which isn't done in the OP's code). I haven't run° it, but the code looks broken to me if the split wasn't meant to be per character. @string holding words doesn't make sense to me! I don't think that you can effectively process a natural language without regex.
Cheers Rolf
Update °) I ran it on my mobile and the output shows that the OP is looking for n words in a row. Hence we both misunderstood his definition of n-gram.
Re: improving speed in ngrams algorithm (benchmark time! kindof)
by vr (Curate) on Jun 20, 2019 at 15:35 UTC
What if, while the working environment is a familiar Perl program, we use something completely different (benchmarks below, vocabulary here, or revised one there)?
Perl:
I pumped the difficulty level up just very slightly; otherwise (as with the example in the OP) it would be ridiculous to optimize what's very fast as is. Note that the J sentence is interpreted every time, so to be fair I should have wrapped the other players in a string eval. I tried to preserve other monks' code while bending it to serve the "array of ngrams, array of indexes" goal, where possible. Sorry if I messed up. As I understand it, modifying the J phrase to work with Unicode text and/or return character offsets would be easy.
Of course my J is absolutely unoptimized, as I'm a total beginner. The moral: there is a very powerful tool, and now I know how to use it from Perl :)
Edit: fixed spelling, sorry.
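Readers who want to compare pure-Perl variants themselves could start from a sketch like the one below, which uses the core Benchmark module. The two subs are illustrative assumptions (one follows the slice approach of the original post, the other is a hypothetical join/map variant); they are not the code or the figures from this post.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @words = split ' ', 'this is the text to play with';

# Nested loops with an interpolated array slice, as in the original post,
# but collecting the n-grams instead of printing them.
sub ngrams_slice {
    my ($words, $min, $max) = @_;
    my @out;
    for my $n ($min .. $max) {
        for my $i (0 .. @$words - $n) {
            push @out, "@{$words}[$i .. $i + $n - 1]";
        }
    }
    return \@out;
}

# Hypothetical alternative: one map per window size, joining each slice.
sub ngrams_join {
    my ($words, $min, $max) = @_;
    my @out;
    for my $n ($min .. $max) {
        my $last = $#$words - $n + 1;
        push @out, map { join ' ', @{$words}[$_ .. $_ + $n - 1] } 0 .. $last;
    }
    return \@out;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    slice => sub { ngrams_slice(\@words, 2, 3) },
    join  => sub { ngrams_join(\@words, 2, 3) },
});
```

With inputs this small the differences are likely to be minor, which is in line with the earlier advice in this thread to measure before optimizing.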