in reply to Create n-grams from tokenized text file
Please note that instead of printing "$n:" you could also push to an @array or print to a filehandle.
The filehandle, arrayref or whatever target you pick could be held in a hash, e.g. $ngrams{$n}.
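A minimal sketch of the hash idea, in case it helps. The helper name collect_ngrams and its argument order are made up for illustration; it runs the same cache loop but pushes each n-gram onto an array ref stored under $ngrams{$n} instead of printing.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original post): collect n-grams
# per size into a hash of array refs instead of printing them.
sub collect_ngrams {
    my ($tokens, $n_min, $n_max) = @_;
    my (%ngrams, @cache);

    for my $token (@$tokens) {
        push @cache, $token;

        for my $n ($n_min .. $n_max) {
            # last n tokens of the cache, joined by a single space
            push @{ $ngrams{$n} }, "@cache[-$n .. -1]"
                if @cache >= $n;
        }

        shift @cache if @cache == $n_max;    # keep cache at max size
    }
    return \%ngrams;
}

my $ngrams = collect_ngrams([qw(One Two Three Four)], 2, 3);
for my $n (sort { $a <=> $b } keys %$ngrams) {
    print "$n: $_\n" for @{ $ngrams->{$n} };
}
```

Swapping the array ref for a filehandle per $n should work the same way: open one handle per size up front and print to $fh{$n} inside the inner loop.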
... this should be among the fastest solutions ¹
Also it's easy to change this from a range to a list of n-gram sizes, just make sure $n_max is still the largest size in the list.
HTH
use strict;
use warnings;

my @cache;
my $n_min = 2;
my $n_max = 6;

while (my $line = <DATA>) {
    chomp $line;
    push @cache, $line;

    for my $n ($n_min .. $n_max) {
        print "$n: @cache[-$n .. -1]\n"    # last n of cache
            if @cache >= $n;
    }

    shift @cache if @cache == $n_max;      # keep cache at max size
}

__DATA__
One
Two
Three
Four
Five
Six
Seven
Eight
Nine
Ten
2: One Two
2: Two Three
3: One Two Three
2: Three Four
3: Two Three Four
4: One Two Three Four
2: Four Five
3: Three Four Five
4: Two Three Four Five
5: One Two Three Four Five
2: Five Six
3: Four Five Six
4: Three Four Five Six
5: Two Three Four Five Six
6: One Two Three Four Five Six
2: Six Seven
3: Five Six Seven
4: Four Five Six Seven
5: Three Four Five Six Seven
6: Two Three Four Five Six Seven
2: Seven Eight
3: Six Seven Eight
4: Five Six Seven Eight
5: Four Five Six Seven Eight
6: Three Four Five Six Seven Eight
2: Eight Nine
3: Seven Eight Nine
4: Six Seven Eight Nine
5: Five Six Seven Eight Nine
6: Four Five Six Seven Eight Nine
2: Nine Ten
3: Eight Nine Ten
4: Seven Eight Nine Ten
5: Six Seven Eight Nine Ten
6: Five Six Seven Eight Nine Ten
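For the "list instead of range" variant mentioned above, something like the following should do. It is only a sketch: ngrams_for_sizes is a hypothetical name, it assumes a non-empty list of sizes, and it derives $n_max with max from the core module List::Util so the cache limit cannot get out of sync with the list.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical helper (not part of the original post): emit n-grams
# only for the listed sizes, e.g. 2 and 4 rather than 2 .. 4.
sub ngrams_for_sizes {
    my ($tokens, @sizes) = @_;
    my $n_max = max @sizes;    # cache must hold the largest n-gram
    my (@cache, @out);

    for my $token (@$tokens) {
        push @cache, $token;

        for my $n (@sizes) {
            push @out, "$n: @cache[-$n .. -1]" if @cache >= $n;
        }

        shift @cache if @cache == $n_max;    # keep cache at max size
    }
    return @out;
}

print "$_\n" for ngrams_for_sizes([qw(One Two Three Four Five)], 2, 4);
```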
Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Je suis Charlie!
¹) well you could have a sliding window and a regex to be even faster ;)
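One possible reading of the regex idea, for a single n and a string of whitespace-separated tokens. This is a guess at what such a solution might look like, not a benchmarked one, and ngrams_regex is a hypothetical name. The capture sits inside a lookahead, so the match consumes nothing and the next n-gram may overlap the previous one.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original post): all n-grams of a
# whitespace-separated string via one global match.
sub ngrams_regex {
    my ($text, $n) = @_;
    my $rest = $n - 1;    # tokens that follow the first one

    # (?<!\S) pins the start to the beginning of a token; the lookahead
    # captures $n tokens without consuming them.
    my @grams = $text =~ /(?<!\S)(?=((?:\S+\s+){$rest}\S+))/g;
    return @grams;
}

print "$_\n" for ngrams_regex('One Two Three Four', 2);
```

Note that the captured n-grams keep whatever whitespace separates the tokens in the input, so join the tokens with single spaces first if you want uniform output.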
Update: simplified the slice range in the code from @cache-$n .. $#cache to -$n .. -1