Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

From a file containing one word per line, I need to create n-grams of 2 (and 3 and 4).

Input is

Hello I am happy

Output should be for n-grams of 2

Hello I
I am
am happy

I'm using the following code (this is for n-grams of 2), which works fine except that a) it doesn't look nice and b) it is quite slow for big input files (especially for n-grams of 4).

my $input  = "input.txt";
my $output = "output.txt";
open (OUT, ">$output") || (die "Can't open $output\n");
open (IN, "$input")    || (die "WARNING: $input not found\n");
my $line = <IN>;
while ($line) {
    chomp $line;
    push(@token, $line);
    $line = <IN>;
}
my $shift      = 2;
my $counter    = 0;
my $TokenTotal = @token;
while ($counter <= ($TokenTotal - $shift)) {
    print OUT "$token[$counter]\t$token[$counter+1]\n";
    $counter++;
    $line = <IN>;
}
close IN;
close OUT;

I am aware there is a Text::ngrams available, but I'd like to have a solution which I can perfectly control. Any suggestion and improvement is appreciated.

Replies are listed 'Best First'.
Re: Create n-grams from tokenized text file
by Marshall (Canon) on May 15, 2016 at 12:07 UTC
    Another possible solution for you... Since no array of all the "line words" is created, this scales linearly with file size.
    #!/usr/bin/perl
    use warnings;
    use strict;

    my @tokens;
    my $n = 2;   #adjust as needed
    while (my $line = <DATA>)
    {
        chomp $line;
        push @tokens, $line;
        next if @tokens < $n;
        print "@tokens\n";
        shift @tokens;
    }
    =prints:
    N=2....
    Hello I
    I am
    am happy
    happy this
    this is
    is the
    the text
    text to
    to play
    play with

    N=3....
    Hello I am
    I am happy
    am happy this
    happy this is
    this is the
    is the text
    the text to
    text to play
    to play with
    =cut
    __DATA__
    Hello
    I
    am
    happy
    this
    is
    the
    text
    to
    play
    with
Re: Create n-grams from tokenized text file
by BrowserUk (Patriarch) on May 15, 2016 at 11:21 UTC

    This assumes you can get the words into an array:

    @x = qw[ this is the text to play with ];;

    $n = 2;
    print @x[ $_ .. $_+$n-1 ] for 0 .. (@x - $n);;
    this is
    is the
    the text
    text to
    to play
    play with

    $n = 3;
    print @x[ $_ .. $_+$n-1 ] for 0 .. (@x - $n);;
    this is the
    is the text
    the text to
    text to play
    to play with

    $n = 4;
    print @x[ $_ .. $_+$n-1 ] for 0 .. (@x - $n);;
    this is the text
    is the text to
    the text to play
    text to play with

    $n = 5;
    print @x[ $_ .. $_+$n-1 ] for 0 .. (@x - $n);;
    this is the text to
    is the text to play
    the text to play with

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Create n-grams from tokenized text file
by haukex (Archbishop) on May 15, 2016 at 12:21 UTC

    Hi Anonymous,

    You're reading your entire file into memory, which isn't necessary because it looks like all you want is a sliding window over a certain number of lines from the file. This can be done by pushing each element onto the end of an array and then shifting elements off the beginning. This can even be done in a one-liner - see perl's -n and -l switches in perlrun:

    $ perl -wnle 'push @x, $_; if(@x>=2) {print "@x"; shift @x}' input.txt
    Hello I
    I am
    am happy
    $ perl -wnle 'push @x, $_; if(@x>=3) {print "@x"; shift @x}' input.txt
    Hello I am
    I am happy

    Another solution might be the core module Tie::File. Note I haven't tested this for efficiency, but since the module doesn't read the entire file into memory this solution should still be better than doing that. The module does have to scan the file once to get the number of records contained within:

    #!/usr/bin/env perl
    use warnings;
    use strict;
    use Tie::File;
    use Fcntl 'O_RDONLY';

    die "Usage: $0 INPUTFILE WINSIZE\n" unless @ARGV==2;
    my ($INPUTFILE,$WINSIZE) = @ARGV;

    tie my @array, 'Tie::File', $INPUTFILE, mode => O_RDONLY;

    $, = " "; $\ = "\n"; # output field/record separators
    for my $i (0..@array-$WINSIZE) {
        print @array[$i..$i+$WINSIZE-1];
    }

    Example usage:

    $ perl window.pl input.txt 2
    Hello I
    I am
    am happy

    Update: My first solution is basically a one-liner version of Marshall and LanX's solutions, and my second solution is quite similar to BrowserUk's solution. Also switched from using print "@...\n" to setting $, and $\.

    Hope this helps,
    -- Hauke D

Re: Create n-grams from tokenized text file (multiple n-grams in parallel)
by LanX (Saint) on May 15, 2016 at 12:41 UTC
    Here is a variation of my former code which allows calculating a range of n-grams simultaneously.

    Please note that instead of printing "$n:" you could also push to an @array or print to a filehandle.

    The filehandle or arrayref or whatever target stream could be held in a hash $ngrams{$n} ....

    ... this should be among the fastest solutions ¹

    Also, it's easy to change this from a range to a list of n-grams; just keep $n_max right.

    HTH

    use strict;
    use warnings;

    my @cache;
    my $n_min = 2;
    my $n_max = 6;

    while (my $line = <DATA>) {
        chomp $line;
        push @cache, $line;
        for my $n ($n_min..$n_max) {
            print "$n: @cache[-$n .. -1]\n"   # last n of cache
                if @cache >= $n;
        }
        shift @cache if @cache == $n_max;     # keep cache at max size
    }

    __DATA__
    One
    Two
    Three
    Four
    Five
    Six
    Seven
    Eight
    Nine
    Ten

    out
    2: One Two
    2: Two Three
    3: One Two Three
    2: Three Four
    3: Two Three Four
    4: One Two Three Four
    2: Four Five
    3: Three Four Five
    4: Two Three Four Five
    5: One Two Three Four Five
    2: Five Six
    3: Four Five Six
    4: Three Four Five Six
    5: Two Three Four Five Six
    6: One Two Three Four Five Six
    2: Six Seven
    3: Five Six Seven
    4: Four Five Six Seven
    5: Three Four Five Six Seven
    6: Two Three Four Five Six Seven
    2: Seven Eight
    3: Six Seven Eight
    4: Five Six Seven Eight
    5: Four Five Six Seven Eight
    6: Three Four Five Six Seven Eight
    2: Eight Nine
    3: Seven Eight Nine
    4: Six Seven Eight Nine
    5: Five Six Seven Eight Nine
    6: Four Five Six Seven Eight Nine
    2: Nine Ten
    3: Eight Nine Ten
    4: Seven Eight Nine Ten
    5: Six Seven Eight Nine Ten
    6: Five Six Seven Eight Nine Ten
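    The $ngrams{$n} idea mentioned above (one collection target per n-gram size) could be sketched like this. A minimal sketch, not the posted code: the %ngrams hash, the use of plain arrayrefs as targets, and the final output format are my own choices for illustration.

    ```shell
    printf '%s\n' One Two Three Four | perl -Mstrict -Mwarnings -e '
    my @cache;
    my ( $n_min, $n_max ) = ( 2, 3 );
    my %ngrams;    # $ngrams{$n} holds an arrayref collecting the n-grams of size $n
    while ( my $line = <STDIN> ) {
        chomp $line;
        push @cache, $line;
        for my $n ( $n_min .. $n_max ) {
            push @{ $ngrams{$n} }, "@cache[-$n .. -1]" if @cache >= $n;
        }
        shift @cache if @cache == $n_max;   # keep cache at max size
    }
    # report each size on its own line; a filehandle could replace the arrayref here
    for my $n ( sort { $a <=> $b } keys %ngrams ) {
        print "$n-grams: ", join( " | ", @{ $ngrams{$n} } ), "\n";
    }'
    ```

    This prints one line per size, e.g. "2-grams: One Two | Two Three | Three Four"; swapping the arrayref for an open filehandle would stream each size to its own file instead.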

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

    ¹) well you could have a sliding window and a regex to be even faster ;)

    update

    simplified range code from @cache-$n ..$#cache to -$n .. -1
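    The regex idea from footnote ¹ is not shown in the thread; here is one way it could look (my sketch, not LanX's code): join the tokens into a single string and use a zero-width lookahead so successive matches overlap.

    ```shell
    printf '%s\n' Hello I am happy | perl -Mstrict -Mwarnings -e '
    my $n = shift // 2;
    chomp( my @words = <STDIN> );
    my $text = join " ", @words;    # the whole token list as one space-separated string
    my $k = $n - 1;
    # consume one word per match, but only *look ahead* at the next $k words,
    # so the next match starts one word later and the windows overlap
    while ( $text =~ /(\S+)(?=((?: \S+){$k}))/g ) {
        print "$1$2\n";
    }' 2
    ```

    With the argument 2 this prints "Hello I", "I am", "am happy", matching the sliding-window solutions above.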

Re: Create n-grams from tokenized text file
by LanX (Saint) on May 15, 2016 at 12:19 UTC
    I don't really understand your code ...

    Here is a variation which avoids reading the whole file in at once, reading it line by line instead:

    use strict;
    use warnings;

    my @cache;
    my $n = 4;

    while (my $line = <DATA>) {
        chomp $line;
        push @cache, $line;
        if (@cache >= $n) {
            print "@cache\n";
            shift @cache;
        }
    }

    __DATA__
    One
    Two
    Three
    Four
    Five
    Six
    Seven
    Eight
    Nine
    Ten

    (Marshall++ was faster)

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

Re: Create n-grams from tokenized text file
by johngg (Canon) on May 15, 2016 at 15:15 UTC

    Another way: read all lines into an array and chomp them, then print the join of the shifted first word and a slice of however many subsequent words are required.

    $ perl -Mstrict -Mwarnings -E '
    open my $inFH, q{<}, \ <<EOF or die $!;
    The
    quick
    brown
    fox
    jumps
    over
    the
    lazy
    dog
    EOF
    chomp( my @words = <$inFH> );
    close $inFH or die $!;
    my $n = shift || 2;
    say join q{ }, shift( @words ), @words[ 0 .. $n - 2 ]
        while scalar @words >= $n;'
    The quick
    quick brown
    brown fox
    fox jumps
    jumps over
    over the
    the lazy
    lazy dog
    $
    $ perl -Mstrict -Mwarnings -E '
    open my $inFH, q{<}, \ <<EOF or die $!;
    The
    quick
    brown
    fox
    jumps
    over
    the
    lazy
    dog
    EOF
    chomp( my @words = <$inFH> );
    close $inFH or die $!;
    my $n = shift || 2;
    say join q{ }, shift( @words ), @words[ 0 .. $n - 2 ]
        while scalar @words >= $n;' 4
    The quick brown fox
    quick brown fox jumps
    brown fox jumps over
    fox jumps over the
    jumps over the lazy
    over the lazy dog
    $

    I hope this is of interest.

    Cheers,

    JohnGG

Re: Create n-grams from tokenized text file
by Anonymous Monk on May 15, 2016 at 13:03 UTC

    Monks, thank you so much! All ideas are good and miles better than mine!