Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

From a file containing one word per line, I need to create n-grams of 2 (and 3 and 4).

Input is

Hello I am happy

Output should be for n-grams of 2

Hello I
I am
am happy

I'm using the following code (this is for n-grams of 2), which works fine except that a) it doesn't look nice and b) it is quite slow for big input files (especially for n-grams of 4).

my $input  = "input.txt";
my $output = "output.txt";
open (OUT, ">$output") || (die "Can't open $output\n");
open (IN, "$input")    || (die "WARNING: $input not found\n");
my $line = <IN>;
while ($line) {
    chomp $line;
    push(@token, $line);
    $line = <IN>;
}
my $shift      = 2;
my $counter    = 0;
my $TokenTotal = @token;
while ($counter <= ($TokenTotal - $shift)) {
    print OUT "$token[$counter]\t$token[$counter+1]\n";
    $counter++;
    $line = <IN>;
}
close IN;
close OUT;

I am aware there is a Text::ngrams available, but I'd like to have a solution which I can perfectly control. Any suggestion and improvement is appreciated.

Replies are listed 'Best First'.
Re: Create n-grams from tokenized text file
by Marshall (Canon) on May 15, 2016 at 12:07 UTC
    Another possible solution for you... Since no array of all the "line words" is created, this scales linearly with file size.
    #!/usr/bin/perl
    use warnings;
    use strict;

    my @tokens;
    my $n = 2;   #adjust as needed
    while (my $line = <DATA>)
    {
        chomp $line;
        push @tokens, $line;
        next if @tokens < $n;
        print "@tokens\n";
        shift @tokens;
    }
    =prints:
    N=2....
    Hello I
    I am
    am happy
    happy this
    this is
    is the
    the text
    text to
    to play
    play with

    N=3....
    Hello I am
    I am happy
    am happy this
    happy this is
    this is the
    is the text
    the text to
    text to play
    to play with
    =cut
    __DATA__
    Hello
    I
    am
    happy
    this
    is
    the
    text
    to
    play
    with
Re: Create n-grams from tokenized text file
by BrowserUk (Patriarch) on May 15, 2016 at 11:21 UTC

    This assumes you can get the words into an array:

    @x = qw[ this is the text to play with ];;

    $n = 2;
    print @x[ $_ .. $_+$n-1 ] for 0 .. (@x - $n);;
    this is
    is the
    the text
    text to
    to play
    play with

    $n = 3;
    print @x[ $_ .. $_+$n-1 ] for 0 .. (@x - $n);;
    this is the
    is the text
    the text to
    text to play
    to play with

    $n = 4;
    print @x[ $_ .. $_+$n-1 ] for 0 .. (@x - $n);;
    this is the text
    is the text to
    the text to play
    text to play with

    $n = 5;
    print @x[ $_ .. $_+$n-1 ] for 0 .. (@x - $n);;
    this is the text to
    is the text to play
    the text to play with

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Create n-grams from tokenized text file
by haukex (Archbishop) on May 15, 2016 at 12:21 UTC

    Hi Anonymous,

    You're reading your entire file into memory, which isn't necessary because it looks like all you want is a sliding window over a certain number of lines from the file. This can be done by pushing each element onto the end of an array and then shifting elements off the beginning. This can even be done in a one-liner - see perl's -n and -l switches in perlrun:

    $ perl -wnle 'push @x, $_; if(@x>=2) {print "@x"; shift @x}' input.txt
    Hello I
    I am
    am happy
    $ perl -wnle 'push @x, $_; if(@x>=3) {print "@x"; shift @x}' input.txt
    Hello I am
    I am happy

    Another solution might be the core module Tie::File. Note I haven't tested this for efficiency, but since the module doesn't read the entire file into memory this solution should still be better than doing that. The module does have to scan the file once to get the number of records contained within:

    #!/usr/bin/env perl
    use warnings;
    use strict;
    use Tie::File;
    use Fcntl 'O_RDONLY';

    die "Usage: $0 INPUTFILE WINSIZE\n" unless @ARGV==2;
    my ($INPUTFILE,$WINSIZE) = @ARGV;

    tie my @array, 'Tie::File', $INPUTFILE, mode => O_RDONLY;

    $, = " "; $\ = "\n"; # output field/record separators
    for my $i (0..@array-$WINSIZE) {
        print @array[$i..$i+$WINSIZE-1];
    }

    Example usage:

    $ perl window.pl input.txt 2
    Hello I
    I am
    am happy

    Update: My first solution is basically a one-liner version of Marshall and LanX's solutions, and my second solution is quite similar to BrowserUk's solution. Also switched from using print "@...\n" to setting $, and $\.

    Hope this helps,
    -- Hauke D

Re: Create n-grams from tokenized text file (multiple n-grams in parallel)
by LanX (Saint) on May 15, 2016 at 12:41 UTC
    Here is a variation of my former code which allows calculating a range of n-grams simultaneously.

    Please note that instead of printing "$n:" you could also push to an @array or print to a filehandle.

    The filehandle or arrayref or whatever target stream could be held in a hash $ngrams{$n} ....

    ... this should be among the fastest solutions ¹

    Also, it's easy to change this from a range to a list of n-grams; just keep $n_max right.

    HTH

    use strict;
    use warnings;

    my @cache;
    my $n_min = 2;
    my $n_max = 6;

    while (my $line = <DATA>) {
        chomp $line;
        push @cache, $line;
        for my $n ($n_min..$n_max) {
            print "$n: @cache[-$n .. -1]\n"   # last n of cache
                if @cache >= $n;
        }
        shift @cache if @cache == $n_max;     # keep cache at max size
    }

    __DATA__
    One
    Two
    Three
    Four
    Five
    Six
    Seven
    Eight
    Nine
    Ten

    out
    2: One Two
    2: Two Three
    3: One Two Three
    2: Three Four
    3: Two Three Four
    4: One Two Three Four
    2: Four Five
    3: Three Four Five
    4: Two Three Four Five
    5: One Two Three Four Five
    2: Five Six
    3: Four Five Six
    4: Three Four Five Six
    5: Two Three Four Five Six
    6: One Two Three Four Five Six
    2: Six Seven
    3: Five Six Seven
    4: Four Five Six Seven
    5: Three Four Five Six Seven
    6: Two Three Four Five Six Seven
    2: Seven Eight
    3: Six Seven Eight
    4: Five Six Seven Eight
    5: Four Five Six Seven Eight
    6: Three Four Five Six Seven Eight
    2: Eight Nine
    3: Seven Eight Nine
    4: Six Seven Eight Nine
    5: Five Six Seven Eight Nine
    6: Four Five Six Seven Eight Nine
    2: Nine Ten
    3: Eight Nine Ten
    4: Seven Eight Nine Ten
    5: Six Seven Eight Nine Ten
    6: Five Six Seven Eight Nine Ten
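    The $ngrams{$n} idea mentioned above (one collection target per n-gram size) could be sketched like this. A minimal sketch, not the posted code: the %ngrams hash, the use of plain arrayrefs as targets, and the final output format are my own choices for illustration.

    ```shell
    printf '%s\n' One Two Three Four | perl -Mstrict -Mwarnings -e '
    my @cache;
    my ( $n_min, $n_max ) = ( 2, 3 );
    my %ngrams;    # $ngrams{$n} holds an arrayref collecting the n-grams of size $n
    while ( my $line = <STDIN> ) {
        chomp $line;
        push @cache, $line;
        for my $n ( $n_min .. $n_max ) {
            push @{ $ngrams{$n} }, "@cache[-$n .. -1]" if @cache >= $n;
        }
        shift @cache if @cache == $n_max;   # keep cache at max size
    }
    # report each size on its own line; a filehandle could replace the arrayref here
    for my $n ( sort { $a <=> $b } keys %ngrams ) {
        print "$n-grams: ", join( " | ", @{ $ngrams{$n} } ), "\n";
    }'
    ```

    This prints one line per size, e.g. "2-grams: One Two | Two Three | Three Four"; swapping the arrayref for an open filehandle would stream each size to its own file instead.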

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

    ¹) well you could have a sliding window and a regex to be even faster ;)

    update

    simplified range code from @cache-$n ..$#cache to -$n .. -1
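    The regex idea from footnote ¹ is not shown in the thread; here is one way it could look (my sketch, not LanX's code): join the tokens into a single string and use a zero-width lookahead so successive matches overlap.

    ```shell
    printf '%s\n' Hello I am happy | perl -Mstrict -Mwarnings -e '
    my $n = shift // 2;
    chomp( my @words = <STDIN> );
    my $text = join " ", @words;    # the whole token list as one space-separated string
    my $k = $n - 1;
    # consume one word per match, but only *look ahead* at the next $k words,
    # so the next match starts one word later and the windows overlap
    while ( $text =~ /(\S+)(?=((?: \S+){$k}))/g ) {
        print "$1$2\n";
    }' 2
    ```

    With the argument 2 this prints "Hello I", "I am", "am happy", matching the sliding-window solutions above.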

Re: Create n-grams from tokenized text file
by LanX (Saint) on May 15, 2016 at 12:19 UTC
    I don't really understand your code ...

    Here is a variation which avoids reading the whole file in at once, reading it line by line instead:

    use strict;
    use warnings;

    my @cache;
    my $n = 4;

    while (my $line = <DATA>) {
        chomp $line;
        push @cache, $line;
        if (@cache >= $n) {
            print "@cache\n";
            shift @cache;
        }
    }

    __DATA__
    One
    Two
    Three
    Four
    Five
    Six
    Seven
    Eight
    Nine
    Ten

    (Marshall++ was faster)

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

Re: Create n-grams from tokenized text file
by johngg (Canon) on May 15, 2016 at 15:15 UTC

    Another way: read all lines into an array and chomp them, then print the join of the shifted first word and a slice of however many subsequent words are required.

    $ perl -Mstrict -Mwarnings -E '
    open my $inFH, q{<}, \ <<EOF or die $!;
    The
    quick
    brown
    fox
    jumps
    over
    the
    lazy
    dog
    EOF
    chomp( my @words = <$inFH> );
    close $inFH or die $!;
    my $n = shift || 2;
    say join q{ }, shift( @words ), @words[ 0 .. $n - 2 ]
        while scalar @words >= $n;'
    The quick
    quick brown
    brown fox
    fox jumps
    jumps over
    over the
    the lazy
    lazy dog
    $
    $ perl -Mstrict -Mwarnings -E '
    open my $inFH, q{<}, \ <<EOF or die $!;
    The
    quick
    brown
    fox
    jumps
    over
    the
    lazy
    dog
    EOF
    chomp( my @words = <$inFH> );
    close $inFH or die $!;
    my $n = shift || 2;
    say join q{ }, shift( @words ), @words[ 0 .. $n - 2 ]
        while scalar @words >= $n;' 4
    The quick brown fox
    quick brown fox jumps
    brown fox jumps over
    fox jumps over the
    jumps over the lazy
    over the lazy dog
    $

    I hope this is of interest.

    Cheers,

    JohnGG

Re: Create n-grams from tokenized text file
by Anonymous Monk on May 15, 2016 at 13:03 UTC

    Monks, thank you so much! All ideas are good and miles better than mine!