Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
From a file containing one word per line, I need to create n-grams of 2 (and 3 and 4).
Input is

    Hello
    I
    am
    happy

Output should be for n-grams of 2

    Hello I
    I am
    am happy
I'm using the following code (this is for n-grams of 2), which works fine except that a) it doesn't look nice and b) it is quite slow for big input files (especially for n-grams of 4):
    my $input  = "input.txt";
    my $output = "output.txt";
    open(OUT, ">$output") || die "Can't open $output\n";
    open(IN, "$input")    || die "WARNING: $input not found\n";

    # Read all tokens (one per line) into memory.
    my @token;
    my $line = <IN>;
    while ($line) {
        chomp $line;
        push(@token, $line);
        $line = <IN>;
    }

    # Print each token together with its successor (n-grams of 2).
    my $shift      = 2;
    my $counter    = 0;
    my $TokenTotal = @token;
    while ($counter <= ($TokenTotal - $shift)) {
        print OUT "$token[$counter]\t$token[$counter+1]\n";
        $counter++;
        $line = <IN>;
    }
    close IN;
    close OUT;
I am aware that Text::Ngrams is available, but I'd like to have a solution that I can fully control. Any suggestions and improvements are appreciated.
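For reference, the general n-gram case can be handled with a sliding window that keeps only the last n tokens, so the whole file never has to sit in memory. The following is only a minimal sketch of that idea, not a tuned solution: the helper name `write_ngrams` is hypothetical, skipping blank lines is an assumption, and in-memory filehandles stand in for input.txt and output.txt.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): read one token per line
# from $in and write every n-gram of size $n to $out, tab-separated.
sub write_ngrams {
    my ($n, $in, $out) = @_;
    my @window;    # holds at most the $n most recent tokens
    while (defined(my $token = <$in>)) {
        chomp $token;
        next if $token eq '';             # assumption: blank lines are not tokens
        push @window, $token;
        shift @window if @window > $n;    # drop the oldest token
        print {$out} join("\t", @window), "\n" if @window == $n;
    }
    return;
}

# Usage: in-memory filehandles stand in for input.txt / output.txt.
my $text = "Hello\nI\nam\nhappy\n";
open my $in, '<', \$text or die "Can't read in-memory input: $!";
my $result = '';
open my $out, '>', \$result or die "Can't write in-memory output: $!";
write_ngrams(2, $in, $out);
close $in;
close $out;
print $result;
```

Calling the helper with 3 or 4 instead of 2 produces the longer n-grams without any other change. Testing with `defined` also keeps a token such as `0` from ending the read loop early.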
Replies are listed 'Best First'.
Re: Create n-grams from tokenized text file
by Marshall (Canon) on May 15, 2016 at 12:07 UTC

Re: Create n-grams from tokenized text file
by BrowserUk (Patriarch) on May 15, 2016 at 11:21 UTC

Re: Create n-grams from tokenized text file
by haukex (Archbishop) on May 15, 2016 at 12:21 UTC

Re: Create n-grams from tokenized text file (multiple n-grams in parallel)
by LanX (Saint) on May 15, 2016 at 12:41 UTC

Re: Create n-grams from tokenized text file
by LanX (Saint) on May 15, 2016 at 12:19 UTC

Re: Create n-grams from tokenized text file
by johngg (Canon) on May 15, 2016 at 15:15 UTC

Re: Create n-grams from tokenized text file
by Anonymous Monk on May 15, 2016 at 13:03 UTC