From a file containing one word per line, I need to create n-grams of 2 (and 3 and 4).
Input is:

    Hello
    I
    am
    happy

Output for n-grams of 2 should be:

    Hello I
    I am
    am happy
I'm using the following code (this is for n-grams of 2), which works fine except that a) it doesn't look nice and b) it is quite slow for big input files (especially for n-grams of 4):

    my $input  = "input.txt";
    my $output = "output.txt";
    open (OUT, ">$output") || (die "Can't open $output\n");
    open (IN, "$input") || (die "WARNING: $input not found\n");

    # Read every token (one per line) into memory
    my @token;
    my $line = <IN>;
    while ($line) {
        chomp $line;
        push(@token, $line);
        $line = <IN>;
    }

    # Print each token together with its successor
    my $shift = 2;
    my $counter = 0;
    my $TokenTotal = @token;
    while ($counter <= ($TokenTotal - $shift)) {
        print OUT "$token[$counter]\t$token[$counter+1]\n";
        $counter++;
    }
    close IN;
    close OUT;

I am aware that Text::Ngrams is available, but I'd like a solution that I can fully control. Any suggestions and improvements are appreciated.
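
For comparison, here is a minimal sketch of a sliding-window approach that keeps only the last n tokens in memory instead of the whole file, and works for any n. It assumes one token per line and skips blank lines. The sub name `print_ngrams` and the in-memory filehandles are illustrative only and are not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: read one token per line from $in and write every
# n-gram of size $n to $out, tab-separated, one n-gram per line.
sub print_ngrams {
    my ($in, $out, $n) = @_;
    my @window;                          # holds at most $n tokens
    while (defined(my $token = <$in>)) {
        chomp $token;
        next if $token eq '';            # assumption: blank lines are skipped
        push @window, $token;
        shift @window if @window > $n;   # drop the oldest token
        print {$out} join("\t", @window), "\n" if @window == $n;
    }
    return;
}

# Usage: in-memory filehandles stand in for input.txt and output.txt
my $text   = "Hello\nI\nam\nhappy\n";
my $result = '';
open my $in,  '<', \$text   or die "Can't read input: $!";
open my $out, '>', \$result or die "Can't write output: $!";
print_ngrams($in, $out, 2);
close $in;
close $out;
print $result;
```

With real files, the two `open` calls would take the file names instead of scalar references. Calling the sub with 3 or 4 as the last argument gives the larger n-grams without any other change.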
In reply to Create n-grams from tokenized text file by Anonymous Monk