Re: Fasten up NGram generation code?

Don't know about faster, but I think this is more elegant:
++$$ngram{$_} for grep defined, "_${word}_" =~ /(?=(((((.).).).).))|(?=((((.).).).))| (?=(((.).).))|(?=((.).))|(?=(.))/gx;
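Update: Perhaps I had better say: dog slow, but pretty. Thanks for the benchmark, Abigail. The version below adds a trailing . to each alternative so that every match consumes a character, which avoids the zero-length-match (MINMATCH) handling, but it is only slightly faster.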

$ngram_y2 = {};
for my $word (@large_wordlist) {
  ++ $$ngram_y2 {$_} for grep defined,
  "_${word}_" =~ / (?=(((((.).).).).)).
                |    (?=((((.).).).)) .
                |     (?=(((.).).))   .
                |      (?=((.).))     .
                |       (?=(.))       ./gx;
}

Re: Fasten up NGram generation code? by Abigail-II (Bishop) on Jan 09, 2004 at 09:18 UTC
More elegant? I have to disagree with that. Your regex is awkward to modify for n-grams of different sizes. What if you want to count all substrings up to a length of 10? Or what if you want to count all substrings? Abigail
Re: Re: Fasten up NGram generation code? by ysth (Canon) on Jan 09, 2004 at 16:41 UTC
To clarify, I meant more elegant than the OP, not than your regex.
Re: Fasten up NGram generation code? by Abigail-II (Bishop) on Jan 09, 2004 at 20:43 UTC
Yes, I was assuming you meant that. I still find the OP's approach easier to modify to include different substring lengths. But I'm quick with cut-and-paste. Abigail
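
To illustrate the point about differing substring lengths, here is a minimal sketch of a substr-based counter with the maximum length as a parameter. It is not the OP's code or Abigail's regex (neither appears in this part of the thread). The sub name count_ngrams and its arguments are made up for illustration. It assumes the same "_${word}_" padding as the regex above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): count every substring of
# length 1 .. $max_len in "_${word}_", adding to the hash ref $counts.
sub count_ngrams {
    my ($counts, $word, $max_len) = @_;
    my $padded = "_${word}_";
    my $total  = length $padded;
    for my $start (0 .. $total - 1) {
        # Never read past the end of the padded word.
        my $longest = $total - $start;
        $longest = $max_len if $max_len < $longest;
        ++$counts->{ substr($padded, $start, $_) } for 1 .. $longest;
    }
    return $counts;
}

# With $max_len = 5 this should count the same substrings that the
# five-alternative regex captures.
my $ngram = count_ngrams({}, 'ab', 5);
print "$_ => $ngram->{$_}\n" for sort keys %$ngram;
```

Counting substrings up to length 10 then means passing 10 instead of 5, and passing length("_${word}_") counts all substrings. No speed claim is made here; only a benchmark would tell.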