in reply to Fasten up NGram generation code?

Don't know about faster, but I think this is more elegant:
++$$ngram{$_} for grep defined, "_${word}_" =~ /(?=(((((.).).).).))|(?=((((.).).).))| (?=(((.).).))|(?=((.).))|(?=(.))/gx;
Update: Perhaps I had better say: dog slow, but pretty. Thanks for the benchmark, Abigail. This version adds a consuming . after each lookahead, so no match is zero-length and the MINMATCH handling is avoided. It is only slightly faster.
$ngram_y2 = {};
for my $word (@large_wordlist) {
    ++$$ngram_y2{$_} for grep defined, "_${word}_" =~ /
        (?=(((((.).).).).)) . |
        (?=((((.).).).))    . |
        (?=(((.).).))       . |
        (?=((.).))          . |
        (?=(.))             .
    /gx;
}
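For anyone wanting to check what the regex actually collects: at each position of "_${word}_" it captures every substring of length 1 to 5 that starts there. A minimal sketch of the same count with a plain substr loop follows. The names count_ngrams and count_ngrams_regex are made up for illustration, and the loop is not necessarily what the original poster used.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): count every substring of
# length 1 .. $max in "_${word}_" using a plain substr loop.
sub count_ngrams {
    my ($ngram, $word, $max) = @_;
    my $padded = "_${word}_";
    my $len    = length $padded;
    for my $start (0 .. $len - 1) {
        for my $n (1 .. $max) {
            last if $start + $n > $len;    # substring would run past the end
            ++$ngram->{ substr($padded, $start, $n) };
        }
    }
    return $ngram;
}

# The lookahead regex from the post, wrapped in a sub for comparison.
sub count_ngrams_regex {
    my ($ngram, $word) = @_;
    ++$ngram->{$_} for grep defined, "_${word}_" =~
        /(?=(((((.).).).).))|(?=((((.).).).))|(?=(((.).).))|(?=((.).))|(?=(.))/gx;
    return $ngram;
}
```

Called as count_ngrams({}, 'hello', 5) and count_ngrams_regex({}, 'hello'), the two should return the same counts for words without newlines. The grep defined is needed in the regex version because only one alternative matches at a time, and the capture groups of the others come back as undef.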

Re: Fasten up NGram generation code?
by Abigail-II (Bishop) on Jan 09, 2004 at 09:18 UTC
    More elegant? I do have to disagree with that. Your regex is awkward to modify for ngrams of different sizes. What if you want to count all substrings up to a length of 10? Or what if you want to count all substrings?

    Abigail

      To clarify, I meant more elegant than the OP, not than your regex.
        Yes, I was assuming you meant that. I still find the OP's approach easier to modify to include different substring lengths.

        But I'm quick with cut-and-paste.

        Abigail
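
On the point about different substring lengths: one way around the cut-and-paste is to generate the nested-lookahead pattern from the maximum length. This is only a sketch, and ngram_regex and count_ngrams_upto are hypothetical names, not code from anyone in this thread. It builds the variant with the consuming dot after each lookahead.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): build the nested-lookahead
# pattern for substrings of length 1 .. $max, longest alternative first.
sub ngram_regex {
    my ($max) = @_;
    my @alts;
    for my $n (reverse 1 .. $max) {
        # $n == 3 yields "(?=(((.).).))." : three nested captures, then a
        # consuming dot so that no match is zero-length.
        push @alts, '(?=' . ('(' x $n) . '.' . (').' x ($n - 1)) . ')).';
    }
    my $pattern = join '|', @alts;
    return qr/$pattern/;
}

# Hypothetical helper: count n-grams of up to $max characters per word.
sub count_ngrams_upto {
    my ($max, @words) = @_;
    my $re = ngram_regex($max);
    my %count;
    for my $word (@words) {
        ++$count{$_} for grep defined, "_${word}_" =~ /$re/g;
    }
    return \%count;
}
```

count_ngrams_upto(10, @large_wordlist) would then cover substrings up to length 10. For all substrings, the maximum has to be at least the length of the longest padded word. The number of capture groups grows quadratically with the maximum (55 groups for a maximum of 10), so this is unlikely to help with speed.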