in reply to Conversation Pools

Perhaps this would work better as a simple stream editor, so that specialized filters can then be run against the resulting histogram in the time domain?

This way, even the stop list becomes just another kind of high-frequency filter, alongside foreign-word or buzzword filters.

    use strict;
    use warnings;

    my $threshold = 100;   # cutoff: print only words seen more than this many times
    my %count;

    while (<>) {
        chomp;
        s/[^A-Za-z\s]//g;             # keep letters and whitespace only
        $count{$_}++ for split ' ';   # split ' ' skips leading and repeated whitespace
    }

    while (my ($word, $n) = each %count) {
        print " $n : $word\n" if $n > $threshold;
    }

While the time domain is statistically challenging, the text parsing seems so well suited to Perl.
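
To make the filter idea a bit more concrete, here is a rough sketch of a stop list and a count threshold as interchangeable filters over the word histogram. The helper names (stop_filter, threshold_filter, apply_filters) and the sample counts are made up for illustration; a foreign-word or buzzword filter would presumably follow the same shape.

```perl
use strict;
use warnings;

# Hypothetical helpers: each filter takes a hash ref of word counts
# and returns a new hash ref with some words removed.

# Drop every word that appears in the given stop list.
sub stop_filter {
    my %stop = map { $_ => 1 } @_;
    return sub {
        my ($hist) = @_;
        my @keep = grep { !$stop{$_} } keys %$hist;
        my %out;
        @out{@keep} = @{$hist}{@keep};
        return \%out;
    };
}

# Keep only words seen more than $min times.
sub threshold_filter {
    my ($min) = @_;
    return sub {
        my ($hist) = @_;
        my @keep = grep { $hist->{$_} > $min } keys %$hist;
        my %out;
        @out{@keep} = @{$hist}{@keep};
        return \%out;
    };
}

# Run the histogram through each filter in order.
sub apply_filters {
    my ($hist, @filters) = @_;
    $hist = $_->($hist) for @filters;
    return $hist;
}

# Sample counts, invented for this example.
my %count = (the => 500, and => 320, perl => 140, histogram => 12);

my $kept = apply_filters(\%count,
    stop_filter(qw(the and)),
    threshold_filter(100));

print " $kept->{$_} : $_\n" for sort keys %$kept;   # prints " 140 : perl"
```

Since every filter has the same signature, adding or reordering them only changes the argument list passed to apply_filters.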