Re: Conversation Pools

Perhaps this would work better as a simple stream editor, so that you can then write specialized filters to run against the resulting histogram in the time domain?

This way, even the stop list becomes just another kind of high-frequency pass filter, alongside foreign-word or buzzword filters.

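use strict;
use warnings;

my $threshold = 100;   # print only words seen more than this many times
my %count;
while (<>) {
  chomp;
  s/[^A-Za-z\s]//g;    # letters and whitespace only; [A-z] also lets [ \ ] ^ _ ` through
  foreach my $word (split ' ') {   # split ' ' collapses runs of whitespace and skips leading blanks
    $count{$word}++;               # count every occurrence, including the first
  }
}
# most frequent words first
foreach my $word (sort { $count{$b} <=> $count{$a} } keys %count) {
  last if $count{$word} <= $threshold;
  print " $count{$word} : $word \n";
}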

While the time domain is statistically challenging, the text parsing seems ripe for Perl.
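The filter idea above can be sketched roughly as follows. This is only a minimal illustration, assuming the histogram is a plain hash of word => count; the names apply_filters, stop_list and min_count are made up for this sketch, and the counts are invented sample data.

```perl
use strict;
use warnings;

# Keep only the histogram entries that every filter accepts.
# Each filter is a code ref called as $filter->($word, $count).
sub apply_filters {
    my ($hist, @filters) = @_;
    my %kept;
    WORD: for my $word (keys %$hist) {
        for my $filter (@filters) {
            next WORD unless $filter->($word, $hist->{$word});
        }
        $kept{$word} = $hist->{$word};
    }
    return \%kept;
}

# Hypothetical filter builders (names are illustrative only).
# A stop list rejects any word it contains, ignoring case.
sub stop_list {
    my %stop = map { lc($_) => 1 } @_;
    return sub { my ($word) = @_; return !$stop{ lc $word } };
}

# A count filter accepts words seen more than $threshold times.
sub min_count {
    my ($threshold) = @_;
    return sub { my ($word, $count) = @_; return $count > $threshold };
}

# Made-up counts, for illustration only.
my %hist = (the => 950, and => 610, perl => 140, regex => 120, camel => 3);

my $kept = apply_filters(\%hist, stop_list(qw(the and)), min_count(100));
for my $word (sort { $kept->{$b} <=> $kept->{$a} } keys %$kept) {
    print " $kept->{$word} : $word \n";
}
```

With the sample counts, only perl and regex survive both filters. A foreign-word or buzzword filter would be just another code ref passed to apply_filters, which is the appeal of treating the stop list the same way.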
