astrobio has asked for the wisdom of the Perl Monks concerning the following question:

Based on today's Cornell University release on a method that might detect bursts of frequently used terrorist terms (among other word scans), what is a Perl way to speed up this code?

The source data is the US Presidential State of the Union addresses. The goal is to find 'bursts' of frequently used terms - a dynamic histogram - to cluster around as new words get introduced into the conversation.

The Cornell researcher proposes this as a method of searching within blogs or Usenet-style conversations as hot topics come into circulation.
#!/usr/bin/perl
# Find most frequent signals amidst political noise
$|=1;

@noise=qq/ the of and to in a that for be is our by it which as this
  with have we has i will are on been not their from at all an its or
  was but should they these such can upon other so them may any made
  must than there were under those who if only us his my most had into
  every some between during shall when own more would you without many
  also over before well what while through both within being your
  could about each where still among after since further /;
$stops=join('',@noise);

# single file of State of Union addresses
open(IN,"<soufile.txt");
foreach(<IN>){
  chomp;
  # clean left-overs
  s/^\s+|\s+$//g;
  s/[^A-z\s]//g;
  s/&(.*?);//g;
  s/\[|\]//g;
  s/\_//g;
  s/\`//g;
  s/\\//g;
  s/\s+/ /g;
  @words=split(/ /,$_);
  foreach(@words){
    $word=lc($_);
    push(@total,$word);
    if($seen{$word} !=1){
      push(@unique,$word);
      $seen{$word}=1;
    } else {
      $count{$word}=$count{$word}+1;
    }
  }
}

@sorted=sort {$a cmp $b} @unique;
$total=@total;
foreach(@sorted){
  chomp;
  s/^\s+|\s+$//g;
  $percent=100 * $count{$_} / $total;
  $percent=substr($percent,0,4);
  $counts="$count{$_} : $_ ($percent) \% ";
  if($stops !~ /$_/i){
    push(@freq,$counts);
  }
}

@histogram=sort {$b <=> $a} @freq;
for($j=0;$j<100;$j++){
  print "$histogram[$j]\n";
}
close IN;
In particular, this doesn't partition by time domain, so it can't do a true 'burst' analysis as new words enter the conversation pool. There are likely CPAN modules that could shorten (Text::Stem, Text::Scan, Text::Document, etc.) or harden (SpamAssassin) the code.

So I'm wondering about interesting heuristics, text-count variants, and performance.
The top ten terms overall would be:

count : word (%)
6174 : government (0.39) %
5564 : states (0.35) %
4524 : congress (0.29) %
4247 : united (0.27) %
3639 : year (0.23) %
3379 : people (0.21) %
2845 : great (0.18) %
2806 : country (0.18) %
2754 : now (0.17) %
2703 : public (0.17) %
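
One rough pure-Perl rewrite, sketched as a starting point rather than a drop-in replacement (same soufile.txt input and output shape, but it counts in a single pass and replaces the substring test against $stops with an exact hash lookup, so the percentages may differ slightly from the numbers above):

#!/usr/bin/perl
use strict;
use warnings;

# exact-match stop list; a hash lookup avoids substring surprises
my %stop = map { $_ => 1 } qw/
    the of and to in a that for be is our by it which as this
    with have we has i will are on been not their from at all an its or
    was but should they these such can upon other so them may any made
    must than there were under those who if only us his my most had into
    every some between during shall when own more would you without many
    also over before well what while through both within being your
    could about each where still among after since further
/;

my (%count, $total);
open my $in, '<', 'soufile.txt' or die "soufile.txt: $!";
while (<$in>) {
    for my $word ( map { lc } /([A-Za-z]+)/g ) {   # letters only, one pass
        $total++;
        $count{$word}++ unless $stop{$word};
    }
}
close $in;

# top 100, sorted numerically on the counts themselves
my @top = ( sort { $count{$b} <=> $count{$a} } keys %count )[0 .. 99];
for my $word (grep { defined } @top) {
    printf "%d : %s (%.2f) %%\n", $count{$word}, $word, 100 * $count{$word} / $total;
}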

Re: Conversation Pools
by graff (Chancellor) on Feb 21, 2003 at 04:12 UTC
    I see a few unnecessary data structures in the code (%seen, @total and @unique just duplicate information that is already available from %count), and there is a problem with this method of handling the stop word list:
    $stops = join( '', @noise );   # should use ' ' (space) on the join
    ...
    if ( $stops !~ /$_/i ) {       # regex should be /\b$_\b/
    ...
    By paying no attention to word boundaries in the stop list, words like "fan, sour, ill, heir, tall" will be excluded from tabulation, even though they weren't "listed" in @noise. (This is a nit-pick, but if someone tries to "enhance" the stop word list, the problem could get worse.)

    As for the "time domain" issue, that's rather slippery: is there any a priori (or even "empirical") notion of what an appropriate time window would be, or what sort of sampling rate is needed (words per day, per week, ...)? To handle this sort of thing, you would presumably use a hash of, say, two-element arrays keyed by word, where one element is the word count and the other is a time-varying weight; the weight decreases by some sort of log factor during each sampling interval where the given word does not occur, and is reset to 1 (or allowed to increment above 1) when it does occur in the current sample. The "currency" or "burstiness" of certain vocabulary terms might then be a function of the word count and the weighting factor. (There may be a need to dynamically adjust the stop-word list as well, or perhaps to adjust word weights relative to some background model of "generic" word frequencies.)
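
    A minimal Perl sketch of that bookkeeping (the 0.5 decay factor and the count-times-weight score are arbitrary placeholders, not a worked-out model):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # %vocab holds, per word, a two-element array: [ count, weight ].
    my %vocab;
    my $decay = 0.5;    # per-interval decay; the node above suggests a log factor

    # update() takes one pre-tokenized sample (one sampling interval,
    # e.g. one address or one day of posts) and applies the
    # decay/reset rule described above.
    sub update {
        my @sample    = @_;
        my %in_sample = map { $_ => 1 } @sample;

        # decay every known word that is absent from this sample
        for my $word (keys %vocab) {
            $vocab{$word}[1] *= $decay unless $in_sample{$word};
        }
        # count, and re-arm the weight of, words that do occur
        for my $word (@sample) {
            $vocab{$word}[0]++;
            $vocab{$word}[1] = 1;    # or ++, to let it climb above 1
        }
    }

    # one possible "burstiness" score combining count and weight
    sub burst_score {
        my $word = shift;
        return ($vocab{$word}[0] || 0) * ($vocab{$word}[1] || 0);
    }

    # toy demo: "union" keeps recurring, "evildoer" appears once and fades
    update(qw/ union strong union evildoer /);
    update(qw/ union strong economy /);
    printf "%-10s %.3f\n", $_, burst_score($_) for qw/ union evildoer /;

    Each call to update() would correspond to one sampling interval; words that keep recurring hold a weight near 1, while dormant ones fade geometrically.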

    The possible relevance of such tabulations to pinpointing terrorist discourse is (understatement:) perhaps remote.

    Basing an example on the set of "State of the Union Addresses" is probably not going to help sell the concept... I would imagine that this corpus has too many anomalous properties when compared to other forms of discourse.

Re: Conversation Pools
by Popcorn Dave (Abbot) on Feb 21, 2003 at 04:15 UTC
    Interesting idea, but as I seem to recall, those who were opposed to Echelon(?) were very conscious about sprinkling "detectable" words into their electronic and verbal communiqués for the very purpose of messing with the system.

    ++ for you, but to me it looks like someone trying to justify grant monies. Just my opinion. Others may differ.

    There is no emoticon for what I'm feeling now.

Re: Conversation Pools
by allolex (Curate) on Feb 21, 2003 at 09:48 UTC

    Real-time discourse analysis using relative frequencies might be an interesting feature to add, but hard to implement (I think).

    But your main goal of viewing clusters around frequently used terms would really benefit from an extra level of linguistic abstraction: defining topic groups (e.g. word fields) and semantic/ontological domain marking. It is more interesting to see which themes occur than which specific words do. If you had an ontology, you could see a broader picture.

    The topic groups would be the easier feature to implement: you just need to define fields of related words (e.g. nation, people, folks, public; terrorist, evildoer, enemy). That would be pretty cool.
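
    A small Perl sketch of that word-field idea (the word-to-field map here is a toy example, not a real ontology):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # hand-built "word fields": map surface words to a topic label
    my %field_of = (
        nation    => 'nation',  people   => 'nation',
        folks     => 'nation',  public   => 'nation',
        terrorist => 'enemy',   evildoer => 'enemy',
        enemy     => 'enemy',
    );

    # tally topics rather than individual words
    my %topic_count;
    while (<>) {
        for my $word ( map { lc } /([A-Za-z]+)/g ) {
            my $topic = $field_of{$word} or next;
            $topic_count{$topic}++;
        }
    }

    print "$topic_count{$_} : $_\n"
        for sort { $topic_count{$b} <=> $topic_count{$a} } keys %topic_count;

    The same histogram code then reports themes instead of words; the hard part is building (or borrowing) the word-to-field mapping itself.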

    --
    Allolex

Re: Conversation Pools
by osama (Scribe) on Feb 21, 2003 at 22:14 UTC
Re: Conversation Pools
by astrobio (Beadle) on Feb 21, 2003 at 22:01 UTC
    Perhaps this would work better as a simple stream editor, so that specialized filters could then be run against the resulting histogram in the time domain?

    This way, even the stop list just becomes another kind of high-frequency pass filter, along with foreign-word or buzzword filters.

    $threshold=100;   # print only words seen more than this many times

    while (<>) {
        chomp;
        s/[^A-Za-z\s]//g;      # keep letters and whitespace only
        s/\s+/ /g;
        foreach my $word ( split / /, $_ ) {
            next unless length $word;
            $count{$word}++;   # count every occurrence, including the first
        }
    }

    while ( my ($word, $cases) = each %count ) {
        print " $cases : $word \n" if $cases > $threshold;
    }
    While the time domain is statistically challenging, the text parsing seems so ripe for Perl.
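
    One way such a downstream filter might look, sketched with made-up script names (counter.pl standing in for the loop above) and an abbreviated stop list:

    #!/usr/bin/perl
    # stopfilter.pl (hypothetical name): reads "count : word" lines on
    # STDIN and passes through only those whose word is not a stop word,
    # so it can be chained after the counter above, e.g.
    #   perl counter.pl sou*.txt | perl stopfilter.pl
    my %stop = map { $_ => 1 }
        qw/ the of and to in a that for be is our by it /;  # abbreviated
    while (<STDIN>) {
        my ($count, $word) = /^\s*(\d+)\s*:\s*(\S+)/ or next;
        print unless $stop{ lc $word };
    }

    A foreign-word or buzzword filter would be the same loop with a different lookup table.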