astrobio has asked for the wisdom of the Perl Monks concerning the following question:

Based on today's Cornell University release on a method that might detect bursts of frequently used terrorist terms (among other word scans), what is a Perl way to speed up this code?

The source data is the US Presidential State of the Union addresses. The goal is to find 'bursts' of frequently used terms - a dynamic histogram - to cluster around as new words get introduced into the conversation.

The Cornell researcher proposes this as a method of searching within blogs or Usenet-style conversations as hot topics come into circulation.
#!/usr/bin/perl
# Find most frequent signals amidst political noise
$|=1;

@noise=qq/ the of and to in a that for be is our by it which as this
  with have we has i will are on been not their from at all an its or
  was but should they these such can upon other so them may any made
  must than there were under those who if only us his my most had into
  every some between during shall when own more would you without many
  also over before well what while through both within being your
  could about each where still among after since further /;
$stops=join('',@noise);

# single file of State of Union addresses
open(IN,"<soufile.txt");
foreach(<IN>){
  chomp;
  # clean left-overs
  s/^\s+|\s+$//g;
  s/[^A-z\s]//g;
  s/&(.*?);//g;
  s/\[|\]//g;
  s/\_//g;
  s/\`//g;
  s/\\//g;
  s/\s+/ /g;
  @words=split(/ /,$_);
  foreach(@words){
    $word=lc($_);
    push(@total,$word);
    if($seen{$word} !=1){
      push(@unique,$word);
      $seen{$word}=1;
    } else {
      $count{$word}=$count{$word}+1;
    }
  }
}

@sorted=sort {$a cmp $b} @unique;
$total=@total;
foreach(@sorted){
  chomp;
  s/^\s+|\s+$//g;
  $percent=100 * $count{$_} / $total;
  $percent=substr($percent,0,4);
  $counts="$count{$_} : $_ ($percent) \% ";
  if($stops !~ /$_/i){
    push(@freq,$counts);
  }
}

@histogram=sort {$b <=> $a} @freq;
for($j=0;$j<100;$j++){
  print "$histogram[$j]\n";
}
close IN;
In particular, this doesn't partition by time domain, so it can't do a true 'burst' analysis as new words enter the conversation pool. There are likely CPAN modules that could shorten (Text::Stem, Text::Scan, Text::Document, etc.) or harden (SpamAssassin) the code.

So I'm wondering about interesting heuristics, text-count variants, and performance.
The top ten terms overall would be:

count : word (%)
6174 : government (0.39) %
5564 : states (0.35) %
4524 : congress (0.29) %
4247 : united (0.27) %
3639 : year (0.23) %
3379 : people (0.21) %
2845 : great (0.18) %
2806 : country (0.18) %
2754 : now (0.17) %
2703 : public (0.17) %
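
One rough pure-Perl rewrite, sketched as a starting point rather than a drop-in replacement (same soufile.txt input and output shape, but it counts in a single pass and replaces the substring test against $stops with an exact hash lookup, so the percentages may differ slightly from the numbers above):

#!/usr/bin/perl
use strict;
use warnings;

# exact-match stop list; a hash lookup avoids substring surprises
my %stop = map { $_ => 1 } qw/
    the of and to in a that for be is our by it which as this
    with have we has i will are on been not their from at all an its or
    was but should they these such can upon other so them may any made
    must than there were under those who if only us his my most had into
    every some between during shall when own more would you without many
    also over before well what while through both within being your
    could about each where still among after since further
/;

my (%count, $total);
open my $in, '<', 'soufile.txt' or die "soufile.txt: $!";
while (<$in>) {
    for my $word ( map { lc } /([A-Za-z]+)/g ) {   # letters only, one pass
        $total++;
        $count{$word}++ unless $stop{$word};
    }
}
close $in;

# top 100, sorted numerically on the counts themselves
my @top = ( sort { $count{$b} <=> $count{$a} } keys %count )[0 .. 99];
for my $word (grep { defined } @top) {
    printf "%d : %s (%.2f) %%\n", $count{$word}, $word, 100 * $count{$word} / $total;
}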

Re: Conversation Pools
by graff (Chancellor) on Feb 21, 2003 at 04:12 UTC
    I see a few unnecessary data structures in the code (%seen, @total and @unique just duplicate information that is already available from %count), and there is a problem with this method of handling the stop word list:
    $stops = join( '', @noise );   # should use ' ' (space) on the join
    ...
    if ( $stops !~ /$_/i ) {       # regex should be /\b$_\b/
    ...
    By paying no attention to word boundaries in the stop list, words like "fan, sour, ill, heir, tall" will be excluded from tabulation, even though they weren't "listed" in @noise. (This is a nit-pick, but if someone tries to "enhance" the stop word list, the problem could get worse.)

    As for the "time domain" issue, that's rather slippery: is there any a priori (or even "empirical") notion of what an appropriate time window would be, or what sort of sampling rate is needed (words per day, per week, ...)? To handle this sort of thing, you would presumably use a hash of, say, two-element arrays keyed by word, where one element is the word count and the other is a time-varying weight; the weight decreases by some sort of log factor during each sampling interval where the given word does not occur, and is reset to 1 (or allowed to increment above 1) when it does occur in the current sample. The "currency" or "burstiness" of certain vocabulary terms might then be a function of the word count and the weighting factor. (There may be a need to dynamically adjust the stop-word list as well, or perhaps to adjust word weights relative to some background model of "generic" word frequencies.)
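
    A minimal Perl sketch of that bookkeeping (the 0.5 decay factor and the count-times-weight score are arbitrary placeholders, not a worked-out model):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # %vocab holds, per word, a two-element array: [ count, weight ].
    my %vocab;
    my $decay = 0.5;    # per-interval decay; the node above suggests a log factor

    # update() takes one pre-tokenized sample (one sampling interval,
    # e.g. one address or one day of posts) and applies the
    # decay/reset rule described above.
    sub update {
        my @sample    = @_;
        my %in_sample = map { $_ => 1 } @sample;

        # decay every known word that is absent from this sample
        for my $word (keys %vocab) {
            $vocab{$word}[1] *= $decay unless $in_sample{$word};
        }
        # count, and re-arm the weight of, words that do occur
        for my $word (@sample) {
            $vocab{$word}[0]++;
            $vocab{$word}[1] = 1;    # or ++, to let it climb above 1
        }
    }

    # one possible "burstiness" score combining count and weight
    sub burst_score {
        my $word = shift;
        return ($vocab{$word}[0] || 0) * ($vocab{$word}[1] || 0);
    }

    # toy demo: "union" keeps recurring, "evildoer" appears once and fades
    update(qw/ union strong union evildoer /);
    update(qw/ union strong economy /);
    printf "%-10s %.3f\n", $_, burst_score($_) for qw/ union evildoer /;

    Each call to update() would correspond to one sampling interval; words that keep recurring hold a weight near 1, while dormant ones fade geometrically.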

    The possible relevance of such tabulations to pinpointing terrorist discourse is (understatement:) perhaps remote.

    Basing an example on the set of "State of the Union Addresses" is probably not going to help sell the concept... I would imagine that this corpus has too many anomalous properties when compared to other forms of discourse.

Re: Conversation Pools
by Popcorn Dave (Abbot) on Feb 21, 2003 at 04:15 UTC
    Interesting idea, but as I seem to recall, those who were opposed to Echelon(?) were very conscious about sprinkling "detectable" words into their electronic and verbal communiqués for the very purpose of messing with the system.

    ++ for you, but to me it looks like someone trying to justify grant monies. Just my opinion. Others may differ.

    There is no emoticon for what I'm feeling now.

Re: Conversation Pools
by allolex (Curate) on Feb 21, 2003 at 09:48 UTC

    Real-time discourse analysis using relative frequencies might be an interesting feature to add, but hard to implement (I think).

    But your main goal of viewing clusters around frequently used terms would really benefit from an extra level of linguistic abstraction: defining topic groups (e.g. word fields) and semantic/ontological domain marking. It is more interesting to see which themes occur than which specific words do. If you had an ontology, you could see a broader picture.

    The topic groups would be the easier feature to implement: you just need to define fields of related words (e.g. nation, people, folks, public; terrorist, evildoer, enemy). That would be pretty cool.
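
    A small Perl sketch of that word-field idea (the word-to-field map here is a toy example, not a real ontology):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # hand-built "word fields": map surface words to a topic label
    my %field_of = (
        nation    => 'nation',  people   => 'nation',
        folks     => 'nation',  public   => 'nation',
        terrorist => 'enemy',   evildoer => 'enemy',
        enemy     => 'enemy',
    );

    # tally topics rather than individual words
    my %topic_count;
    while (<>) {
        for my $word ( map { lc } /([A-Za-z]+)/g ) {
            my $topic = $field_of{$word} or next;
            $topic_count{$topic}++;
        }
    }

    print "$topic_count{$_} : $_\n"
        for sort { $topic_count{$b} <=> $topic_count{$a} } keys %topic_count;

    The same histogram code then reports themes instead of words; the hard part is building (or borrowing) the word-to-field mapping itself.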

    --
    Allolex

Re: Conversation Pools
by osama (Scribe) on Feb 21, 2003 at 22:14 UTC
Re: Conversation Pools
by astrobio (Beadle) on Feb 21, 2003 at 22:01 UTC
    Perhaps this would work better as a simple stream editor, so that specialized filters could then be run against the resulting histogram in the time domain?

    This way, even the stop list just becomes another kind of high-frequency pass filter, along with foreign-word or buzzword filters.

    $threshold=100;   # print only words seen more than this many times

    while (<>) {
        chomp;
        s/[^A-Za-z\s]//g;      # keep letters and whitespace only
        s/\s+/ /g;
        foreach my $word ( split / /, $_ ) {
            next unless length $word;
            $count{$word}++;   # count every occurrence, including the first
        }
    }

    while ( my ($word, $cases) = each %count ) {
        print " $cases : $word \n" if $cases > $threshold;
    }
    While the time domain is statistically challenging, the text parsing seems so ripe for Perl.
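
    One way such a downstream filter might look, sketched with made-up script names (counter.pl standing in for the loop above) and an abbreviated stop list:

    #!/usr/bin/perl
    # stopfilter.pl (hypothetical name): reads "count : word" lines on
    # STDIN and passes through only those whose word is not a stop word,
    # so it can be chained after the counter above, e.g.
    #   perl counter.pl sou*.txt | perl stopfilter.pl
    my %stop = map { $_ => 1 }
        qw/ the of and to in a that for be is our by it /;  # abbreviated
    while (<STDIN>) {
        my ($count, $word) = /^\s*(\d+)\s*:\s*(\S+)/ or next;
        print unless $stop{ lc $word };
    }

    A foreign-word or buzzword filter would be the same loop with a different lookup table.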