Today's Cornell University release describes a method that might detect histograms of frequently used terms (terrorist terms, among other word scans). With that in mind, what is a Perl way to speed up the code below?

The source data is the US Presidential State of the Union addresses. The goal is to find 'bursts' of frequently used terms (a dynamic histogram) and to cluster around them as new words get introduced into the conversation.

The Cornell researcher proposes this as a method of searching within blogs or Usenet-style conversations as hot topics come into circulation.

#!/usr/bin/perl
# Find most frequent signals amidst political noise
use strict;
use warnings;

$| = 1;

# Stop words ("noise") to leave out of the histogram.
# qw// builds a list of words; a hash gives exact-match lookups.
my %stop = map { $_ => 1 } qw(
    the of and to in a that for
    be is our by it which as this with
    have we has i will are
    on been not their from at all an its or was but
    should they these such can upon other so them
    may any made must than there were under those
    who if only us his my most had into every
    some between during shall when own more
    would you without many also over before
    well what while through both within
    being your could about each where still
    among after since further
);

# single file of State of Union addresses
open(my $in, '<', 'soufile.txt') or die "Cannot open soufile.txt: $!";

my %count;
my $total = 0;

# Read line by line instead of loading the whole file into a list
while (my $line = <$in>) {
    # clean left-overs: drop HTML entities first, then everything
    # that is not a letter or whitespace
    $line =~ s/&#?\w+;//g;
    $line =~ s/[^A-Za-z\s]//g;

    # split ' ' skips leading whitespace and collapses runs of whitespace
    foreach my $word (split ' ', lc $line) {
        $total++;
        $count{$word}++;    # counts the first occurrence too
    }
}
close $in;

die "No words found in soufile.txt\n" unless $total;

# Sort the non-noise words by descending count, ties alphabetically
my @ranked = sort { $count{$b} <=> $count{$a} || $a cmp $b }
             grep { !$stop{$_} } keys %count;
splice(@ranked, 100) if @ranked > 100;    # keep the top 100

foreach my $word (@ranked) {
    my $percent = sprintf '%.2f', 100 * $count{$word} / $total;
    print "$count{$word}  : $word ($percent) % \n";
}

In particular, this doesn't group the counts by time, which a true 'burst' analysis would need in order to see new words entering the conversation pool. There are likely CPAN modules that shorten (Text::Stem, Text::Scan, Text::Document, etc.) or harden (SpamAssassin) the code.
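To make the time-domain point concrete, here is a minimal sketch of one possible burst heuristic: count words per time bin, then score each word by its share of the bin divided by its share of the whole corpus. This is only a simple ratio of my own, not the Cornell researcher's method. The helper names count_by_bin and burst_scores, the [label, text] input shape, the minimum-count cutoff, and the sample text are all made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count words per time bin.
# Input: a list of [label, text] pairs (how the real file would be
# split into bins is an assumption left open here).
sub count_by_bin {
    my @bins = @_;
    my (%per_bin, %overall, %bin_total);
    my $grand_total = 0;
    for my $bin (@bins) {
        my ($label, $text) = @$bin;
        (my $clean = lc $text) =~ s/[^a-z\s]//g;
        for my $word (split ' ', $clean) {
            $per_bin{$label}{$word}++;
            $overall{$word}++;
            $bin_total{$label}++;
            $grand_total++;
        }
    }
    return (\%per_bin, \%overall, \%bin_total, $grand_total);
}

# Hypothetical helper: burst score for every word in one bin that
# occurs at least $min_count times there.
# score = (share of the bin) / (share of the whole corpus)
sub burst_scores {
    my ($label, $min_count, $per_bin, $overall, $bin_total, $grand_total) = @_;
    my %score;
    my $words = $per_bin->{$label} || {};
    for my $word (keys %$words) {
        next if $words->{$word} < $min_count;
        my $bin_share = $words->{$word} / $bin_total->{$label};
        my $all_share = $overall->{$word} / $grand_total;
        $score{$word} = $bin_share / $all_share;
    }
    return \%score;
}

# Made-up sample text, two bins
my @bins = (
    [ 1800 => 'the union is strong the union endures' ],
    [ 1900 => 'the railroad grows the railroad expands the union' ],
);
my @stats = count_by_bin(@bins);

for my $bin (@bins) {
    my $label = $bin->[0];
    my $score = burst_scores($label, 2, @stats);
    my @top   = sort { $score->{$b} <=> $score->{$a} || $a cmp $b } keys %$score;
    printf "%s: %s (%.2f)\n", $label, $top[0], $score->{ $top[0] } if @top;
}
```

A score above 1 means the word is over-represented in that bin compared with the corpus as a whole. The minimum-count cutoff is there because a word that appears once, in a single bin, would otherwise always score highest.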

So I am wondering about interesting heuristics, text-count variants, and performance.
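On the performance side, one place a word counter like this can spend time is the stop-word test. Here is a rough sketch that compares a hash lookup with a regex match against one joined string, using the core Benchmark module. The short stop list, the sample words, and the function names filter_by_hash and filter_by_regex are made up for illustration; timings will vary by machine.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Short stop list for illustration only
my @stop_words = qw(the of and to in a that for);
my %is_stop    = map { $_ => 1 } @stop_words;
my $joined     = join ' ', @stop_words;

# Made-up input words
my @words = qw(the government of the people and he states that to);

# Hash lookup: exact match, one lookup per word
sub filter_by_hash {
    return grep { !$is_stop{$_} } @_;
}

# Regex against the joined string: substring match, so the whole
# stop string is scanned for every word
sub filter_by_regex {
    return grep { $joined !~ /\Q$_\E/i } @_;
}

print "hash : @{[ filter_by_hash(@words) ]}\n";
print "regex: @{[ filter_by_regex(@words) ]}\n";

# Run each variant for at least one CPU second and compare rates
cmpthese(-1, {
    hash  => sub { my @kept = filter_by_hash(@words) },
    regex => sub { my @kept = filter_by_regex(@words) },
});
```

The two filters do not return the same words: the regex version drops "he" because it is a substring of "the", while the hash version keeps it. So the choice affects correctness as well as speed.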

The top ten terms overall would be:
count : word (%)
6174  : government (0.39) % 
5564  : states (0.35) % 
4524  : congress (0.29) % 
4247  : united (0.27) % 
3639  : year (0.23) % 
3379  : people (0.21) % 
2845  : great (0.18) % 
2806  : country (0.18) % 
2754  : now (0.17) % 
2703  : public (0.17) %

In reply to Conversation Pools by astrobio
