Based on today's Cornell University release on a method that might detect histograms of frequently used terrorist terms (among other word scans), is there a Perl way to speed up this code?

The source data is the US Presidential State of the Union addresses. The goal is to find 'bursts' of frequently used terms - a dynamic histogram - to cluster around as new words get introduced into the conversation.

The Cornell researcher proposes this as a method of searching within blogs or usenet style conversations, as hot topics get brought into circulation.
#!/usr/bin/perl
# Find most frequent signals amidst political noise
use strict;
use warnings;

$| = 1;

# Stop words: a hash gives exact-match lookup; the original substring
# match against one joined string also dropped words like "he" or "her".
my %stop = map { $_ => 1 } qw(
    the of and to in a that for be is our by it which as this with have we
    has i will are on been not their from at all an its or was but should
    they these such can upon other so them may any made must than there
    were under those who if only us his my most had into every some between
    during shall when own more would you without many also over before well
    what while through both within being your could about each where still
    among after since further
);

# single file of State of Union addresses
open(my $in, '<', 'soufile.txt') or die "Cannot open soufile.txt: $!";

my (%count, $total);
while (my $line = <$in>) {
    $line =~ s/&#?\w+;//g;        # strip HTML entities before other cleanup
    $line =~ s/[^A-Za-z\s]//g;    # [A-z] also matched [ \ ] ^ _ `
    foreach my $word (split ' ', lc $line) {
        $total++;                 # total includes stop words, as before
        $count{$word}++;          # first occurrence counts too
    }
}
close $in;

# Sort by count (descending), skipping stop words, and print the top 100
my @ranked = sort { $count{$b} <=> $count{$a} || $a cmp $b }
             grep { !$stop{$_} } keys %count;
$#ranked = 99 if @ranked > 100;

foreach my $word (@ranked) {
    printf "%d : %s (%.2f) %%\n",
        $count{$word}, $word, 100 * $count{$word} / $total;
}

In particular, this doesn't sort by time domain, which a true 'burst' analysis needs in order to see new words entering the conversation pool. There are likely CPAN modules that would shorten the code (Text::Stem, Text::Scan, Text::Document, etc.) or harden it (SpamAssassin).
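
As a starting point for the missing time dimension, here is a minimal sketch of a per-period burst score. It is a simple ratio heuristic, not the Cornell researcher's method: a word's relative frequency in one period is divided by its add-one-smoothed relative frequency in all earlier periods. The `%text_by_year` data, the year keys, and the sub names `count_by_period` and `burst_scores` are all made up for illustration; real input would come from splitting soufile.txt by address.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input: one text per period (here, per year).
my %text_by_year = (
    1990 => 'the government and the congress of the united states',
    1991 => 'the congress and the people of the united states',
    1992 => 'terror terror and the government respond to terror',
);

# Count words per period: { period => { word => count } }.
sub count_by_period {
    my ($texts) = @_;
    my %counts;
    for my $period (keys %$texts) {
        (my $clean = lc $texts->{$period}) =~ s/[^a-z\s]//g;
        $counts{$period}{$_}++ for split ' ', $clean;
    }
    return \%counts;
}

# Burst score of each word in $period: its relative frequency in that
# period divided by its add-one-smoothed relative frequency in all
# earlier periods. Scores well above 1 suggest a newly "hot" term.
sub burst_scores {
    my ($counts, $period) = @_;
    my %before;
    my $before_total = 0;
    for my $p (grep { $_ < $period } keys %$counts) {
        while (my ($w, $n) = each %{ $counts->{$p} }) {
            $before{$w}   += $n;
            $before_total += $n;
        }
    }
    my $now       = $counts->{$period} || {};
    my $now_total = 0;
    $now_total += $_ for values %$now;
    my $vocab = keys(%before) + keys(%$now);   # rough smoothing denominator
    my %score;
    for my $w (keys %$now) {
        my $p_now    = $now->{$w} / $now_total;
        my $p_before = (($before{$w} || 0) + 1) / ($before_total + $vocab);
        $score{$w} = $p_now / $p_before;
    }
    return \%score;
}

my $counts = count_by_period(\%text_by_year);
my $scores = burst_scores($counts, 1992);
for my $w (sort { $scores->{$b} <=> $scores->{$a} || $a cmp $b } keys %$scores) {
    printf "%-12s %.2f\n", $w, $scores->{$w};
}
```

With the toy data, "terror" ranks first for 1992 because it never appears in the earlier years, while "the" scores below 1. Stop words could be filtered out before scoring, as in the script above.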

So I'm wondering about interesting heuristics, text-count variants, and performance.
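
On the performance question, one thing that can be measured directly is the stop-word test. The sketch below uses the core Benchmark module to compare a regex match against one joined string (the approach in the script above) with a hash lookup. The shortened stop list, the sample words, and the sub names `is_stop_regex` and `is_stop_hash` are illustrative assumptions; timings will vary by machine, so treat the output as something to measure rather than a given.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Shortened stop list, for illustration only.
my @stop_words = qw(the of and to in a that for be is our by it which);
my $stops = join ' ', @stop_words;        # one string, matched by regex
my %stop  = map { $_ => 1 } @stop_words;  # hash for exact lookup

# Substring match: also flags "he" because it occurs inside "the".
sub is_stop_regex { my ($w) = @_; return $stops =~ /\Q$w\E/i ? 1 : 0 }

# Exact match on the lowercased word.
sub is_stop_hash  { my ($w) = @_; return $stop{lc $w} ? 1 : 0 }

my @sample = qw(government he the congress it people);

# Run each variant 500,000 times and print a comparison table.
cmpthese(500_000, {
    regex => sub { is_stop_regex($_) for @sample },
    hash  => sub { is_stop_hash($_)  for @sample },
});
```

Besides any speed difference, the two tests do not agree: the regex version treats "he" as a stop word, while the hash version does not.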
The top ten terms overall would be:

count : word (%)
6174 : government (0.39) %
5564 : states (0.35) %
4524 : congress (0.29) %
4247 : united (0.27) %
3639 : year (0.23) %
3379 : people (0.21) %
2845 : great (0.18) %
2806 : country (0.18) %
2754 : now (0.17) %
2703 : public (0.17) %

In reply to Conversation Pools by astrobio
