I see a few unnecessary variables in the code (%seen, @total and @unique just duplicate information that is already available from %count), and there is a problem with this method of handling the stop word list:
    $stops = join( '', @noise );   # should use ' ' (space) on the join
    ...
    if ( $stops !~ /$_/i) {        # regex should be /\b$_\b/
    ...
Because the test pays no attention to word boundaries in the stop list, any word that happens to be a substring of the joined string (such as "fan", "sour", "ill", "heir" or "tall") will be excluded from tabulation, even though it was never "listed" in @noise. (This is a nit-pick, but if someone tries to "enhance" the stop word list, the problem could get worse.)
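To make the substring problem concrete, here is a minimal sketch. The contents of @noise below are hypothetical, picked only so that the five example words turn up inside the joined string, and the %is_stop hash is my own suggestion rather than anything in the original code. An exact hash lookup avoids the regex altogether, which goes a step further than the \b fix.

```perl
use strict;
use warnings;

# Hypothetical stop list, chosen only to reproduce the problem;
# the real @noise is not shown in this post.
my @noise = qw(if an is our will their it all);

# Original approach: one long string, searched without word boundaries.
my $stops = join( '', @noise );    # "ifanisourwilltheiritall"

# Alternative (an assumption, not the original code): exact-match lookup.
my %is_stop;
$is_stop{ lc $_ } = 1 for @noise;

my ( @lost, %count );
for my $word (qw(fan sour ill heir tall)) {
    # The unanchored match treats these as noise, although none is listed.
    push @lost, $word if $stops =~ /$word/i;

    # The hash lookup only skips words that are literally in the list.
    $count{ lc $word }++ unless $is_stop{ lc $word };
}

print "wrongly excluded by the joined string: @lost\n";
print "kept by the hash lookup: ", join( ' ', sort keys %count ), "\n";
```

With this particular list, all five words match somewhere in $stops, while the hash lookup keeps every one of them.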

As for the "time domain" issue, that's rather slippery: is there any a priori (or even "empirical") notion of what an appropriate time window would be, or what sort of sampling rate is needed (words per day, per week, ...)? To handle this sort of thing, you would presumably use a hash of, say, two-element arrays keyed by word, where one element is the word count and the other is a time-varying weight. The weight decreases by some sort of log factor during each sampling interval in which the given word does not occur, and is reset to 1 (or allowed to increment above 1) when the word does occur in the current sample. The "currency" or "burst-ness" of certain vocabulary terms might then be a function of the word count and the weighting factor. (There may be a need to dynamically adjust the stop-word list as well, or perhaps to adjust word weights relative to some background model of "generic" word frequencies.)
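A rough sketch of that data structure follows. Everything in it is an assumption made for illustration: the names %stat, update_interval and currency are hypothetical, the decay is a plain multiplicative factor of 0.5 per interval (standing in for "some sort of log factor"), and the score is simply count times weight, which is only one of many possible functions.

```perl
use strict;
use warnings;

my $DECAY = 0.5;    # assumed per-interval decay factor
my %stat;           # word => [ total count, current weight ]

# Feed in the words seen during one sampling interval.
sub update_interval {
    my (@words) = @_;
    my %in_sample;
    $in_sample{ lc $_ }++ for @words;

    # Decay the weight of every known word absent from this sample.
    for my $word ( keys %stat ) {
        $stat{$word}[1] *= $DECAY unless $in_sample{$word};
    }

    # Add to the count of each word that occurred; reset its weight to 1.
    for my $word ( keys %in_sample ) {
        $stat{$word}[0] += $in_sample{$word};
        $stat{$word}[1] = 1;
    }
    return;
}

# One possible "currency" score: total count scaled by current weight.
sub currency {
    my ($word) = @_;
    my $s = $stat{ lc $word } or return 0;
    return $s->[0] * $s->[1];
}

update_interval(qw(budget budget war));
update_interval(qw(war peace));
update_interval(qw(peace));

printf "%-7s %s\n", $_, currency($_) for qw(budget war peace);
```

In this toy run each word ends up with a total count of 2, but "budget" has been absent for two intervals and "war" for one, so their scores fall behind "peace". Choosing the interval length and the decay factor is exactly the slippery part mentioned above.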

The possible relevance of such tabulations to pinpointing terrorist discourse is (understatement:) perhaps remote.

Basing an example on the set of "State of the Union Addresses" is probably not going to help sell the concept... I would imagine that this corpus has too many anomalous properties when compared to other forms of discourse.


In reply to Re: Conversation Pools by graff
in thread Conversation Pools by astrobio
