Because the stop-list check pays no attention to word boundaries, words like "fan, sour, ill, heir, tall" will be excluded from tabulation even though they weren't "listed" in @noise. (This is a nit-pick, but if someone tries to "enhance" the stop word list, the problem could get worse.)

    $stops = join( '', @noise );   # should use ' ' (space) on the join
    ...
    if ( $stops !~ /$_/i ) {       # regex should be /\b$_\b/
    ...
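A corrected sketch of that check might look like the following (the @words and %count names are just stand-ins for whatever the original script uses):

    my $stops = join( ' ', @noise );            # space-separated, so stop words can't run together
    for my $word (@words) {
        next if $stops =~ /\b\Q$word\E\b/i;     # \b anchors keep "fan" from matching inside a joined pair
        $count{$word}++;
    }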
As for the "time domain" issue, that's rather slippery: is there any a priori (or even "empirical") notion of what an appropriate time window would be, or what sort of sampling rate is needed (words per day, per week, ...)? To handle this sort of thing, you would presumably use a hash of, say, two-element arrays keyed by word, where one element is the word count and the other is a time-varying weight. The weight decreases by some sort of log factor during each sampling interval where the given word does not occur, and is reset to 1 (or allowed to increment above 1) when the word does occur in the current sample. The "currency" or "burst-ness" of certain vocabulary terms might then be a function of the word count and weighting factor. (There may be a need to dynamically adjust the stop-word list as well, or perhaps adjust word weights relative to some background model of "generic" word frequencies.)
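A rough sketch of that bookkeeping might look like this (the decay factor, subroutine names, and scoring function are all made up for illustration; a log-based decay or a background frequency model could be swapped in):

    use strict;
    use warnings;

    my %vocab;          # word => [ total_count, current_weight ]
    my $decay = 0.5;    # arbitrary per-interval decay factor

    # Call once per sampling interval with the list of words seen in that interval.
    sub update_interval {
        my @words = @_;
        my %seen  = map { lc($_) => 1 } @words;

        # decay the weight of every known word that did not occur this interval
        for my $w (keys %vocab) {
            $vocab{$w}[1] *= $decay unless $seen{$w};
        }

        # tally words that did occur, and reset (or bump) their weight
        for my $w (@words) {
            my $key = lc $w;
            $vocab{$key}[0]++;
            $vocab{$key}[1] = 1;    # or $vocab{$key}[1]++ to let "bursty" terms climb above 1
        }
    }

    # one possible "currency" score: count scaled by the current weight
    sub currency {
        my $key = lc shift;
        return $vocab{$key} ? $vocab{$key}[0] * $vocab{$key}[1] : 0;
    }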
The possible relevance of such tabulations to pinpointing terrorist discourse is (understatement:) perhaps remote.
Basing an example on the set of "State of the Union Addresses" is probably not going to help sell the concept... I would imagine that this corpus has too many anomalous properties when compared to other forms of discourse.