in reply to Re: Tracking popularity of Perl discussion topics
in thread Tracking popularity of Perl discussion topics
Thanks a lot for your outline. It looks like a quick and efficient way of categorizing threads. I would like to add some linguistic knowledge to your algorithm, though. :)
The more I think about how exactly I would implement this, the less I like the idea of just knocking off the most common words, since they can be important in combination with other words. If we want to have big endian, little endian, and the topic of dealing with big or large files, we have to reorganize the ... I think that by getting rid of what linguists call "functional categories" such as determiners (the, a, these) entirely, and leaving quantifiers (some, many, all) up to the search engine, we might be able to retain those common words that do play some role in defining a topic. I think lexical categories like nouns, verbs, and adjectives/adverbs (but not prepositions) are the way to go.
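A minimal sketch of that idea, in Python purely for illustration. It drops only determiners and keeps adjacent word pairs, so that "big endian" and "big files" stay distinguishable. The `DETERMINERS` list and the `topic_terms` function are hypothetical names invented for this example, the list is far from complete, and no real part-of-speech tagging is attempted.

```python
import re

# Assumption: a tiny hand-written determiner list, for illustration only.
DETERMINERS = {"the", "a", "an", "this", "that", "these", "those"}

def topic_terms(title):
    """Return the words of a title plus adjacent word pairs, minus determiners."""
    # Lowercase the title and split it into alphanumeric tokens.
    words = [w for w in re.findall(r"[a-z0-9]+", title.lower())
             if w not in DETERMINERS]
    # Pair each remaining word with its successor, so a common word
    # such as "big" is kept in the context of its neighbour.
    pairs = [" ".join(p) for p in zip(words, words[1:])]
    return words + pairs

print(topic_terms("Reading the big endian integers"))
print(topic_terms("Dealing with these big files"))
```

Note that pairs are formed after the determiners are removed, so "reading the big" yields the pair "reading big". Whether that is desirable would depend on how the search engine matches phrases.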
I think what you call "regard" here should be (XP_MAX_REPLY + XP_MIN_REPLY) * 0.5, but I'd have to examine how XP is really distributed across nodes in a thread before going further. Plus, this little formula doesn't add or subtract XP significance according to where the node is nested. (I think it would be a good idea to count replies to replies, maybe down to the third level of nesting. After that, the topic value tends to be either too specific to just a couple of the posters' personal usage, or simply irrelevant to the original question.)
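A rough sketch of the midpoint formula combined with a nesting cutoff, in Python purely for illustration. It assumes that XP_MAX_REPLY and XP_MIN_REPLY mean the highest and lowest XP among the counted replies, and that a thread is a nested dict with hypothetical `xp` and `replies` keys. It only ignores replies below the third level; it does not weight XP by depth, which remains an open question.

```python
def reply_xp(node, max_depth=3, depth=1):
    """Collect the XP of replies down to max_depth levels below node."""
    if depth > max_depth:
        return []
    xp = []
    for reply in node.get("replies", []):
        xp.append(reply["xp"])
        # Descend into replies to this reply, one level deeper.
        xp.extend(reply_xp(reply, max_depth, depth + 1))
    return xp

def regard(node, max_depth=3):
    """Midpoint of the highest and lowest reply XP within max_depth levels."""
    xp = reply_xp(node, max_depth)
    if not xp:
        return 0.0  # Assumption: a node without replies gets a regard of 0.
    return (max(xp) + min(xp)) * 0.5

# Hypothetical thread: the XP values are made up for the example.
thread = {"xp": 20, "replies": [
    {"xp": 10, "replies": [
        {"xp": 4, "replies": [
            {"xp": 30, "replies": [
                {"xp": 100, "replies": []},  # fourth level, ignored by default
            ]},
        ]},
    ]},
    {"xp": 6, "replies": []},
]}

print(regard(thread))               # counts replies down to the third level
print(regard(thread, max_depth=1))  # counts direct replies only
```

With the default cutoff, the fourth-level reply does not influence the result, however high its XP.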
--
Allolex