in reply to Tracking popularity of Perl discussion topics

Take a list of the N most common words in the English language. This is your "useless words" list. Also create an empty "useless phrases" list.

Take the text of a given node. Count occurrences of words that are not useless. Also count occurrences of any consecutive runs of not-useless words, such as the "password authentication" in "and password authentication is".

Do this for a few hundred scattered nodes, and add some of the less useful words and phrases found to the useless lists. This step is optional but you'll weed a lot of chaff early if you do it.

# this node minus chaff
7 : words
6 : useless
4 : nodes
4 : phrases
3 : regard
3 : replies
3 : reputation
3 : useful
...
1 : are highly regarded
1 : password authentication
1 : useful words
1 : phrases found depending
1 : empty useless phrases
...
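
The counting step described above can be sketched roughly as follows. This is only an illustration, not the code that produced the listing: the `USELESS_WORDS` set is a tiny hypothetical stand-in for a real common-words list, `count_terms` and `max_run` are made-up names, and counting every sub-run of two to three words inside a longer run is an assumption about what "consecutive runs" should mean.

```python
import re
from collections import Counter

# Hypothetical stand-in for "the N most common English words".
USELESS_WORDS = {
    "the", "a", "an", "and", "is", "of", "to", "in", "that", "this",
    "it", "for", "on", "with", "be", "as", "also", "you", "your",
}


def count_terms(text, useless=USELESS_WORDS, max_run=3):
    """Count not-useless words and runs of consecutive not-useless words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    word_counts = Counter()
    phrase_counts = Counter()
    run = []
    # The trailing None is a sentinel that flushes the final run.
    for word in words + [None]:
        if word is not None and word not in useless:
            word_counts[word] += 1
            run.append(word)
            continue
        # A useless word (or the end of the text) ends the current run.
        # Count every sub-run of 2..max_run words inside it (assumption).
        for size in range(2, max_run + 1):
            for start in range(len(run) - size + 1):
                phrase_counts[" ".join(run[start:start + size])] += 1
        run = []
    return word_counts, phrase_counts


words, phrases = count_terms("and password authentication is")
for term, count in (words + phrases).most_common():
    print(count, ":", term)
```

Punctuation is ignored here, so a run can cross a comma, which matches entries such as "phrases found depending" in the listing.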

Define what "regard" is: XP/reputation? Number of replies? Replies by saints? Front-paged? It's up to you to decide what is important.

Now you can start automating. Read nodes. Assign incremental regard to the most prevalent useful words and phrases found, depending on reputation or number of replies. Associate useful phrases with sets of nodes that are highly regarded.
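
One way the automation step might look, as a minimal sketch. Everything here is assumed for illustration: the node dictionaries and their `id`, `reputation`, `replies` and `terms` keys, the `regard` definition (reputation plus reply count), and the `top_n` and `threshold` parameters. Substitute whatever definition of regard you settled on.

```python
from collections import defaultdict


def regard(node):
    # Hypothetical definition: reputation plus number of replies.
    return node["reputation"] + node["replies"]


def accumulate_regard(nodes, top_n=5, threshold=10):
    """Add each node's regard to its most prevalent terms.

    Returns (term -> total regard, term -> ids of highly regarded nodes).
    """
    term_regard = defaultdict(int)
    term_nodes = defaultdict(set)
    for node in nodes:
        score = regard(node)
        counts = node["terms"]  # term -> occurrence count in this node
        # Most prevalent terms first; ties broken alphabetically.
        top_terms = sorted(counts, key=lambda t: (-counts[t], t))[:top_n]
        for term in top_terms:
            term_regard[term] += score
            if score >= threshold:
                term_nodes[term].add(node["id"])
    return dict(term_regard), dict(term_nodes)


nodes = [
    {"id": 1, "reputation": 12, "replies": 3,
     "terms": {"password authentication": 2, "cgi": 1}},
    {"id": 2, "reputation": 2, "replies": 0,
     "terms": {"cgi": 3}},
]
totals, highly_regarded = accumulate_regard(nodes)
print(totals)
print(highly_regarded)
```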

--
[ e d @ h a l l e y . c c ]

Re: Re: Tracking popularity of Perl discussion topics
by allolex (Curate) on Sep 10, 2003 at 06:24 UTC

    Thanks a lot for your outline. It looks like a quick and efficient way of categorizing threads. I would like to add some linguistic knowledge to your algorithm, though. :)

    The more I think about how exactly I would do an implementation, the less I like the idea of just knocking off the most common words, since they can be important in combination with other words. If we want to have big endian, little endian, and the topic of dealing with big or large files, we have to reorganize the ... I think that maybe by getting rid of what linguists call "functional categories" such as determiners (the, a, these) entirely and leaving quantifiers (some, many, all) up to the search engine, we might be able to retain those common words that do play some role in defining a topic. I think lexical categories like nouns, verbs, adjectives/adverbs (but not prepositions) are the way to go.
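
    To make the idea concrete, here is a rough sketch that drops only determiners and prepositions and keeps everything else, including common adjectives like "big" and "little". The word lists are tiny, hypothetical samples, and `topic_words` is a made-up name. A real version would probably need part-of-speech tagging, since words like "that" and "to" belong to more than one category.

```python
import re

# Hypothetical, deliberately tiny closed-class word lists.
DETERMINERS = {"the", "a", "an", "this", "that", "these", "those"}
PREPOSITIONS = {"of", "in", "on", "with", "to", "for", "from", "by", "at"}
DROPPED = DETERMINERS | PREPOSITIONS


def topic_words(text):
    """Keep every word that is not a listed determiner or preposition."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in DROPPED]


print(topic_words("Dealing with the big files"))
print(topic_words("a little endian machine"))
```

    Quantifiers such as "some" pass through untouched in this sketch, so the search engine can decide what to do with them.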

    I think what you call "regard" here should be (XP_MAX_REPLY + XP_MIN_REPLY) * 0.5, but I'd have to examine how XP is really distributed across nodes in a thread before going further. Plus, this little formula doesn't add or subtract XP significance according to where the node is nested. (I think it would be a good idea to count replies to replies, maybe down to the third level of nesting. After that, the topic value tends to be either too specific to just a couple of the posters' personal usage, or simply irrelevant to the original question.)
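
    A small sketch of that formula, reading XP_MAX_REPLY and XP_MIN_REPLY as the highest and lowest XP among the replies counted. The nested-dictionary thread layout, the `xp` and `replies` keys, both function names, and returning 0.0 for a thread without replies are all assumptions made for the example.

```python
def collect_reply_xps(node, max_depth=3, depth=1):
    """Gather XP of replies down to max_depth levels of nesting."""
    xps = []
    if depth > max_depth:
        return xps
    for reply in node.get("replies", []):
        xps.append(reply["xp"])
        xps.extend(collect_reply_xps(reply, max_depth, depth + 1))
    return xps


def thread_regard(reply_xps):
    """(XP_MAX_REPLY + XP_MIN_REPLY) * 0.5 over the collected replies."""
    if not reply_xps:
        return 0.0  # assumption: a thread with no replies earns no regard
    return (max(reply_xps) + min(reply_xps)) * 0.5


thread = {
    "xp": 20,
    "replies": [
        {"xp": 10, "replies": [
            {"xp": 4, "replies": [
                {"xp": 30, "replies": [
                    {"xp": 100, "replies": []},  # fourth level: ignored
                ]},
            ]},
        ]},
        {"xp": 2, "replies": []},
    ],
}
print(thread_regard(collect_reply_xps(thread)))
```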

    --
    Allolex