Re: Tracking popularity of Perl discussion topics
by hossman (Prior) on Sep 08, 2003 at 22:20 UTC
|
The root of your problem would be in identifying what the topics of a given thread/node are.
Assuming you don't want to hire a team of helper monkeys to classify every node into a taxonomy, you would need to look into some of the many attempts at classifying documents by programmatic analysis. This comes up about once a year on Slashdot in the form of a "has anyone found a good way to categorize all of your email?" question ... there seem to be some decent algorithms out there for doing classification, but many of them require a predefined list of categories with sample sets.
I've heard of systems that can find common topics among large quantities of text, but I've never really looked into it in depth.
| [reply] |
|
|
I have given this one some thought, considering that it is the linguistic aspect of this whole idea. (You know me...)
How about creating an ontology based on the 'core' vocabulary of the highest-rated nodes of a hand-marked thread? For example, we have a question or meditation that we mark as 'security', 'password', 'login'. Then we take the nodes from a median score upward (because they are most likely to be relevant to the topic) and extract their vocabulary, storing it in a keyword list (with verbs, nouns, adjectives) that represents the junction of the topics mentioned above. That would be a quick and dirty way of defining which members belong to a topic category.
These could be split up later and put into XML topic maps which, by virtue of their structure, would allow topic clustering on a much larger scale.
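A rough sketch of the extraction step described above, in Python for illustration. The `(reputation, text)` layout, the `core_vocabulary` name, and the `min_length` cutoff are assumptions made for this example; it also does no part-of-speech filtering, so it keeps every word of sufficient length rather than only verbs, nouns, and adjectives.

```python
import re
from statistics import median

def core_vocabulary(nodes, min_length=3):
    """Collect words from nodes scoring at or above the median reputation.

    `nodes` is a list of (reputation, text) pairs; the field layout and
    the `min_length` cutoff are assumptions made for this sketch.
    """
    cutoff = median(rep for rep, _ in nodes)
    vocabulary = set()
    for rep, text in nodes:
        if rep >= cutoff:
            # Lowercase the text and keep alphabetic tokens only.
            words = re.findall(r"[a-z']+", text.lower())
            vocabulary.update(w for w in words if len(w) >= min_length)
    return vocabulary

# Hypothetical thread hand-marked as 'security', 'password', 'login'.
thread = [
    (42, "Store a salted hash of the password, never the password itself."),
    (17, "Check the login cookie on every request."),
    (3, "Me too, thanks."),
]
keywords = core_vocabulary(thread)
```

The low-scoring third node falls below the median and contributes nothing to the keyword list.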
--
Allolex
| [reply] |
Re: Tracking popularity of Perl discussion topics
by dmitri (Priest) on Sep 08, 2003 at 22:33 UTC
|
Programmatically going through all perlmonks.org threads, classifying them, and storing the results in a table is one thing, but outside the site itself the result will be of marginal use relative to the effort.
Consider these issues:
- Threads' popularity changes. Will you regenerate your table every day? week? month?
- Will you be the only one using the table? If you publish it on-line with links to threads, your page in effect becomes part of perlmonks.org.
- Is it a good idea to put more load on an already very loaded server? We're nearing 300K nodes.
Maybe a proposal for an extra thread indexing scheme could be made to the site maintainers. If the proposal finds enough support, the change might be implemented.
Myself, I am at least curious about what the top-rated threads are.
| [reply] |
|
|
You bring up some very interesting problems and a good suggestion.
At the risk of stating the obvious, any solution creating more than a little server load would have to be tossed out. That means an external server and cooperation with the Monastery developers.
The changing popularity is actually a feature. It is possible to give exactly those statistics you mentioned: daily (within 24 hours of posting), weekly, and monthly. I would be interested to know where the average falloff for nodes is, i.e. when the votes stop trickling in, even as a slow drip. You could then generate something like "these are the thread topics that keep coming up" or "these topics were mentioned once and never again", along with the popularity of the threads.
I think the setup would have to be a bit like jcwren's Perl Monks Statistics site, a part of the Monastery. I don't see where metainformation about the site could be anything other than an extension of it. :)
--
Allolex
| [reply] |
Re: Tracking popularity of Perl discussion topics
by valdez (Monsignor) on Sep 09, 2003 at 01:43 UTC
|
| [reply] |
Re: Tracking popularity of Perl discussion topics
by dmitri (Priest) on Sep 08, 2003 at 22:06 UTC
|
Maybe a formula can be used, something like this:
sigma(nodes in thread){reputation/(days old) * some_weight}
It seems that it would not be very difficult to periodically update thread heads with this "popularity" index.
Update: hmm. I guess I did suggest changing the site itself. My bad.
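One possible reading of the formula above, sketched in Python. The `(reputation, days_old)` layout and the function name are assumptions for illustration, and clamping the age to `min_age_days` is an addition that is not in the formula; it only avoids dividing by zero for nodes posted today.

```python
def thread_popularity(nodes, weight=1.0, min_age_days=1.0):
    """Sum reputation / age over every node in a thread.

    `nodes` is a list of (reputation, days_old) pairs. Clamping the age
    to `min_age_days` is an added assumption to avoid division by zero
    for brand-new nodes.
    """
    return sum(
        rep / max(days_old, min_age_days) * weight
        for rep, days_old in nodes
    )

# Hypothetical thread: 40/10 + 12/4 + 5/1
thread = [(40, 10), (12, 4), (5, 0)]
index = thread_popularity(thread)
```

Because age sits in the denominator, a thread's index decays as it gets older unless its nodes keep gaining reputation.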
| [reply] |
|
|
I think that allolex was suggesting something beyond single threads. You would either have to broadly categorize threads manually, or else use some kind of automated criteria (e.g. keywords).
For instance, one might find that, over the last year all threads dealing with CGI forms averaged X. To obtain X you could use your suggested calculation averaged across the number of threads, and perhaps weighted a little towards topics with more threads. I don't know, I'm out of my league on the math end of it.
The sticky bit is trying to categorize threads without human intervention.
</ajdelore>
| [reply] |
Re: Tracking popularity of Perl discussion topics
by kleucht (Beadle) on Sep 09, 2003 at 00:36 UTC
|
| [reply] |
|
|
IMHO, if your primary interest is in skimming the cream of the crop, so to speak, you should be looking in the following places (and, in fact, probably don't need Yet Another Tool in your kit).
- Daily Best. The Daily Best nodelet can be enabled from the user preferences section on your home node. These are the nodes with the highest reputations for the day. Perhaps reputation is a more worthy indicator of merit than popularity.
- Weekly best. The Weekly Best nodelet can be enabled through the user preferences section on your home node. These are the nodes that have become of highest repute for the week. You'll definitely find some choice topics here.
- Update: The Best Nodes page gives a summary of Daily best, Weekly best, and Alltime Best nodes. Thanks Jeffa for the reminder.
- Front-paged nodes. These are the ones that show up when you first enter the Monastery, or when you click on The Monastery Gates section. Front-paged nodes were front-paged by those who have obtained a certain level of positive experience in the Monastery, and generally are nodes that ask good questions and give good answers.
- The topic section that most interests you today.
- The Q&A, and the Tutorials sections.
With those starting points you will find the best of the best posts, without having to sift very long. Popularity would only tell you how many people clicked on a node. Reputation will tell you roughly what people thought of it.
Dave
"If I had my life to do over again, I'd be a plumber." -- Albert Einstein
| [reply] |
Re: Tracking popularity of Perl discussion topics
by halley (Prior) on Sep 09, 2003 at 13:16 UTC
|
Take a list of the N most common words in the English language. This is your "useless words" list. Also create an empty "useless phrases" list.
Take the text of a given node. Count occurrences of words that are not useless. Also count occurrences of any consecutive runs of not-useless words, such as "and password authentication is".
Do this for a few hundred scattered nodes, and add some of the less useful words and phrases found to the useless lists. This step is optional but you'll weed a lot of chaff early if you do it.
# this node minus chaff
7 : words
6 : useless
4 : nodes
4 : phrases
3 : regard
3 : replies
3 : reputation
3 : useful
...
1 : are highly regarded
1 : password authentication
1 : useful words
1 : phrases found depending
1 : empty useless phrases
...
Define what "regard" is: XP/reputation? Number of replies? Replies by saints? Front-paged? It's up to you to decide what is important.
Now you can start automating. Read nodes. Assign incremental regard to the most prevalent useful words and phrases found, depending on reputation or number of replies. Associate useful phrases with sets of nodes that are highly regarded.
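The steps above might be sketched like this, in Python for illustration. The tiny `USELESS_WORDS` set, the function names, and the `top_n` cutoff are all assumptions of this example, and "regard" is simply taken to be the node's reputation. Punctuation is ignored, so runs of useful words can cross sentence boundaries.

```python
import re
from collections import Counter

# A stand-in for "the N most common words in English" (assumed list).
USELESS_WORDS = {"the", "a", "an", "and", "is", "of", "to", "in",
                 "it", "for", "that", "on"}

def useful_terms(text, useless=USELESS_WORDS):
    """Count useful words and runs of two or more consecutive useful words."""
    counts = Counter()
    run = []
    # A trailing None flushes the final run.
    for word in re.findall(r"[a-z]+", text.lower()) + [None]:
        if word is not None and word not in useless:
            counts[word] += 1
            run.append(word)
            continue
        # A useless word (or the end of the text) closes the current run.
        if len(run) > 1:
            counts[" ".join(run)] += 1
        run = []
    return counts

def regard_by_term(nodes, top_n=5):
    """Add each node's regard to its `top_n` most frequent terms."""
    regard = Counter()
    for reputation, text in nodes:
        for term, _ in useful_terms(text).most_common(top_n):
            regard[term] += reputation
    return regard

# Hypothetical (reputation, text) pairs.
nodes = [
    (30, "Password authentication is the password problem"),
    (10, "The password is weak"),
]
regard = regard_by_term(nodes)
```

Sorting `regard` by value then gives the words and phrases most associated with highly regarded nodes.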
-- [ e d @ h a l l e y . c c ] | [reply] [d/l] |
|
|
Thanks a lot for your outline. It looks like a quick and efficient way of categorizing threads. I would like to add some linguistic knowledge to your algorithm, though. :)
The more I think about how exactly I would do an implementation, the less I like the idea of just knocking off the most common words, since they can be important in combination with other words. If we want to have big endian, little endian, and the topic of dealing with big or large files, we have to reorganize the ... I think that by getting rid of what linguists call "functional categories" such as determiners (the, a, these) entirely and leaving quantifiers (some, many, all) up to the search engine, we might be able to retain those common words that do play some role in defining a topic. I think lexical categories like nouns, verbs, adjectives/adverbs (but not prepositions) are the way to go.
I think what you call "regard" here should be (XP_MAX_REPLY + XP_MIN_REPLY) * 0.5, but I'd have to examine how XP is really distributed across nodes in a thread before going further. Plus, this little formula doesn't add or subtract XP significance according to where the node is nested. (I think it would be a good idea to count replies to replies, maybe down to the third level of nesting. After that, the topic value tends to be either too specific to just a couple of the posters' personal usage, or simply irrelevant to the original question.)
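As a minimal sketch of that formula, in Python for illustration: the `(depth, xp)` layout and the function name are assumptions, and the depth handling is only a hard cutoff at the third level. It does not weight XP by nesting position.

```python
def thread_regard(replies, max_depth=3):
    """Midrange of reply XP, ignoring replies nested deeper than max_depth.

    `replies` is a list of (depth, xp) pairs, where depth 1 is a direct
    reply to the root node; this layout is an assumption of the sketch.
    """
    xp_values = [xp for depth, xp in replies if depth <= max_depth]
    if not xp_values:
        return 0.0
    # (XP_MAX_REPLY + XP_MIN_REPLY) * 0.5
    return (max(xp_values) + min(xp_values)) * 0.5

# Hypothetical thread; the depth-4 reply is ignored.
replies = [(1, 30), (2, 12), (3, 4), (4, 90)]
regard = thread_regard(replies)
```

Since the midrange depends only on the two extreme values, a single outlier reply can move it a lot, which is one reason to look at the real XP distribution first.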
--
Allolex
| [reply] |
Re: Tracking popularity of Perl discussion topics
by calin (Deacon) on Sep 09, 2003 at 18:21 UTC
|
It happened that I did post my first node here at the Monastery in the
same thread, and the exceptionally high rep my post got (and other
replies too) made me think and sleep over it, and I believe I've
come to some explanation.
Yes, the topic played its part, but I think it's not entirely
responsible for the high reps. For that, all the stars had to be
properly aligned. Monks may not be aware of the fact that
the story was not only frontpaged, but it stayed on top of the
front page for more than one day! With the favourable
time, space and topic combined, the
story had sizable exposure (best_seat x long_time x lure_factor).
When monks see that a question has many replies, they infer that
there's lively discussion going on, and they click. There were many
replies in the beginning (there was no definitive solution, some
proposed solutions were flawed, the possibility of DoS attacks was
pointed out etc.) This, together with the top exposure created a
positive feedback environment that brought in more visits and more
replies (and more votes).
I could elaborate on this but I'm afraid of giving birth to Yet
Another Crackpot Theory (YACT) (tm).
| [reply] [d/l] |