Today I posted something to a thread ("Re: Password hacker killer") and found that the average positive moderation on almost all the nodes was quite high--extraordinarily high. It looks like the topic of security and password authentication is very popular. As usually happens, this thought led to another, and then to another...

The question I ended up with is this: How could someone go about tracking the popularity (and perhaps activity) of particular topics posted at the Monastery? I would like to stop short of actually making changes to the site itself, but am more interested in ideas along the lines of data harvesting. I have my own ideas, but I would like to hear how others would approach the problem.

Looking forward to reading your thoughts.

--
Allolex

Yes, the pun was intentional. Sorry.


Replies are listed 'Best First'.
Re: Tracking popularity of Perl discussion topics
by hossman (Prior) on Sep 08, 2003 at 22:20 UTC

    The root of your problem would be in identifying what the topics of a given thread/node are.

    Assuming you don't want to hire a team of helper monkeys to classify every node into a taxonomy, you would need to look into some of the many attempts at classifying documents by programmatic analysis. This comes up about once a year on Slashdot in the form of a "has anyone found a good way to categorize all of your email?" question ... there seem to be some decent algorithms out there for doing classification, but many of them require a predefined list of categories with sample sets.

    I've heard of systems that can find common topics among large quantities of text, but I've never really looked into it in depth.

      I have given this one some thought, considering that it is the linguistic aspect of this whole idea. (You know me...)

      How about creating an ontology based on the 'core' vocabulary of the highest-rated nodes of a hand-marked thread? For example, we have a question or meditation that we mark as 'security', 'password', 'login'. Then we take the nodes from the median score upward (because they are most likely to be relevant to the topic) and extract their vocabulary, storing it in a keyword list (with verbs, nouns, adjectives) that represents the junction of the topics mentioned above. That would be a quick and dirty way of defining what members belong to a topic category.
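
      A minimal sketch of that keyword extraction, in Python purely for illustration. The `(reputation, text)` node shape and the `core_vocabulary` name are assumptions, and a simple word-length filter stands in for real part-of-speech filtering:

```python
import re
from statistics import median

def core_vocabulary(nodes, min_length=3):
    """Collect words from nodes scoring at or above the thread's median reputation.

    `nodes` is a list of (reputation, text) pairs; this shape is an assumption.
    """
    cutoff = median(rep for rep, _ in nodes)
    vocabulary = set()
    for rep, text in nodes:
        if rep >= cutoff:
            words = re.findall(r"[a-z']+", text.lower())
            # Length filter is a crude stand-in for keeping nouns/verbs/adjectives.
            vocabulary.update(w for w in words if len(w) >= min_length)
    return vocabulary

# Hypothetical thread hand-marked as 'security', 'password', 'login'.
thread = [
    (40, "Salt the password hash before storing it"),
    (12, "Use a login throttle"),
    (3, "Me too"),
]
keywords = core_vocabulary(thread)
```

      Here the low-scoring "Me too" node falls below the median and contributes nothing to the keyword list.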

      These could be split up later and put into XML topic maps which, by virtue of their structure, would allow topic clustering on a much larger scale.

      --
      Allolex

Re: Tracking popularity of Perl discussion topics
by dmitri (Priest) on Sep 08, 2003 at 22:33 UTC
    Programmatically going through all perlmonks.org threads, classifying them, and storing them in a table is one thing, but doing so outside the site itself will be of marginal use relative to the effort.

    Consider these issues:

    • Threads' popularity changes. Will you regenerate your table every day? week? month?
    • Will you be the only one using the table? If you publish it on-line with links to threads, your page in effect becomes part of perlmonks.org.
    • Is it a good idea to put more load on an already very loaded server? We're nearing 300K nodes.

    Maybe a proposal about an extra thread indexing scheme could be made to the site maintainers. If the proposal finds enough support, the change might be implemented.

    Myself, I am at least curious about what the top-rated threads are.

      You bring up some very interesting problems and a good suggestion.

      At the risk of stating the obvious, any solution creating more than a little server load would have to be tossed out. That means an external server and cooperation with the Monastery developers.

      The popularity changes issue is a feature. It is possible to give exactly those statistics you mentioned, daily (within 24 hours of posting), weekly, and monthly. I would be interested to know when the average falloff for nodes is, i.e. when the votes stop trickling in and slow to a drip. You could then generate something like "these are the thread topics that keep coming up" or "these topics were mentioned once and never again", along with the popularity of the threads.

      I think the setup would have to be a bit like jcwren's Perl Monks Statistics site, a part of the Monastery. I don't see where metainformation about the site could be anything other than an extension of it. :)

      --
      Allolex

Re: Tracking popularity of Perl discussion topics
by valdez (Monsignor) on Sep 09, 2003 at 01:43 UTC
Re: Tracking popularity of Perl discussion topics
by dmitri (Priest) on Sep 08, 2003 at 22:06 UTC
    Maybe a formula can be used, something like this:

    
    sigma(nodes in thread){reputation/(days old) * some_weight}
    
    

    It seems that it would not be very difficult to periodically update thread heads with this "popularity" index.
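
    As a rough sketch of the formula above, in Python for illustration only: the `(reputation, posted_on)` node shape, the `thread_popularity` name, and the one-day floor on age are all assumptions, not anything the site provides.

```python
from datetime import date

def thread_popularity(nodes, today, weight=1.0):
    """Sum reputation / age-in-days * weight over the nodes in a thread.

    `nodes` is a list of (reputation, posted_on) pairs (an assumed shape).
    Age is floored at one day so same-day nodes do not divide by zero.
    """
    total = 0.0
    for reputation, posted_on in nodes:
        days_old = max((today - posted_on).days, 1)
        total += reputation / days_old * weight
    return total

# Hypothetical thread: one node two days old, one posted today.
nodes = [(30, date(2003, 9, 6)), (10, date(2003, 9, 8))]
score = thread_popularity(nodes, today=date(2003, 9, 8))
```

    Dividing by age means an old, well-regarded node counts for less than a fresh one with the same reputation, so the index tracks current rather than all-time popularity.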

    Update: hmm. I guess I did suggest changing the site itself. My bad.

      I think that allolex was suggesting something beyond single threads. You would either have to broadly categorize threads manually, or else use some kind of automated criteria (e.g. keywords).

      For instance, one might find that, over the last year all threads dealing with CGI forms averaged X. To obtain X you could use your suggested calculation averaged across the number of threads, and perhaps weighted a little towards topics with more threads. I don't know, I'm out of my league on the math end of it.
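
      One way that averaging might look, sketched in Python. The logarithmic size bonus is only one guess at "weighted a little towards topics with more threads", and `category_score` and `size_bonus` are made-up names:

```python
import math

def category_score(thread_scores, size_bonus=0.1):
    """Average per-thread scores, nudged upward for categories with more threads."""
    if not thread_scores:
        return 0.0
    mean = sum(thread_scores) / len(thread_scores)
    # log(1) == 0, so a single-thread category gets no bonus at all.
    return mean * (1 + size_bonus * math.log(len(thread_scores)))

# Two hypothetical categories with the same mean but different thread counts.
cgi_forms = category_score([25.0, 10.0, 4.0])
single = category_score([13.0])
```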

      The sticky bit is trying to categorize threads without human intervention.

      </ajdelore>

Re: Tracking popularity of Perl discussion topics
by kleucht (Beadle) on Sep 09, 2003 at 00:36 UTC
    Being kind of a newbie, I don't actually have any technical suggestions about data harvesting. I just wanted to let you know that I do have some interest in knowing which Nodes are the most popular. This would allow me to theoretically increase my efficiency while perusing the perlmonks website. Of course it won't always keep me from wasting my time, but it might occasionally help me to stay abreast of some of the more popular topics and such.

    Can't wait to see it implemented!!!
    :-)

      IMHO, if your primary interest is in skimming the cream of the crop, so to speak, you should be looking in the following places (and, in fact, probably don't need Yet Another Tool in your kit).

      • Daily Best. The Daily Best nodelet can be enabled from the user preferences section on your home node. These are the nodes with the highest reputations for the day. Perhaps reputation is a more worthy indicator of merit than popularity.

      • Weekly Best. The Weekly Best nodelet can be enabled through the user preferences section on your home node. These are the nodes that have become of highest repute for the week. You'll definitely find some choice topics here.

      • Update: The Best Nodes page gives a summary of Daily best, Weekly best, and Alltime Best nodes. Thanks Jeffa for the reminder.

      • Front-paged nodes. These are the ones that show up when you first enter the Monastery, or when you click on The Monastery Gates section. Front-page nodes were front-paged by those who have obtained a certain level of positive experience in the monastery, and generally are nodes that ask good questions and give good answers.

      • The topic section that most interests you today.

      • The Q&A, and the Tutorials sections.

      With those starting points you will find the best of the best posts, without having to sift very long. Popularity would only tell you how many people clicked on a node. Reputation will tell you roughly what people thought of it.

      Dave

      "If I had my life to do over again, I'd be a plumber." -- Albert Einstein

Re: Tracking popularity of Perl discussion topics
by halley (Prior) on Sep 09, 2003 at 13:16 UTC
    Take a list of the N most common words in the English language. This is your "useless words" list. Also create an empty "useless phrases" list.

    Take the text of a given node. Count occurrences of words that are not useless. Also count occurrences of any consecutive runs of not-useless words, such as "password authentication" within "and password authentication is".

    Do this for a few hundred scattered nodes, and add some of the less useful words and phrases found to the useless lists. This step is optional but you'll weed a lot of chaff early if you do it.

    # this node minus chaff
    7 : words
    6 : useless
    4 : nodes
    4 : phrases
    3 : regard
    3 : replies
    3 : reputation
    3 : useful
    ...
    1 : are highly regarded
    1 : password authentication
    1 : useful words
    1 : phrases found depending
    1 : empty useless phrases
    ...
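
    The counting step could be sketched like this (Python, for illustration). The tiny `USELESS_WORDS` set is a stand-in for a real list of the N most common English words, the "useless phrases" list is left out, and `count_terms` and `max_phrase` are made-up names:

```python
import re
from collections import Counter

USELESS_WORDS = {"the", "a", "and", "is", "of", "to", "in", "that", "it"}  # stand-in list

def count_terms(text, useless=USELESS_WORDS, max_phrase=3):
    """Count useful words, plus runs of 2..max_phrase consecutive useful words."""
    words = re.findall(r"[a-z']+", text.lower())
    word_counts, phrase_counts = Counter(), Counter()
    run = []  # current run of consecutive useful words
    for word in words + [None]:  # None flushes the final run
        if word is not None and word not in useless:
            word_counts[word] += 1
            run.append(word)
            continue
        # A useless word (or end of text) ends the run; count its sub-phrases.
        for size in range(2, max_phrase + 1):
            for start in range(len(run) - size + 1):
                phrase_counts[" ".join(run[start:start + size])] += 1
        run = []
    return word_counts, phrase_counts

words, phrases = count_terms("The topic of security and password authentication is popular")
```

    Because punctuation is ignored, a run can cross a sentence boundary; a fuller version would probably break runs there as well.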

    Define what "regard" is: XP/reputation? Number of replies? Replies by saints? Front-paged? It's up to you to decide what is important.

    Now you can start automating. Read nodes. Assign incremental regard to the most prevalent useful words and phrases found, depending on reputation or number of replies. Associate useful phrases with sets of nodes that are highly regarded.
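
    A small sketch of that automated pass, again in Python. The `(regard, term_counts)` node shape is an assumption, and "most prevalent" is taken here to mean the `top_n` highest-count terms in each node:

```python
from collections import defaultdict

def accumulate_regard(nodes, top_n=3):
    """Credit each node's regard to its most prevalent useful terms.

    `nodes` is a list of (regard, term_counts) pairs, where term_counts maps
    a useful word or phrase to its count in that node (an assumed shape).
    """
    regard_by_term = defaultdict(float)
    nodes_by_term = defaultdict(set)
    for node_id, (regard, term_counts) in enumerate(nodes):
        top_terms = sorted(term_counts, key=term_counts.get, reverse=True)[:top_n]
        for term in top_terms:
            regard_by_term[term] += regard
            nodes_by_term[term].add(node_id)
    return regard_by_term, nodes_by_term

# Hypothetical nodes: regard could be reputation, reply count, or anything else.
nodes = [
    (40, {"password": 5, "hash": 3, "salt": 1}),
    (10, {"password": 2, "login": 2}),
]
regard, where = accumulate_regard(nodes, top_n=2)
```

    Terms that collect a lot of regard across many nodes are the candidates for topic labels, and `nodes_by_term` gives the set of nodes associated with each one.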

    --
    [ e d @ h a l l e y . c c ]

      Thanks a lot for your outline. It looks like a quick and efficient way of categorizing threads. I would like to add some linguistic knowledge to your algorithm, though. :)

      The more I think about how exactly I would do an implementation, the less I like the idea of just knocking off the most common words, since they can be important in combination with other words. If we want to have big endian, little endian, and the topic of dealing with big or large files, we have to reorganize the ... I think that maybe by getting rid of what linguists call "functional categories" such as determiners (the, a, these) entirely and leaving quantifiers (some, many, all) up to the search engine, we might be able to retain those common words that do play some role in defining a topic. I think lexical categories like nouns, verbs, adjectives/adverbs (but not prepositions) are the way to go.

      I think what you call "regard" here should be (XP_MAX_REPLY + XP_MIN_REPLY) * 0.5, but I'd have to examine how XP is really distributed across nodes in a thread before going further. Plus, this little formula doesn't add or subtract XP significance according to where the node is nested. (I think it would be a good idea to count replies to replies, maybe down to the third level of nesting. After that, the topic value tends to be either too specific to just a couple of the posters' personal usage, or simply irrelevant to the original question.)
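
      For concreteness, here is that midpoint with the three-level cutoff, sketched in Python. The `(depth, xp)` reply shape and the `thread_regard` name are assumptions, and nesting depth still carries no weight of its own:

```python
def thread_regard(replies, max_depth=3):
    """Midpoint of the highest and lowest reply XP, ignoring deep replies.

    `replies` is a list of (depth, xp) pairs, with depth 1 for direct replies
    to the root node (an assumed shape).
    """
    xps = [xp for depth, xp in replies if depth <= max_depth]
    if not xps:
        return 0.0
    return (max(xps) + min(xps)) * 0.5

# Hypothetical thread; the depth-4 reply is dropped before taking the midpoint.
replies = [(1, 40), (2, 12), (3, 6), (4, -2)]
regard = thread_regard(replies)
```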

      --
      Allolex

Re: Tracking popularity of Perl discussion topics
by calin (Deacon) on Sep 09, 2003 at 18:21 UTC
    It happened that I did post my first node here at the Monastery in the same thread, and the exceptionally high rep my post got (and other replies too) made me think and sleep over it, and I believe I've come to some explanation.

    Yes, the topic played its part, but I think it's not entirely responsible for the high reps. For that, all the stars had to be properly aligned. Monks may not be aware of the fact that the story was not only frontpaged, but it stayed on top of the front page for more than one day! With the favourable time, space and topic combined, the story had sizable exposure (best_seat x long_time x lure_factor).

    When monks see that a question has many replies, they infer that there's lively discussion going on, and they click. There were many replies in the beginning (there was no definitive solution, some proposed solutions were flawed, the possibility of DoS attacks was pointed out, etc.). This, together with the top exposure, created a positive feedback environment that brought in more visits and more replies (and more votes).

    I could elaborate on this but I'm afraid of giving birth to Yet Another Crackpot Theory (YACT) (tm).