Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

We are writing a simple web-based ticket support system. After a client submits a question, we would like the system to first search the knowledge base for similar articles. If any are found, the client would be given a chance to cancel the ticket submission and read the articles instead.

Our knowledge base has around 2000 articles managed by MediaWiki. MediaWiki works fine for now and the staff like it, which is why we don't want to move the articles into a complete ticketing system with its own knowledge base.

Are there Perl modules that can search a collection of text for similar content, e.g. in a MySQL database?

Replies are listed 'Best First'.
Re: Similar text search
by moritz (Cardinal) on Apr 03, 2008 at 12:16 UTC
    I know of no module that does this search, but you could try to do it yourself.

    The first step should be to strip stop words from the input (it might be enough to use the subject of the new ticket as input, you'll have to try that). Lingua::StopWords might help you.

    Then you have to search the database. You can use a fulltext index on the columns where title and content of the wiki are stored.
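
    The two steps above (strip stop words, then query a fulltext index) can be sketched roughly as follows. This is only an illustration in Python, not a tested recipe: the stop-word list is a tiny made-up sample, and the table kb_articles with its title and body columns is a hypothetical name, not MediaWiki's actual schema. The %s placeholder assumes a MySQL driver that uses that parameter style.

```python
import re

# Tiny illustrative stop-word list (an assumption; a real list such as the
# one provided by Lingua::StopWords is much longer).
STOP_WORDS = {
    "a", "an", "and", "are", "can", "do", "does", "for", "how", "i",
    "in", "is", "it", "my", "not", "of", "on", "the", "to",
}


def strip_stop_words(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_fulltext_query(subject, limit=5):
    """Return (sql, params) for a MySQL fulltext search on the ticket subject.

    kb_articles, title and body are hypothetical names; a FULLTEXT index
    covering (title, body) is assumed to exist.
    """
    terms = strip_stop_words(subject)
    sql = (
        "SELECT title FROM kb_articles "
        "WHERE MATCH(title, body) AGAINST (%s IN NATURAL LANGUAGE MODE) "
        f"LIMIT {int(limit)}"
    )
    return sql, (" ".join(terms),)
```

    The returned pair would be handed to the database driver's execute call, so the search terms travel as a bound parameter rather than being pasted into the SQL string.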

    Maybe it's easier to let KinoSearch do the work for you. It's quite fast, and it does stemming automatically. (This might not be well suited if your wiki pages change very often, but if they're fairly static it shouldn't be a problem to keep KinoSearch's index up to date.)

Re: Similar text search
by derby (Abbot) on Apr 03, 2008 at 13:11 UTC

    I'd have to agree with moritz: I know of no module that will do this for you, but KinoSearch may be a great place to start (I've done similar apps with KinoSearch's soulmate Lucene). Also, I've been reading Collective Intelligence, which goes into detail about the algorithms for this type of app. All of the examples are in Python, but I have yet to come across one that I could not translate easily into Perl (and I know nothing about Python).

    -derby
Re: Similar text search
by locked_user sundialsvc4 (Abbot) on Apr 03, 2008 at 13:53 UTC

    My question would be ... does MediaWiki have any sort of API that could be used to leverage it?

    Sure, you could “replace” what you have now, but especially since “the staff likes it,” is it possible to make it work harder for you than it now does?

Re: Similar text search
by wade (Pilgrim) on Apr 03, 2008 at 16:07 UTC

    Well, since MediaWiki works over MySQL, couldn't you use DBI or mysql to do the search? Once you've found the page, you could use LWP to render it. Just a thought.

    --
    Wade
Re: Similar text search
by planetscape (Chancellor) on Apr 06, 2008 at 06:06 UTC
Re: Similar text search
by leocharre (Priest) on Apr 03, 2008 at 21:58 UTC

    There's this pretty fascinating module you may want to look at: String::Similarity. You could use it on the subject headers for the tickets.
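
    To show what comparing subject headers character by character looks like, here is a rough Python sketch. It uses difflib.SequenceMatcher from the standard library as a stand-in; that is not the algorithm String::Similarity implements, only a comparable score between 0 and 1. The function names and sample subjects are made up.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rank_subjects(new_subject, known_subjects):
    """Return (subject, score) pairs, most similar first."""
    scored = [(s, similarity(new_subject, s)) for s in known_subjects]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

    Every new subject has to be compared against every known one, so the cost grows with the size of the collection.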

Re: Similar text search
by Anonymous Monk on Apr 04, 2008 at 00:52 UTC

    Thanks for the responses. Since posting, I have tried String::Similarity and Text::Similarity. T::S doesn't work for me: it always gives me "0" as the result (and btw, I had to patch it first to grok non-files, as the module originally only accepts file names as arguments).

    As for S::S, it's very slow (5-10 articles per second on my computer), and I don't think it's the right approach. It's a generic method for comparing two strings, not text. We would need an algorithm that is language-aware for better results (e.g. one that works at the word or sentence level rather than on characters, filters stop words, does stemming, weights words according to usage frequency, etc.).
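
    One common way to get those properties is TF-IDF weighting with cosine similarity. The Python sketch below is only a minimal illustration under stated assumptions, not a tuned implementation: the stop-word list is a small sample, crude_stem is a made-up stand-in for a real stemmer, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Small illustrative stop-word list (an assumption, not a complete list).
STOP_WORDS = {
    "a", "an", "and", "are", "for", "how", "i", "in", "is", "it",
    "my", "of", "on", "the", "to", "your",
}


def crude_stem(word):
    """Stand-in for a real stemmer: strip a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Split into lowercase words, drop stop words, stem the rest."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def weigh(tokens, idf):
    """Turn tokens into a sparse TF-IDF vector; unknown terms are ignored."""
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def build_index(articles):
    """articles maps title -> text. Returns (idf, vectors)."""
    tokens = {title: tokenize(text) for title, text in articles.items()}
    doc_freq = Counter()
    for toks in tokens.values():
        doc_freq.update(set(toks))
    n = len(articles)
    # Smoothed IDF: terms that appear in fewer articles weigh more.
    idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = {title: weigh(toks, idf) for title, toks in tokens.items()}
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(ticket_text, idf, vectors, top_n=3):
    """Return up to top_n (title, score) pairs with a non-zero score."""
    query = weigh(tokenize(ticket_text), idf)
    scored = [(cosine(query, vec), title) for title, vec in vectors.items()]
    scored = sorted((pair for pair in scored if pair[0] > 0), reverse=True)
    return [(title, score) for score, title in scored[:top_n]]
```

    The index is built once from the articles, and each new ticket only needs to be tokenized and scored against the stored vectors.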

    As for MediaWiki, I do plan to access its MySQL database directly instead of going through its API. There's no need to do the rendering myself, as I only need to give clients the URLs of the relevant knowledge base articles.
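
    Turning a page title from the database into a URL could look roughly like this Python sketch. The base URL is a made-up placeholder, and the index.php?title= form is assumed to be enabled on the wiki. MediaWiki stores titles with underscores in place of spaces, so the function normalizes that first.

```python
from urllib.parse import quote


def article_url(page_title, base_url="https://kb.example.com/index.php"):
    """Build an article URL from a page title.

    base_url is a hypothetical placeholder for the wiki's own address.
    """
    title = page_title.replace(" ", "_")
    # Percent-encode everything that is not safe inside a query value.
    return f"{base_url}?title={quote(title, safe='')}"
```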

Re: Similar text search
by Anonymous Monk on Apr 04, 2008 at 03:34 UTC

    I found "remembrance-agent" in one of Debian's packages. Seems to work great. I think I'm gonna use this for now instead of cooking up my own solution using Perl.