Similar text search

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: Similar text search by moritz (Cardinal) on Apr 03, 2008 at 12:16 UTC
I know of no module that does this search, but you could try to do it yourself. The first step should be to strip stop words from the input (it might be enough to use the subject of the new ticket as input; you'll have to try that). Lingua::StopWords might help you. Then you have to search the database. You can use a fulltext index on the columns where the title and content of the wiki are stored. Maybe it's easier to let KinoSearch do the work for you. It's quite fast, and it does stemming automatically for you. (This might not be well suited if your wiki pages change very often, but if they're fairly static it shouldn't be a problem to keep KinoSearch's index up to date.)
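The two steps above (strip stop words, then look the remaining keywords up in an index) can be sketched as follows. This is only an illustration in Python, not KinoSearch or a MySQL fulltext index: the stop word list is a tiny made-up sample (Lingua::StopWords ships a much longer one), the in-memory index stands in for a real one, and the page titles and ticket subject are hypothetical.

```python
import re
from collections import defaultdict

# Tiny illustrative stop word list (assumption); a real list is much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "in", "on", "of", "to", "my",
              "i", "it", "not", "for", "with", "cannot", "how", "do"}

def keywords(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def build_index(pages):
    """Map each keyword to the set of page titles that contain it."""
    index = defaultdict(set)
    for title, body in pages.items():
        for word in keywords(title + " " + body):
            index[word].add(title)
    return index

def search(index, query, limit=3):
    """Rank pages by how many distinct query keywords they contain."""
    hits = defaultdict(int)
    for word in set(keywords(query)):
        for title in index.get(word, ()):
            hits[title] += 1
    # Highest hit count first; ties are broken alphabetically by title.
    ranked = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
    return [title for title, _ in ranked[:limit]]

# Hypothetical wiki pages and ticket subject, for illustration only.
PAGES = {
    "Resetting your email password": "Steps to reset the password of an email account.",
    "Configuring DNS records": "How to add A and MX records for a domain.",
    "Email client setup": "Configure IMAP and SMTP in your email client.",
}
INDEX = build_index(PAGES)
print(search(INDEX, "I cannot reset my email password"))
```

Counting shared keywords is the crudest possible ranking; a real fulltext engine would also weight rare words more heavily and apply stemming.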
Re: Similar text search by derby (Abbot) on Apr 03, 2008 at 13:11 UTC
I'd have to agree with moritz: I know of no module that will do this for you, but KinoSearch may be a great place to start (I've done similar apps with KinoSearch's soulmate, Lucene). Also, I've been reading Collective Intelligence, which goes into detail about the algorithms for this type of app. All of the examples are in Python, but I have yet to come across one that I could not translate easily into Perl (and I know nothing about Python). -derby
Re: Similar text search by locked_user sundialsvc4 (Abbot) on Apr 03, 2008 at 13:53 UTC
My question would be ... does MediaWiki have any sort of API that could be used to leverage it? Sure, you could “replace” what you have now, but especially since “the staff likes it,” is it possible to make it work harder for you than it now does?
Re: Similar text search by wade (Pilgrim) on Apr 03, 2008 at 16:07 UTC
Well, since MediaWiki works over MySQL, couldn't you use DBI or mysql to do the search? Once you found the page, you could use LWP to render it. Just a thought. -- Wade
Re: Similar text search by planetscape (Chancellor) on Apr 06, 2008 at 06:06 UTC
Your title made me think of something I'd been investigating recently: code similarity analyzers (I first encountered that term here). You may also find the thread I cross-referenced to contain some interesting ideas. The subject of text mining inevitably comes up, and text mining and Fingerprinting text documents for approximate comparison may give you some useful ideas and/or resources. Recently, when I needed to do some keywording/summarizing, I hacked together a wee little script using (among other things): `Lingua::EN::Keywords` `Lingua::EN::Summarize` ... and it worked pretty well, actually. :-) HTH, planetscape
Re: Similar text search by leocharre (Priest) on Apr 03, 2008 at 21:58 UTC
There's this pretty fascinating module you may want to look at: String::Similarity. You could use it on the subject headers for the tickets.
Re: Similar text search by Anonymous Monk on Apr 04, 2008 at 00:52 UTC
Thanks for the responses. Since posting, I have tried String::Similarity and Text::Similarity. T::S doesn't work for me: it always gives me "0" as the result (and btw, I had to patch it first to grok non-files, as the module originally only accepts file names as arguments). As for S::S, it's very slow (5-10 articles per second on my computer). And I don't think it's the right approach. It's a generic method for comparing two strings, not text. We would need an algorithm that is language-aware for better results (e.g. one that works on a word or sentence level rather than on characters, can do stop word filtering, can do stemming, can weight words according to usage frequency, etc.). As for MediaWiki, I do plan to access its MySQL database directly instead of going through its API. There's no need to do the rendering myself, as I only need to give clients URLs to the relevant knowledge base articles.
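A word-level comparison along the lines described above could look like the sketch below. It is written in Python for illustration only and uses TF-IDF weighting with cosine similarity, which is one common technique and not something the thread settles on. The stop word list, the crude suffix-stripping `stem` function, and the sample articles and ticket are all assumptions; a real setup would use a proper stop word list and stemmer.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop word list (assumption).
STOP_WORDS = {"a", "an", "and", "the", "is", "in", "of", "to", "for"}

def stem(word):
    """Very crude suffix stripper (assumption); real stemmers do far more."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokens(text):
    """Split into lowercase words, drop stop words, stem the rest."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document."""
    token_lists = [tokens(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in token_lists for term in set(toks))
    vectors = []
    for toks in token_lists:
        tf = Counter(toks)
        # Smoothed IDF, so rare terms weigh more than common ones.
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                        for t, c in tf.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

def rank(query, docs):
    """Return (score, doc) pairs, most similar first."""
    # The query is included when computing IDF, then compared to each doc.
    vectors = tfidf_vectors(docs + [query])
    query_vec = vectors[-1]
    scored = [(cosine(query_vec, vec), doc) for vec, doc in zip(vectors, docs)]
    return sorted(scored, key=lambda pair: -pair[0])

# Hypothetical knowledge base titles and ticket subject.
ARTICLES = [
    "Resetting a forgotten email password",
    "Adding MX records to a domain",
    "Email passwords expire after ninety days",
]
TICKET = "Resetting forgotten passwords"

for score, title in rank(TICKET, ARTICLES):
    print(f"{score:.3f}  {title}")
```

Because the comparison works on stemmed words, "passwords" in the ticket still matches "password" in an article, which a character-level string comparison would not treat specially.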
Re: Similar text search by Anonymous Monk on Apr 04, 2008 at 03:34 UTC
I found "remembrance-agent" in one of Debian's packages. Seems to work great. I think I'm gonna use this for now instead of cooking up my own solution using Perl.