in reply to Similar text search

Thanks for the responses. Since posting, I have tried String::Similarity and Text::Similarity. T::S doesn't work for me: it always gives me "0" as the result (and btw, I had to patch it first to grok non-files, as the module originally only accepts file names as arguments).

As for S::S, it's very slow (5-10 articles per second on my computer), and I don't think it's the right approach anyway. It's a generic method for comparing two strings, not text. For better results we would need an algorithm that is language-aware (e.g. one that works at the word or sentence level rather than on characters, and that can filter stopwords, do stemming, weight words according to usage frequency, etc).
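
To make the word-level idea concrete, here is a rough sketch of one common technique of that kind: TF-IDF weighting with cosine similarity. It is written in Python purely for illustration, not as a replacement for either module. The stopword list, the function names and the sample articles are all made up for the example, and stemming is left out.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real one would be far longer).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "for", "on", "with"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def build_idf(docs):
    """Inverse document frequency: words found in fewer documents weigh more."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}


def vectorize(text, idf):
    """Map each word to (count in text) * (its IDF weight); unknown words get 0."""
    tf = Counter(tokenize(text))
    return {w: count * idf.get(w, 0.0) for w, count in tf.items()}


def cosine(a, b):
    """Cosine similarity between two sparse word-weight vectors."""
    dot = sum(weight * b.get(w, 0.0) for w, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return (score, index) pairs for docs, most similar to the query first."""
    idf = build_idf(docs)
    q = vectorize(query, idf)
    scores = [(cosine(q, vectorize(d, idf)), i) for i, d in enumerate(docs)]
    return sorted(scores, reverse=True)


# Hypothetical knowledge base articles and client question.
articles = [
    "How to reset your email password",
    "Configuring the office printer",
    "Email client setup guide",
]
ranking = rank("I forgot my password for email", articles)
```

In a real setup the IDF table and the article vectors would be computed once and cached, so each incoming question only costs one tokenization plus a pass over sparse vectors. That should be much cheaper than a character-level comparison against every article, though I haven't measured it.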

As for MediaWiki, I do plan to access its MySQL database directly instead of going through its API. There's no need to do the rendering myself, as I only need to give clients URLs to the relevant knowledge base articles.
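
For the URL part, something along these lines may be enough (again a Python sketch for illustration only). It assumes the stock MediaWiki schema, where titles sit in the `page_title` column of the `page` table with underscores in place of spaces and namespace 0 is the main article namespace. The base URL, the helper name and the absence of a table prefix are assumptions; the query is only shown as a string and is not executed here.

```python
from urllib.parse import urlencode

# Assumed query against the stock MediaWiki schema (no table prefix).
TITLE_QUERY = "SELECT page_title FROM page WHERE page_namespace = 0"


def article_url(script_url, page_title):
    """Build an index.php?title=... link from a page_title value.

    script_url is the wiki's index.php address (hypothetical in this example).
    page_title may arrive as bytes, since the column is a binary type.
    """
    if isinstance(page_title, bytes):
        page_title = page_title.decode("utf-8")
    # MediaWiki stores and links titles with underscores instead of spaces.
    title = page_title.replace(" ", "_")
    return script_url + "?" + urlencode({"title": title})


url = article_url("https://kb.example.com/index.php", b"Reset_email_password")
```

The `index.php?title=` form is used here because it doesn't depend on how the wiki's short URLs are configured.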