in reply to Re: Interventionist Unicode Behaviors
in thread Interventionist Unicode Behaviors
Thanks for the kind words and the constructive criticism. My original goal was to make KinoSearch polymorphic, so that you could use whatever encoding you wanted. This turned out to be a mistake, and now I'm standardizing on UTF-8. I will add to the documentation that KinoSearch expects either UTF-8 or Latin-1, and that any input not flagged as UTF-8 will be assumed to be Latin-1 and converted accordingly.
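To make that input policy concrete, here is a minimal sketch in Python rather than Perl, purely as an illustration. It is not KinoSearch code: `to_utf8` is a hypothetical helper, and Python's `str` versus `bytes` distinction stands in for "flagged as UTF-8" versus "not flagged".

```python
def to_utf8(value):
    """Return UTF-8 encoded bytes for the given input.

    Hypothetical helper, not part of KinoSearch.
    - str stands in for data already flagged as UTF-8 (decoded text).
    - bytes stand in for unflagged data and are ASSUMED to be Latin-1.
    """
    if isinstance(value, str):
        text = value
    elif isinstance(value, (bytes, bytearray)):
        # Latin-1 maps every byte to a code point, so this never fails.
        text = bytes(value).decode("latin-1")
    else:
        raise TypeError("expected str or bytes")
    # Output is always UTF-8, whatever came in.
    return text.encode("utf-8")


print(to_utf8("café"))           # text input
print(to_utf8(b"caf\xe9"))       # Latin-1 bytes, same result as above
print(to_utf8(b"caf\xc3\xa9"))   # unflagged UTF-8 bytes get double-encoded
```

The last call shows the cost of the assumption: bytes that are already UTF-8 but arrive unflagged are read as Latin-1 and come out double-encoded, which is why the expectation needs to be spelled out in the docs.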
By far the simplest and best solution from my standpoint is to always output UTF-8. KinoSearch is large (80 modules) and there are enough points of egress that it would be best if I didn't have to double up code at each of them. That's what I've pretty much settled on. However, a user who wrote up a very nice bug report for me, including a test (!), was confused and concerned by the fact that the version which solved his problem also happened to issue a "Wide character in print" warning. Ergo, this thread.
The 12 languages are listed in the docs for KinoSearch::Analysis::PolyAnalyzer. You're right that it could be plainer what they are, and I will add either the list verbatim or a direct link to it on the main KinoSearch documentation page. With regard to "support" for a language, that means a stemmer and a stoplist are available, and the regex-based tokenizer works fine. Throwing an occasional Polish document into an English collection won't cause any problems, just garbage that you'll only see in search results when your luck has gone weird. Finnish is on the list of supported languages, and it isn't Indo-European, so it looks like I will need to modify the "Indo-European" tag. :)
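For anyone wondering what "support" amounts to in practice, here is a toy sketch, again in Python and again not KinoSearch's actual code. The stoplist, the suffix-stripping `toy_stem`, and `analyze` are all made up for illustration; a real stemmer is considerably smarter.

```python
import re

# Hypothetical, tiny English stoplist for illustration only.
STOPLIST = {"the", "a", "of", "and"}


def toy_stem(token):
    """Strip a few English suffixes. A crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, tokenize with a regex, drop stopwords, then stem."""
    tokens = re.findall(r"\w+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOPLIST]


print(analyze("The indexing of documents"))
print(analyze("Zażółć gęślą jaźń"))
```

The English input is reduced to stemmed terms, while the Polish input passes through the English stoplist and stemmer untouched. Nothing breaks; the terms simply go into the index unstemmed, which is the harmless "garbage" I mean.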