Thanks for the kind words and the constructive criticism. My original goal was to make KinoSearch polymorphic, so that you could use whatever encoding you wanted. This turned out to be a mistake, and now I'm centering on UTF-8. I will add to the documentation that KinoSearch expects either UTF-8 or Latin-1, and that anything coming in that isn't flagged as UTF-8 will be assumed to be Latin-1 and translated to UTF-8.
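To make the "unflagged means Latin-1" rule concrete, here is a minimal sketch using only core Perl and Encode. The helper name to_utf8_octets is made up for illustration; it is not part of KinoSearch and is not necessarily how KinoSearch does the translation internally.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper, NOT KinoSearch's actual code: return UTF-8 octets
# for a scalar, reading any string without the UTF-8 flag as Latin-1.
sub to_utf8_octets {
    my ($text) = @_;
    # utf8::upgrade switches the internal representation to UTF-8 and sets
    # the flag, interpreting unflagged bytes as Latin-1 characters. It is a
    # no-op for strings that are already flagged.
    utf8::upgrade($text);
    return encode('UTF-8', $text);
}

my $latin1 = "caf\xE9";                  # 4 bytes, no UTF-8 flag
my $octets = to_utf8_octets($latin1);    # e-acute becomes the two bytes C3 A9
printf "%d bytes in, %d bytes out\n", length($latin1), length($octets);
# prints "4 bytes in, 5 bytes out"
```

The flip side of this rule: if you hand over raw UTF-8 bytes that were never decoded (and so are not flagged), they would be read as Latin-1 and come out double-encoded. Running such data through Encode::decode('UTF-8', ...) first avoids that.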

By far the simplest and best solution from my standpoint is to always output UTF-8. KinoSearch is large (80 modules) and there are enough points of egress that it would be best if I didn't have to double up code at each of them. That's what I've pretty much settled on. However, a user who wrote up a very nice bug report for me, including a test (!), was confused and concerned by the fact that the version which solved his problem also happened to issue a "Wide character in print" warning. Ergo, this thread.
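For anyone who hits the same warning, here is a sketch of the usual fix on the caller's side, assuming the output is going to STDOUT and that the terminal or file should receive UTF-8. The variable $hit is just a stand-in for a string that came back from a search.

```perl
use strict;
use warnings;

# Stand-in for text returned from a search; U+2603 is above 0xFF, so
# printing it to a handle with no encoding layer would warn
# "Wide character in print".
my $hit = "Sm\x{F8}rrebr\x{F8}d \x{2603}";

# Declare once that STDOUT takes characters and emits UTF-8 octets.
binmode(STDOUT, ':encoding(UTF-8)');
print "$hit\n";
```

Call binmode only once per handle; each additional call pushes another encoding layer and the output ends up encoded twice.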

The 12 languages are listed in the docs for KinoSearch::Analysis::PolyAnalyzer. You're right that it could be plainer what they are, and I will either add the list verbatim to the main KinoSearch documentation page or link to it directly from there. With regard to "support" for a language, that means a stemmer and a stoplist are available, and the regex-based tokenizer works fine. Throwing an occasional Polish document into an English collection won't cause any problems; it just adds garbage that you'll only see in search results when your luck has gone weird. Finnish is on the list of supported languages, so it looks like I will need to modify the "Indo-European" tag. :)
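As a rough illustration of what "support" involves, here is a toy version of two of the three pieces: a regex-based tokenizer and a stoplist. Stemming is left out. The stoplist, the regex, and the analyze function are invented for this sketch and are not what PolyAnalyzer actually uses.

```perl
use strict;
use warnings;

# Tiny illustrative English stoplist; a real one is much longer.
my %stoplist = map { $_ => 1 } qw(the a an and of in);

# Hypothetical analyzer: split on word characters, lowercase each token,
# then drop anything that appears in the stoplist.
sub analyze {
    my ($text) = @_;
    my @tokens = map { lc $_ } $text =~ /\w+(?:'\w+)*/g;
    return grep { !$stoplist{$_} } @tokens;
}

my @terms = analyze("The history of Warsaw and Krakow");
print join(' ', @terms), "\n";    # prints "history warsaw krakow"
```

A document in an unsupported language still tokenizes this way; it simply misses out on a matching stoplist and stemmer, so its terms get indexed unstemmed and unfiltered.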

--
Marvin Humphrey
Rectangular Research ― http://www.rectangular.com

In reply to Re^2: Interventionist Unicode Behaviors by creamygoodness
in thread Interventionist Unicode Behaviors by creamygoodness
