in reply to Interventionist Unicode Behaviors

(My apologies if I'm really being too much of a pest...)
For example, I'd like to have KinoSearch output scalars flagged as UTF-8 by default. (The current working version in my subversion repository handles all text as UTF-8 internally.) But if I do that, then if there are any "wide characters" -- code points above 255 -- in the stream, downstream users will see those "wide character in print" warnings.
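
(For anyone who hasn't run into that warning, here's a minimal sketch that provokes it -- nothing KinoSearch-specific, just plain Perl:)

    use strict;
    use warnings;

    # \x{263A} (WHITE SMILING FACE) is a code point above 255, so this
    # scalar contains a "wide character".
    my $text = "smile \x{263A}";

    # STDOUT has no declared encoding layer, so this print warns:
    #   Wide character in print at ... line N.
    # Perl still emits UTF-8 bytes, but the warning alarms downstream users.
    print $text, "\n";
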
Well, considering that the current man page for KinoSearch mentions "Full support for 12 Indo-European languages" and shows ( language => 'en' ) a couple of times in the synopsis, I think it behooves you to say something right up front about what your expectations are (and what users should expect) regarding character encoding.

It would be fine to tell users that you expect them to provide the module with UTF-8 data (or at least to tell you the correct encoding of the input data), and that things could go badly for them if they don't. For the 12 languages you support, it's virtually guaranteed that Perl's standard set of supported encodings covers them all easily, and you can either handle the conversion internally in your module code (assuming people tell you which encoding to use on which data) or include some brief examples in your tutorials covering conversion to UTF-8.
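
For instance, a tutorial recipe could be as short as this (a minimal sketch; the file name and encoding are placeholders):

    use strict;
    use warnings;
    use Encode qw(decode);

    # Read raw bytes from a file whose encoding the user knows.
    open my $fh, '<:raw', 'doc.txt' or die "open: $!";
    my $raw_bytes = do { local $/; <$fh> };

    # Decode from the declared encoding into Perl character data,
    # which any UTF-8-expecting module can then consume safely.
    my $text = decode('iso-8859-2', $raw_bytes);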

If you're going to accept non-Unicode data as input for indexing, you should probably expect that those users will want (need) the same encoding for the output you give them -- adjust your module accordingly so that you give back data the same way it was given to you.
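
Something like this sketch of the round-trip idea -- the names are illustrative, not KinoSearch's API:

    use strict;
    use warnings;
    use Encode qw(decode encode);

    my $user_encoding  = 'cp1252';      # whatever the caller declared
    my $incoming_bytes = "caf\xE9";     # "café" as cp1252/Latin-1 bytes

    # Decode on the way in...
    my $internal = decode($user_encoding, $incoming_bytes);
    # ... index and search using $internal ...
    # ... and encode back to the same encoding on the way out.
    my $outgoing = encode($user_encoding, $internal);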

If you choose to accept only UTF-8 input, it's okay to tell users that they are going to get UTF-8 output from you and that they need to handle it (again, a couple of lines in a tutorial or synopsis should suffice).
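
Those couple of lines might be nothing more than (again, a sketch):

    use strict;
    use warnings;

    # Declare an encoding on any handle that will receive the module's
    # UTF-8 flagged scalars; the "Wide character" warning then goes away.
    binmode STDOUT, ':encoding(UTF-8)';

    my $title = "r\x{E9}sum\x{E9} \x{263A}";   # placeholder search result
    print $title, "\n";                        # clean UTF-8 output, no warning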

With all due respect for an awesome piece of work, I have to say that many of the problems in module-vs-Unicode situations arise because the module documentation implies or advertises support for text in multiple languages but says nothing about character encoding. Why leave people to guess about that, especially given how bizarre things can get when it goes awry?

While you're at it, it could be helpful to list which 12 of the few hundred current Indo-European languages you support. (For all I know, you might be supporting Catalan, Gaelic, or Irish -- or even Hungarian, Finnish, and/or Turkish, which actually are not Indo-European... :) Maybe you could even give a few hints about what "support" actually means here -- e.g. whether module behavior adapts in different ways to different languages, and if so, how... (Like, if I say the language is "en", but there happen to be a few Polish docs in there by mistake, does it blow up?)

Re^2: Interventionist Unicode Behaviors
by creamygoodness (Curate) on Sep 08, 2006 at 14:34 UTC

    Thanks for the kind words and the constructive criticism. My original goal was to make KinoSearch polymorphic, so that you could use whatever encoding you wanted. This turned out to be a mistake, and now I'm centering on UTF-8. I will add to the documentation that KinoSearch expects either UTF-8 or Latin-1, and that anything coming in that isn't flagged as UTF-8 will be translated on the assumption that it is Latin-1.
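
    In code terms, the rule will be roughly this (an illustrative sketch, not the actual KinoSearch internals -- normalize_text() is a made-up name):

        use strict;
        use warnings;
        use Encode ();

        # Scalars already flagged as UTF-8 pass through untouched; anything
        # else is assumed to be Latin-1 bytes and decoded accordingly.
        sub normalize_text {
            my ($scalar) = @_;
            return $scalar if Encode::is_utf8($scalar);    # already character data
            return Encode::decode('iso-8859-1', $scalar);  # assume Latin-1
        }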

    By far the simplest and best solution from my standpoint is to always output UTF-8. KinoSearch is large (80 modules) and there are enough points of egress that it would be best if I didn't have to double up code at each of them. That's what I've pretty much settled on. However, a user who wrote up a very nice bug report for me, including a test (!), was confused and concerned by the fact that the version which solved his problem also happened to issue a "Wide character in print" warning. Ergo, this thread.

    The 12 languages are listed in the docs for KinoSearch::Analysis::PolyAnalyzer. You're right that it could be plainer what they are, and I will either add the list verbatim to the main KinoSearch documentation page or link to it directly from there. With regard to "support" for a language, it means that a stemmer and a stoplist are available and that the regex-based tokenizer works fine. Throwing an occasional Polish document into an English collection won't cause any problems, just garbage that you'll only see in search results when your luck has gone weird. Finnish is on the list of supported languages, so it looks like I will need to modify the "Indo-European" tag. :)
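
    Concretely, selecting a language looks like this (per the synopsis; the PolyAnalyzer docs have the authoritative list):

        use KinoSearch::Analysis::PolyAnalyzer;

        # PolyAnalyzer bundles the tokenizer, stemmer, and stoplist
        # for the named language.
        my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new(
            language => 'en',
        );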

    --
    Marvin Humphrey
    Rectangular Research ― http://www.rectangular.com