For example, I'd like to have KinoSearch output scalars flagged as UTF-8 by default. (The current working version in my subversion repository handles all text as UTF-8 internally.) But if I do that, then if there are any "wide characters" -- code points above 255 -- in the stream, downstream users will see those "wide character in print" warnings.

Well, considering that the current man page for KinoSearch mentions "Full support for 12 Indo-European languages" and shows ( language => 'en' ) a couple of times in the synopsis, I think it behooves you to say something right up front there about what your expectations are (and what users should expect) regarding character encoding.
It would be fine to tell users that you expect them to provide the module with utf8 data (or at least tell you what the correct encoding is for the input data), and that things could go badly for them if they don't give you that. For the 12 languages you support, it's virtually guaranteed that Perl's standard set of supported encodings covers them all easily, and you can either handle that internally in your module code (assuming people tell you which encoding to use on which data), or else include some brief examples in your tutorials to cover conversion to utf8.
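Something like the following is the sort of brief tutorial example I have in mind. It is only a sketch using the core Encode module; the sample bytes and the choice of ISO-8859-1 are my own assumptions for illustration, not anything taken from KinoSearch.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: "cafe" with an e-acute, as raw ISO-8859-1 bytes
# (0xE9 is e-acute in that encoding).
my $latin1_bytes = "caf\xE9";

# decode() turns bytes in a named encoding into a Perl character string.
my $text = decode('ISO-8859-1', $latin1_bytes);

# Report how many characters we have and the code point of the last one.
printf "%d characters, last is U+%04X\n",
    length($text), ord(substr($text, -1));
```

The user only has to know which encoding the data is really in; once it is decoded, the module can treat it as characters.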
If you're going to accept non-Unicode data as input for indexing, you should probably expect that those users will want (need) the same encoding for the outputs that you give them -- adjust your modules accordingly so you give back data the same way it is given to you.
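A rough sketch of that "give it back the way it came in" idea follows. The helper name process_in_encoding is made up for this example, and uc is just a stand-in for whatever the module really does with the text; the only real API used is Encode's decode and encode.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: do the work on characters, but hand back bytes
# in the same encoding the caller used for the input.
sub process_in_encoding {
    my ($encoding, $bytes) = @_;
    my $chars  = decode($encoding, $bytes);   # bytes -> characters
    my $result = uc $chars;                   # stand-in for the real work
    return encode($encoding, $result);        # characters -> bytes again
}

# The Polish city name "Lodz" with its diacritics, as ISO-8859-2 bytes.
my $in  = "\xB3\xF3d\xBC";
my $out = process_in_encoding('ISO-8859-2', $in);
```

The caller never sees a utf8-flagged scalar here: Latin-2 bytes go in and Latin-2 bytes come out, while the uppercasing in between is done on proper characters.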
If you choose to accept only utf8 input, it's okay to tell users that they are going to get utf8 output from you, and they need to handle it (again, a couple lines in a tutorial or synopsis should suffice).
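For instance, the "couple of lines" on the user's side could look roughly like this. It is a sketch, assuming the module hands back an ordinary Perl character string; the $result value here is a made-up stand-in for such a return value.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical result string containing code points above 255
# (U+0142 and U+017A), as a module returning character data might give you.
my $result = "\x{142}\x{F3}d\x{17A}";

# Option 1: declare the output encoding on the filehandle once. Printing
# character strings then no longer triggers "Wide character in print".
binmode(STDOUT, ':encoding(UTF-8)');
print $result, "\n";

# Option 2: encode explicitly when you need raw bytes, e.g. for a socket
# or a binary file.
my $bytes = encode('UTF-8', $result);
```

Either approach works; the point is that the user has to pick one, and the docs should say so.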
With all due respect for an awesome piece of work, I have to say that many of the problems in module-vs-unicode situations arise because the module documentation implies or advertises support for text in multiple languages, but says nothing about character encoding. Why leave people to guess about that, especially given that it can become so bizarre when it goes awry?
While you're at it, it could be helpful to list which 12 of the few hundred current Indo-European languages you support. (For all I know, you might be supporting Hungarian, Finnish and/or Turkish, which actually are not Indo-European... :) Maybe you could even give a few hints about what "support" actually means here -- e.g. whether module behavior adapts in different ways to different languages, and if so, how... (Like, if I say the language is "en", but there happen to be a few Polish docs in there by mistake, does it blow up?)
In reply to Re: Interventionist Unicode Behaviors by graff, in thread Interventionist Unicode Behaviors by creamygoodness