
(My apologies if I'm really being too much of a pest...)

For example, I'd like to have KinoSearch output scalars flagged as UTF-8 by default. (The current working version in my subversion repository handles all text as UTF-8 internally.) But if I do that, then if there are any "wide characters" -- code points above 255 -- in the stream, downstream users will see those "wide character in print" warnings.

Well, considering that the current man page for KinoSearch mentions "Full support for 12 Indo-European languages" and shows ( language => 'en' ) a couple of times in the synopsis, I think it behooves you to say something right up front there about what your expectations are (and what users should expect) regarding character encoding.

It would be fine to tell users that you expect them to provide the module with utf8 data (or at least tell you what the correct encoding is for the input data), and that things could go badly for them if they don't give you that. For the 12 languages you support, it's virtually guaranteed that Perl's standard set of supported encodings covers them all easily, and you can either handle that internally in your module code (assuming people tell you which encoding to use on which data), or else include some brief examples in your tutorials to cover conversion to utf8.
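Just as a rough sketch of the kind of tutorial snippet I mean (this uses only the core Encode module; the to_characters helper is a made-up name for illustration, not anything in KinoSearch), converting input from a known legacy encoding could look something like this:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of KinoSearch): turn raw bytes in a
# known encoding into a Perl character string.
sub to_characters {
    my ( $bytes, $encoding ) = @_;

    # FB_CROAK makes decode() die on malformed input instead of
    # quietly substituting replacement characters. $bytes is a copy,
    # so the caller's data is not modified.
    return decode( $encoding, $bytes, Encode::FB_CROAK );
}

# "cafe" with e-acute, as ISO-8859-1 bytes: the last letter is byte 0xE9.
my $latin1_bytes = "caf\xE9";
my $text = to_characters( $latin1_bytes, 'iso-8859-1' );

printf "%d characters, last code point U+%04X\n",
    length($text), ord( substr( $text, -1 ) );
```

The point is only that the caller has to say which encoding the bytes are in; there is no reliable way for a module to guess.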

If you're going to accept non-unicode data as input for indexing, you should probably expect that those users will want (need) the same encoding for the outputs that you give them -- adjust your modules accordingly so you give back data the same way it is given to you.
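To make "give it back the same way" concrete, here is a minimal sketch, assuming the caller names the encoding up front. The process_in_callers_encoding wrapper and its callback argument are hypothetical, purely for illustration; only decode() and encode() from the core Encode module are real:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical wrapper: decode on the way in, work on characters
# internally, then encode to the caller's own encoding on the way out.
sub process_in_callers_encoding {
    my ( $bytes, $encoding, $work ) = @_;
    my $chars  = decode( $encoding, $bytes );
    my $result = $work->($chars);
    return encode( $encoding, $result );
}

# "strasse" spelled with sharp s, as ISO-8859-1 bytes (sharp s is 0xDF).
my $input  = "stra\xDFe";
my $output = process_in_callers_encoding(
    $input, 'iso-8859-1',
    sub { ucfirst $_[0] },    # stand-in for whatever the module really does
);

printf "in: %d bytes, out: %d bytes\n", length($input), length($output);
```

The caller never sees utf8 at all in this arrangement: bytes in ISO-8859-1 go in, and bytes in ISO-8859-1 come out.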

If you choose to accept only utf8 input, it's okay to tell users that they are going to get utf8 output from you, and they need to handle it (again, a couple lines in a tutorial or synopsis should suffice).
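For what it's worth, the "couple lines" could be about as short as the following sketch. Nothing here is KinoSearch-specific; $text just stands in for any utf8-flagged scalar a module might hand back:

```perl
use strict;
use warnings;
use Encode qw(encode);

# A character string containing a code point above 255 (U+263A).
my $text = "smile: \x{263A}";

# Option 1: put an encoding layer on the output handle. A plain print
# then works without the "Wide character in print" warning.
binmode( STDOUT, ':encoding(UTF-8)' );
print $text, "\n";

# Option 2: encode explicitly and treat the result as bytes, e.g. for
# a socket or a file handle that has no encoding layer.
my $bytes = encode( 'UTF-8', $text );
printf "%d characters became %d bytes\n", length($text), length($bytes);
```

Either way, the user picks one place where characters turn into bytes; doing both on the same handle would encode the data twice.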

With all due respect for an awesome piece of work, I have to say that many of the problems in module-vs-unicode situations arise because the module documentation implies or advertises support for text in multiple languages, but says nothing about character encoding. Why leave people to guess about that, esp. given that it can become so bizarre when it goes awry?

While you're at it, it could be helpful to list which 12 of the few hundred current Indo-European languages you support. (For all I know, you might be supporting Hungarian, Finnish and/or Turkish, which actually are not Indo-European... :) Maybe you could even give a few hints about what "support" actually means here -- e.g. whether module behavior adapts in different ways to different languages, and if so, how... (Like, if I say the language is "en", but there happen to be a few Polish docs in there by mistake, does it blow up?)

In reply to Re: Interventionist Unicode Behaviors by graff
in thread Interventionist Unicode Behaviors by creamygoodness
