Greets,

What's the rationale behind some of Perl's aggressively "helpful" UTF-8 handling features?

For instance, STDOUT takes it upon itself to attempt translation of data rather than just printing what I want...

#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( _utf8_on );

my $smiley = "\xE2\x98\xBA";

# prints a smiley on my UTF-8 enabled terminal
print $smiley . "\n";

# also prints a smiley, but issues "Wide character in print" warning first.
_utf8_on($smiley);
print $smiley . "\n";

Is there any reason why STDOUT shouldn't just print whatever gets thrown at it? How about, if I want it to perform automatic translation, I turn that feature on?

Another "feature" that's bitten me in the butt is the silent "upgrading" (i.e. corruption) of non-UTF8 scalars when concatenated with UTF8 scalars...

#!/usr/bin/perl
use strict;
use warnings;

# utf8_concat.plx -- concat utf8 and non-utf8 strings

use Encode qw( _utf8_on );

printf("%-20s%s\n", "STRING", "OCTETS");

my $foo = "foo";
print_octets('"foo"', $foo);

my $packed_num = pack('N', 128);
print_octets("packed num:", $packed_num);

print_octets("non-utf8 concat:", $foo . $packed_num);

_utf8_on($foo);
print_octets("utf8 concat:", $foo . $packed_num);

sub print_octets {
    my ($label, $string) = @_;
    printf("%-20s", $label);
    my @octets = unpack('C*', $string);
    print "@octets\n";
}

__END__
Output:
STRING              OCTETS
"foo"               102 111 111
packed num:         0 0 0 128
non-utf8 concat:    102 111 111 0 0 0 128
utf8 concat:        102 111 111 0 0 0 194 128

Why do things that way? Here's the API I'd have preferred:

print $utf8;      # UTF8 flag is on
print $non_utf8;  # UTF8 flag is off

# same output as above but issues warning if $non_utf8 isn't valid UTF-8
print $utf8 . $non_utf8;

# produces translated output after manual intervention
my $translated_to_utf8 = Encode::decode("iso-8859-1", $non_utf8);
print $utf8 . $translated_to_utf8;

These "helpful" behaviors get in the way of using UTF-8! For example, I'd like to have KinoSearch output scalars flagged as UTF-8 by default. (The current working version in my subversion repository handles all text as UTF-8 internally.) But if I do that, then if there are any "wide characters" -- code points above 255 -- in the stream, downstream users will see those "wide character in print" warnings.

Think there's any chance these behaviors could change in Perl 5.10? Is it worth bringing up on p5p?

--
Marvin Humphrey
Rectangular Research ― http://www.rectangular.com

Replies are listed 'Best First'.
Never touch (or look at) the UTF8 flag!!
by Juerd (Abbot) on Sep 08, 2006 at 08:54 UTC

    What's the rationale behind some of Perl's aggressively "helpful" UTF-8 handling features?

    The rationale is not what you think it is, but more importantly: Perl's UTF-8 handling is not what you think it is. From your code, I can only assume that you are trying to apply your character encoding knowledge to Perl without first learning about Perl's way of handling text strings.

    Perl has two kinds of strings, but you can't see what kind of string a given string is. You have to keep track of this yourself.

    The first kind is the default kind: the binary string. The second kind is: the text string. Please note that text strings do not have any encoding! (Though internally, it's utf-8)
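A minimal sketch of the distinction (decode() is the sanctioned way to cross from a binary string to a text string; the byte values here are mine, chosen for illustration):

```perl
use strict;
use warnings;
use Encode qw(decode);

# A binary string: two bytes, no character semantics attached.
my $bytes = "\xC3\xA9";          # happens to be the UTF-8 encoding of "é"

# A text string: made by decoding the bytes. Now it is one character.
my $text = decode("UTF-8", $bytes);

print length($bytes), "\n";      # 2 -- length counts bytes here
print length($text),  "\n";      # 1 -- length counts characters here
```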

    1. my $smiley = "\xE2\x98\xBA";
    2. print $smiley . "\n";
    3. _utf8_on($smiley);
    4. print $smiley . "\n";

    1. You assign three characters to $smiley. Internally, this may be encoded as latin-1 (three bytes) or as utf-8 (six bytes). Either way, they are three different characters, not one character consisting of three bytes!!
    2. You print the string, but there is no output encoding set on STDOUT, and you don't encode it explicitly. Perl can't know what to do, so it just outputs the bytes as they exist in the string. If it happened to be stored as latin-1 before, then the output will be three bytes. If it happened to be stored as utf-8 before, then the output will be six bytes.
    3. RED FLAG. Here you manually switch on the internal UTF8 flag. You should NEVER do this, unless you know all the details of Perl's UTF8 handling. If the string happened to be stored as latin-1 before, you're lucky because this sequence of bytes happens to also be valid utf-8: you get one smiley. If the string happened to be stored as utf-8 before, nothing happens because the UTF8 flag was already set.
    4. You print the string again. While before you might have had a byte encoding, you now certainly have a wide character in your string, and Perl warns you that you're doing something stupid: you're printing a string that has a wide character in it, so you should have encoded it explicitly.

    As you can see, two very different things can happen to $smiley. This may seem useless and totally wrong, because it makes Perl unpredictable. However, if you code the way the Perl gods intended, things go right, and Unicode support in Perl proves to be incredibly helpful and not all that weird anymore.

    First of all, your string "\xE2\x98\xBA" is wrong if you want a smiley. These three characters are:

    U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
    U+0098 START OF STRING
    U+00BA MASCULINE ORDINAL INDICATOR
    If you wanted the smiley face, you should have asked for it!
    U+263A WHITE SMILING FACE
    There are several ways of doing this:
    
    # The \x{...} notation
    my $smiley = "\x{263A}";
    
    # The \N{...} notation
    use charnames ':full';
    my $smiley = "\N{WHITE SMILING FACE}";
    
    # A literal UTF-8 encoded smiley face in your code
    use utf8;
    my $smiley = "☺";
    
    Secondly, when you print something, you should let Perl know in which encoding you would like your data. This can again be done in several ways:
    # Explicit encode()
    use Encode qw(encode);
    print encode("UTF-8", $smiley);

    # Set the filehandle to the encoding UTF-8
    binmode STDOUT, ":encoding(UTF-8)";
    print $smiley;

    # Set the filehandle to :utf8, a shortcut syntax because you need it so often
    binmode STDOUT, ":utf8";
    print $smiley;
    The thing to remember is that Perl does handling of encodings (like UTF-8) for you, and that you shouldn't do it yourself. You encoded your smiley face as UTF-8 yourself, and then let Perl know the three individual bytes. That's a possibility, but it's much better in many ways if you let Perl handle this for you.

    Is there any reason why STDOUT shouldn't just print whatever gets thrown at it? How about, if I want it to perform automatic translation, I turn that feature on?

    It's funny that you describe exactly what Perl does. It prints whatever it gets, whether that data makes sense or not. If you want automatic translation (You do!!), you have to turn that feature on.

    Another "feature" that's bitten me in the butt is the silent "upgrading" (i.e. corruption) of non-UTF8 scalars when concatenated with UTF8 scalars...

    It bites only if you do something wrong, or have your environment set up badly. In this case, you're doing something wrong. You see, a text string should always be encode()d explicitly before you concatenate it to any non-text string. The packed num is not a text string, but a sequence of bytes. If you need both these strings in one stream, obviously it's a binary stream, and you need to encode the text string using the required encoding. For example:

    use utf8;
    use Socket qw(inet_aton);
    use Encode qw(encode);

    my $text_string   = "Héllø wõrld!";
    my $binary_string = inet_aton("127.0.0.1");
    my $data_to_send;

    # Now, we need to send both in one go!
    # But we can't do that directly, because $text_string needs to be encoded first.
    # How shall we encode it?

    # As UTF-8?
    $data_to_send = encode("UTF-8", $text_string) . $binary_string;
    print $data_to_send;

    # As ISO-8859-1?
    $data_to_send = encode("ISO-8859-1", $text_string) . $binary_string;
    print $data_to_send;

    # As KOI8-R?
    $data_to_send = encode("KOI8-R", $text_string) . $binary_string;
    print $data_to_send;
    # Oops, these characters don't exist in KOI8-R, so Perl used question marks. Heehee :)

    Think there's any chance these behaviors could change in Perl 5.10? Is it worth bringing up on p5p?

    No chance this will change, because it's a solid system. I'm sure that once you understand how it works, you will also be able to use it, and maybe even love it.

    Perl 6, on the other hand, doesn't have to be compatible with legacy code, and it has a very nice type system. The two combined mean that Perl 6 can use two different string types, Buf and Str. They're very simple, and Perl will scream if you ever try to combine the two in concatenation.

    my Buf $byte_string;
    my Str $text_string;

    Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

      Juerd,

      RED FLAG. Here you manually switch on the internal UTF8 flag. You should NEVER do this, unless you know all the details of Perl's UTF8 handling. If the string happened to be stored as latin-1 before, you're lucky because this sequence of bytes happens to also be valid utf-8: you get one smiley. If the string happened to be stored as utf-8 before, nothing happens because the UTF8 flag was already set.

      Luck had nothing to do with it. :) It was three bytes in the string because I specified those exact bytes. Then I turned on the UTF8 flag and got exactly what I wanted: a Perl scalar with a PV of 0xE2 0x98 0xBA 0x00, a LEN of 4, a CUR of 3, with the SVf_UTF8 flag set, yada yada. I chose not to represent the string as "\x{263a}" or "\N{WHITE SMILING FACE}" because in both of those cases the SVf_UTF8 flag would have been set -- whereas by using raw hex notation, I coerced Perl into parsing the string using byte semantics.

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Devel::Peek;

      my $bytes = "\xE2\x98\xBA";
      my $uni   = "\x{263a}";

      # Only one difference between these: the UTF8 flag is on for $uni
      Dump($bytes);
      Dump($uni);

      __END__
      Outputs:

      SV = PV(0x1801660) at 0x180b584
        REFCNT = 1
        FLAGS = (PADBUSY,PADMY,POK,pPOK)
        PV = 0x300bd0 "\342\230\272"\0
        CUR = 3
        LEN = 4
      SV = PV(0x1801678) at 0x180b560
        REFCNT = 1
        FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
        PV = 0x316c80 "\342\230\272"\0 [UTF8 "\x{263a}"]
        CUR = 3
        LEN = 4

      I'm personally quite comfortable turning the UTF8 flag on and off, as I'm confident that I know when a sequence of bytes is valid UTF-8 and when it's not. I do a lot of XS programming, and I've written a fair amount of low-level unicode processing code. See this patch of mine for Apache Lucene that fixes the IO classes so that they read/write legal UTF-8 (which they'd claimed they were using in their specs) rather than the Modified UTF-8 they'd actually been using.

      However switching SVf_UTF8 on and off is not something I do lightly, or that I would recommend to the casual user, so there we are in agreement.

      I'm sure that once you understand how it works, you will also be able to use it, and maybe even love it.

      The basic system is not mysterious. SVf_UTF8 is either on or it isn't. (and if it's on, it better be right :).
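For what it's worth, the flag is also visible from pure Perl via utf8::is_utf8 (a small sketch; it reports the internal SVf_UTF8 flag, nothing more):

```perl
use strict;
use warnings;

my $bytes = "\xE2\x98\xBA";    # parsed with byte semantics: flag off
my $uni   = "\x{263a}";        # wide character: flag on

print utf8::is_utf8($bytes) ? "on\n" : "off\n";    # off
print utf8::is_utf8($uni)   ? "on\n" : "off\n";    # on
```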

      Perl 6 can use two different string types, Buf and Str.
      If Str is limited to Unicode and only Unicode, that's Nirvana...
      --
      Marvin Humphrey
      Rectangular Research ― http://www.rectangular.com

        # Only one difference between these: the UTF8 flag is on for $uni

        As far as I understand it, this need not be true. You make the comment that you do a lot of XS programming. It seems to me that you are equating the XS concept of altering the PV with the Perl concept of assigning to a string. With byte semantics that is correct, but with utf8 semantics it is not. I don't believe there is any guarantee that this will always be true. For instance, perl 5.12 could be entirely Unicode internally, and your program would break. Likewise, if the string were utf8-on before you did the assignment, the result would be different. If you want to operate at the level you seem to, I think you should use pack.

        Also a little nit: perl doesn't do UTF-8, it does utf8, which is subtly different from true UTF-8. Although in the context here I don't think it matters.

        ---
        $world=~s/war/peace/g

        Luck had nothing to do with it. :) It was three bytes in the string because I specified those exact bytes.

        If this piece of code was in a module, and the main script disagreed about the encoding, things would have broken. For example, if "use encoding 'iso-8859-1';" was used: you'd get a UTF8 string of three characters, but six bytes.

        That's why you should never use \x for literal bytes. Instead, use pack with a "C*" template. Only if nothing in your script uses the new stuff can you be sure you get the old stuff.
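A sketch of the pack() approach, using the OP's smiley bytes (the exact octets are pinned down regardless of any source-encoding pragma in effect):

```perl
use strict;
use warnings;

# Three literal bytes, unaffected by 'use utf8' or 'use encoding' in the file:
my $bytes = pack("C*", 0xE2, 0x98, 0xBA);

my @octets = unpack("C*", $bytes);
print "@octets\n";    # 226 152 186
```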

        I'm personally quite comfortable turning the UTF8 flag on and off, as I'm confident that I know when a sequence of bytes is valid UTF-8 and when it's not. I do a lot of XS programming, and I've written a fair amount of low-level unicode processing code. See this patch of mine for Apache Lucene that fixes the IO classes so that they read/write legal UTF-8 (which they'd claimed they were using in their specs) rather than the Modified UTF-8 they'd actually been using.

        For someone who's sufficiently skilled in Perl, unicode, and the combination of both, you managed to appear quite clueless in the OP. But now I wonder if you were actually serious (if so, please rephrase your question, this time based on the way you SHOULD use things), or just trolling.

        Perl 6 can use two different string types, Buf and Str.
        If Str is limited to Unicode and only Unicode, that's Nirvana...

        It is. A Buf can have an :encoding attribute, but a Str is always unicode. That is: unicode, not utf-8: exactly as in Perl 5, you must not care about the internal encoding unless you're actually doing internal things. Perl does unicode for its text strings, not utf-8. That's why you have to decode() and encode() yourself.

        Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

Re: Interventionist Unicode Behaviors
by graff (Chancellor) on Sep 08, 2006 at 05:29 UTC
    For instance, STDOUT takes it upon itself to attempt translation of data rather than just printing what I want...

    (code snippet)

    Is there any reason why STDOUT shouldn't just print whatever gets thrown at it? How about, if I want it to perform automatic translation, I turn that feature on?

    Huh? In your snippet, perl's STDOUT is "just printing whatever gets thrown at it", without doing any sort of "translation" on it.

    The first output is a three-byte sequence that, when viewed on a utf8-aware display, will show a single unicode character. If you want to see that as three separate bytes, print to something that does not do utf8 interpretation -- e.g. (in unix):

    perl -e 'print "\xE2\x98\xBA"' | od -txC

    In the second output, you've told perl that your three-byte string should be interpreted by perl internally as utf8 data, and then you print it to a file handle that has not been configured for that encoding, so you get the warning, but that's just a warning, and the output is effectively the same as it was before -- and how you see it will depend on what you use to view it.

    (In perl 5.8.0, esp. with Red Hat, perl actually referred to the user's "locale" settings in order to "automagically" do utf8 conversion on output whenever the locale cited utf8; everyone quickly agreed that this was a Big Mistake™, and the behavior was corrected in 5.8.1, never to return.)

    Another "feature" that's bitten me in the butt is the silent "upgrading" (i.e. corruption) of non-UTF8 scalars when concatenated with UTF8 scalars...

    Conceptually, appending a non-UTF8 string to a UTF8 string is a really bad idea, bordering on stupid. Don't do that. (Why would you want to? What would you hope to accomplish as a result?)

    Your second snippet shows the "special" (quasi-ambiguous) status of byte values in the \x80-\xFF range in perl 5.8: (update:) when used in a "raw" (non-utf8) context, they are treated simply as single byte values without further ado -- e.g.  print "\xA0" prints just one byte when STDOUT is in ":raw" (default) mode -- but when used in a utf8 context (e.g. appended to a utf8 string or printed to a file handle that is set to utf8 mode), they are automatically "upgraded" to utf8 characters by changing the single byte to its two-byte utf8 equivalent. For people migrating out of iso-8859-1 into unicode (which is quite a few people, even now), this prevents a lot more trouble than it creates. Admittedly, a lot of people who don't yet understand unicode and/or utf8 can and do get into trouble with this.
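The single-byte-to-two-byte change can be observed with encode(), without poking at the internal representation (a version-safe sketch):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\xA0";                           # one character, code point U+00A0
my $utf8 = encode("UTF-8", $char);           # its two-byte UTF-8 form

print join(" ", unpack("C*", $char)), "\n";  # 160
print join(" ", unpack("C*", $utf8)), "\n";  # 194 160
```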

    As for your "preferred API", I don't think I understand what you are trying to demonstrate with the first two "print" statements. As for the third print statement ("$utf8 . $non_utf8"), if the latter scalar contains data that cannot be parsed as utf8, any utf8-aware display will simply put question-marks for the bytes that make no sense. That's what the Unicode Standard says is the appropriate thing to do; Perl will only tell you your non-utf8 data cannot be used directly as utf8 if/when you try to do:

    decode( 'utf8', $non_utf8, Encode::FB_CROAK ); # or Encode::FB_WARN
    or you can do the "default" decoding, without the third "check" parameter, and the resulting string will contain one or more \x{FFFD} unicode characters (rendered in three utf8 bytes, of course), which refers to a code point labeled "REPLACEMENT CHARACTER", which will either be ignored or show up as a question-mark, depending on what utf8-aware tool you use to view it.
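A small sketch contrasting the strict and default checking modes (FB_CROAK is exported by Encode; the input bytes are deliberately malformed):

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

my $bad = "\xFF\xFE";    # not valid utf8

# Strict: croaks on malformed input (decode may modify its argument
# under CHECK, so hand it a copy).
my $copy   = $bad;
my $strict = eval { decode("utf8", $copy, FB_CROAK) };
print defined $strict ? "decoded\n" : "croaked: $@";

# Default: each malformed byte becomes U+FFFD REPLACEMENT CHARACTER.
my $lenient = decode("utf8", $bad);
print "got replacement character\n" if $lenient =~ /\x{FFFD}/;
```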

    If you have non-utf8 data and you want to "display" it using a utf8-aware terminal/window, you need to figure out how to make it intelligible, both to the displayer and to the user.

    To get rid of the "wide character in print" warnings, do  binmode FILEHANDLE, ":utf8" or use the three-arg version of the "open" statement when opening an output file:  open FH, ">:utf8", $filename -- check the man page for "open" (perldoc -f open).
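For instance, a minimal sketch with an in-memory handle standing in for a file (the :utf8 layer behaves the same with a real filename):

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";    # WHITE SMILING FACE, a wide character

# Three-arg open with an explicit layer; a filename would work the same way.
open my $fh, ">:utf8", \my $buffer or die "open: $!";
print {$fh} $smiley, "\n";  # no "Wide character in print" warning
close $fh;

# The scalar now holds the UTF-8 bytes of the smiley plus a newline.
print join(" ", map { sprintf "%02x", $_ } unpack "C*", $buffer), "\n";
# e2 98 ba 0a
```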

      Huh? In your snippet, perl is "just printing whatever gets thrown at it", without doing any sort of "translation" on it.

      It's trying and failing to convert Unicode code point 0x263a to Latin-1. We see the warning because it's impossible to translate a code point that high to Latin 1.

      I thought the example I gave was the easiest to grok, but this is probably better, because the output is actually different.

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Encode qw( _utf8_on );

      my $resume = "r\xc3\xa9sum\xc3\xa9";
      print $resume, "\n";
      _utf8_on($resume);
      print $resume, "\n";
      Conceptually, appending a non-UTF8 string to a UTF8 string is a really bad idea, bordering on stupid. Don't do that. (Why would you want to? What would you hope to accomplish as a result?)

      I'd like to spit out scalars flagged as UTF8 by default from KinoSearch. But if I do that, that means anybody who gets that output is going to have to know how to deal with them. I don't want to spend all my time explaining the bottomless intricacies of Unicode handling in Perl to people. It's not that I want to be doing a lot of this concatenation, it's that I know it's going to happen some of the time and I don't want the support burden.

      --
      Marvin Humphrey
      Rectangular Research ― http://www.rectangular.com

        It's trying and failing to convert Unicode code point 0x263a to Latin-1.

        No, it is not.

        You asked for the code points E2, 98 and BA, and you got them. You then manually messed around with the UTF8 flag. Because of your environment, Perl encoded the three-character string as latin-1, so the bytes were E2 98 BA, and so you are lucky. Then you set the UTF8 flag on, and finally you have that code point 263a, but you didn't get it the way you should have. When you print this string, however, there's no conversion going on AT ALL, because you never specified what to convert TO!

        Perl has no choice but to dump its internal representation to STDOUT, but is friendly enough to warn you that this output may not be what you want, because it doesn't know what you want.

        We see the warning because it's impossible to translate a code point that high to Latin 1.

        No, we see the warning because you're printing something that has the UTF8 flag set (and thus with certainty is a text string), to a filehandle that doesn't have an encoding attached to it.

        I don't want to spend all my time explaining the bottomless intricacies of Unicode handling in Perl to people.

        Neither do we, but apparently you INSIST on using the internals directly instead of the way things were intended, so we have to explain these bottomless intricacies of Unicode handling in Perl's internals to you if you're ever to understand what the heck your broken code really does.

        Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

        It's trying and failing to convert Unicode code point 0x263a to Latin-1. We see the warning because it's impossible to translate a code point that high to Latin 1.
        No, that first snippet in the OP is not trying to convert Unicode U+263A to Latin-1; the wide-character warning is simply telling you that you have a string that perl is treating as utf8 data, and it's being printed to a file handle that has not been set up for that. You're right that it would be impossible to translate that code point to Latin-1, but since the snippet is not doing that, it's not an issue here.

        As for your snippet with "$resume", that demonstrates a "feature" of Perl's internal character representation that I was unaware of until now -- thanks for pointing this out. To clarify, the actual byte sequence differs in the two print statements, as follows:

        # first output line:
        72 c3 a9 73 75 6d c3 a9 0a
        r  c3 a9 s  u  m  c3 a9 nl

        # second output line:
        72 e9 73 75 6d e9 0a
        r  e9 s  u  m  e9 nl
        So, the "quasi-ambiguous" nature of bytes/characters in the \x80-\xff (U+0080-U+00FF) range seems deeper, subtler, more strange than I expected: for this set, perl's internal representation is still single-byte, not really utf8.

        (Update: Well, on second thought, maybe I'm still not so clear on this myself; the fact that "\xc3\xa9" turns into "\xe9" -- the single-byte Latin-1 é -- because you hit it with the perl-internal "_utf8_on" function and print it to a non-utf8 file handle... that is some heavy-weight voodoo. I'm with Juerd: don't play with _utf8_on -- you should have seen and heeded the warning about that in the Encode docs.)

        Actually, I had already been aware of that (in some sense), but I had not seen its effect on file output. If you put  binmode STDOUT, ":utf8" between the two print statements (to coincide with upgrading the string to utf8), the byte sequences of the two outputs would be identical.
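The effect of the layer can be checked without a terminal by printing the same text string through two in-memory handles (a sketch; it builds the string with decode() rather than _utf8_on):

```perl
use strict;
use warnings;
use Encode qw(decode);

my $resume = decode("UTF-8", "r\xc3\xa9sum\xc3\xa9");    # text string, flag on

open my $raw_fh,  ">",      \my $out_raw  or die "open: $!";
open my $utf8_fh, ">:utf8", \my $out_utf8 or die "open: $!";

print {$raw_fh}  $resume;    # default layer: downgraded to latin-1 bytes
print {$utf8_fh} $resume;    # :utf8 layer: encoded as utf8 bytes
close $raw_fh;
close $utf8_fh;

print join(" ", map { sprintf "%02x", $_ } unpack "C*", $out_raw),  "\n";  # 72 e9 73 75 6d e9
print join(" ", map { sprintf "%02x", $_ } unpack "C*", $out_utf8), "\n";  # 72 c3 a9 73 75 6d c3 a9
```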

        Now I understand much better what the rationale is behind the "wide character" warnings -- the behavior demonstrated here is a case of character data that ought to be interpreted as utf8 on output (because it has the utf8 flag turned on), but is not being so interpreted (so it comes out as non-utf8 data, i.e. ill-formed/undisplayable).

        So the basis of this trouble is not specifically PerlIO layers, but rather Perl's current internal representation of this byte/character range, and how that interacts with "default" vs. "utf8" file handles.

        It's a difficult, tricky situation... As you demonstrate, leaving STDOUT in its default state throughout causes one kind of problem. But if it were set to ":utf8" before the first print statement, the two outputs would again be different, but in a different way, and the first one would be "wrong":

        # first line (after binmode STDOUT, ":utf8")
        72 c3 83 c2 a9 73 75 6d c3 83 c2 a9 0a
        r  c3 83 c2 a9 s  u  m  c3 83 c2 a9 nl

        # second line (still ":utf8")
        72 c3 a9 73 75 6d c3 a9 0a
        r  c3 a9 s  u  m  c3 a9 nl
        I'd like to spit out scalars flagged as UTF8 by default from KinoSearch...

        I've got a reply to that elsewhere in this thread. (I'm full of replies tonight, it seems.)

Re: Interventionist Unicode Behaviors
by ysth (Canon) on Sep 08, 2006 at 06:49 UTC
    Another "feature" that's bitten me in the butt is the silent "upgrading" (i.e. corruption) of non-UTF8 scalars when concatenated with UTF8 scalars...
    Then you want encoding::warnings.
    Think there's any chance these behaviors could change in Perl 5.10? Is it worth bringing up on p5p?
    Sure, but keep in mind that a lot of very smart people have put a lot of thought into the existing behavior, and where there are flaws or caveats, alternative behavior was judged to be worse. If you want to propose a change, make sure you think through the possible drawbacks.

    It sounds like you want the utf8 flag on handles to go away and just have the output encoding depend on perl's internal encoding of the data - this sounds very bad in a number of ways.

      Then you want encoding::warnings.

      I can't foist that on people who use my CPAN library. The problem isn't me -- it's the userbase. Expertise varies. Some are extremely sophisticated. Most aren't.

      I do like that module, nevertheless. I think its behavior should be rolled into the core warnings pragma. It's probably too late for that now, though.

      It sounds like you want the utf8 flag on handles to go away and just have the output encoding depend on perl's internal encoding of the data

      No, that's not an accurate characterization. I would like filehandles -- particularly STDOUT -- to be encoding-agnostic by default. However, it should be possible to turn on encoding enforcement using the current mechanism.

      --
      Marvin Humphrey
      Rectangular Research ― http://www.rectangular.com
        It sounds like you want the utf8 flag on handles to go away and just have the output encoding depend on perl's internal encoding of the data
        No, that's not an accurate characterization. I would like filehandles -- particularly STDOUT -- to be encoding-agnostic by default. However, it should be possible to turn on encoding enforcement using the current mechanism.
        It sounded like you meant
        perl -we'$_ = "\xb1"; print; utf8::upgrade($_); print'
        should output three bytes, not two (if STDOUT is not utf8) or four (if STDOUT is utf8, e.g. with -CO). If that's not your "encoding-agnostic" (that I call "the output depending on perl's internal encoding of the data"), I'm not sure what you mean by "encoding-agnostic".
Re: Interventionist Unicode Behaviors
by graff (Chancellor) on Sep 08, 2006 at 09:44 UTC
    (My apologies if I'm really being too much of a pest...)
    For example, I'd like to have KinoSearch output scalars flagged as UTF-8 by default. (The current working version in my subversion repository handles all text as UTF-8 internally.) But if I do that, then if there are any "wide characters" -- code points above 255 -- in the stream, downstream users will see those "wide character in print" warnings.
    Well, considering that the current man page for KinoSearch mentions "Full support for 12 Indo-European languages" and shows ( language => 'en' ) a couple of times in the synopsis, I think it behooves you to say something right up front there about what your expectations are (and what users should expect) regarding character encoding.

    It would be fine to tell users that you expect them to provide the module with utf8 data (or at least tell you what the correct encoding is for the input data), and that things could go badly for them if they don't give you that. For the 12 languages you support, it's virtually guaranteed that Perl's standard set of supported encodings covers them all easily, and you can either handle that internally in your module code (assuming people tell you which encoding to use on which data), or else include some brief examples in your tutorials to cover conversion to utf8.

    If you're going to accept non-unicode data as input for indexing, you should probably expect that those users will want (need) the same encoding for the outputs that you give them -- adjust your modules accordingly so you give back data the same way it is given to you.

    If you choose to accept only utf8 input, it's okay to tell users that they are going to get utf8 output from you, and they need to handle it (again, a couple lines in a tutorial or synopsis should suffice).

    With all due respect for an awesome piece of work, I have to say that many of the problems in module-vs-unicode situations arise because the module documentation implies or advertises support for text in multiple languages, but says nothing about character encoding. Why leave people to guess about that, esp. given that it can become so bizarre when it goes awry.

    While you're at it, it could be helpful to list which 12 of the few hundred current Indo-European languages you support. (For all I know, you might be supporting Catalan, Gaelic, Irish, Hungarian, Finnish and/or Turkish, which actually are not Indo-European... :) Maybe you could even give a few hints about what "support" actually means here -- e.g. whether module behavior adapts in different ways to different languages, and if so, how... (Like, if I say the language is "en", but there happens to be a few Polish docs in there by mistake, does it blow up?)

      Thanks for the kind words and the constructive criticism. My original goal was to make KinoSearch polymorphic, so that you could use whatever encoding you wanted. This turned out to be a mistake, and now I'm centering on UTF-8. I will add to the documentation that KinoSearch expects either UTF-8 or Latin-1, and that it will translate anything coming in that isn't flagged as UTF-8, working on the assumption that it is Latin-1.
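A sketch of that policy as a single normalization helper (the function name is hypothetical, not actual KinoSearch API; it assumes unflagged input is Latin-1, per the documentation plan above):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: coerce any incoming scalar to a text string.
# Scalars without the UTF8 flag are assumed to be Latin-1.
sub normalize_to_text {
    my ($string) = @_;
    return $string if utf8::is_utf8($string);
    return decode("iso-8859-1", $string);
}

my $latin1 = "r\xe9sum\xe9";              # unflagged byte string
my $text   = normalize_to_text($latin1);
print length($text), "\n";                # 6 characters
```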

      By far the simplest and best solution from my standpoint is to always output UTF-8. KinoSearch is large (80 modules) and there are enough points of egress that it would be best if I didn't have to double up code at each of them. That's what I've pretty much settled on. However, a user who wrote up a very nice bug report for me, including a test (!), was confused and concerned by the fact that the version which solved his problem also happened to issue a "Wide character in print" warning. Ergo, this thread.

      The 12 languages are listed in the docs for KinoSearch::Analysis::PolyAnalyzer. You're right that it could be plainer what they are, and I will either add the list verbatim or a direct link to it from the main KinoSearch documentation page. With regards to "support" for a language, that means a stemmer and a stoplist are available, and the regex-based tokenizer works fine. Throwing an occasional Polish document into an English collection won't cause any problems, just garbage that you'll only see in search results when your luck has gone weird. Finnish is on the list of supported languages, so it looks like I will need to modify the "Indo-European" tag. :)

      --
      Marvin Humphrey
      Rectangular Research ― http://www.rectangular.com