Unicode problem

edis has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Unicode problem by idsfa (Vicar) on Jun 16, 2005 at 05:48 UTC
Make sure that you have the encoding you are looking for: `use Encode; print join("\n", Encode->encodings(":all"));` (the double quotes matter, since `'\n'` would print a literal backslash-n, and `":all"` lists every available encoding rather than only those already loaded). You might also want to check the value of: `Encode::is_utf8($_);` Finally, it cannot hurt anything but performance to do: `$uni = encode_utf8($_);`

The intelligent reader will judge for himself. Without examining the facts fully and fairly, there is no way of knowing whether vox populi is really vox dei, or merely vox asinorum. -- Cyrus H. Gordon
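The checks suggested above can be put together in one small script. This is only a sketch: the helper names `encoding_available` and `decode_bytes` are made up for illustration, and the sample string stands in for whatever text the original poster was searching for.

```perl
use strict;
use warnings;
use utf8;    # the string literal below is UTF-8 source text
use Encode qw(decode encode find_encoding is_utf8);

# Hypothetical helper: true if Encode can resolve the given encoding name
# (aliases such as 'sjis' are resolved by find_encoding as well).
sub encoding_available {
    my ($name) = @_;
    return defined find_encoding($name) ? 1 : 0;
}

# Hypothetical helper: turn raw bytes in a named encoding into a Perl
# character string, so that index(), eq and regexes compare characters.
sub decode_bytes {
    my ($encoding, $bytes) = @_;
    return decode($encoding, $bytes);
}

# Illustrative data: encode a short string to Shift-JIS bytes, then decode it back.
my $chars      = "日本語";
my $sjis_bytes = encode('shiftjis', $chars);
my $decoded    = decode_bytes('shiftjis', $sjis_bytes);

printf "shiftjis available:         %s\n", encoding_available('shiftjis') ? 'yes' : 'no';
printf "raw bytes flagged as chars: %s\n", is_utf8($sjis_bytes) ? 'yes' : 'no';
printf "decoded flagged as chars:   %s\n", is_utf8($decoded)    ? 'yes' : 'no';
printf "round trip matches:         %s\n", $decoded eq $chars   ? 'yes' : 'no';
```

The point of the sketch is that both sides of a comparison should be decoded character strings; comparing raw Shift-JIS bytes against a UTF-8 literal will not match even when the text looks identical.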
Re: Unicode problem by mugwumpjism (Hermit) on Jun 16, 2005 at 21:50 UTC
I found Unicode-MapUTF8 useful for explicitly converting between arbitrary character sets and UTF-8.

$h=$ENV{HOME};my@q=split/\n\n/,`cat $h/.quotes`;$s="$h/." ."signature";$t=`cat $s`;print$t,"\n",$q[rand($#q)],"\n";
Re: Unicode problem by graff (Chancellor) on Jun 17, 2005 at 03:25 UTC
I have a dumb question (but who knows)... Are you really positive that "file.txt" actually contains at least one occurrence of "short_japanese_text_in_utf8", and that it really is encoded in sjis? (And are you being careful about checking for distinct white-space characters that might mess up the comparisons?)

I'm not familiar with Japanese, but I wonder if there might be a problem because of the way Unicode handles this language. Because Japanese and Chinese (and Korean) use a number of "common" ideographic characters, Unicode has created a "unified CJK" character set, which represents, roughly, the union of ideographs used in the three languages. It's conceivable that some "ambiguities" might exist in the sjis-unicode mappings, such that one Unicode code point could reasonably be used in place of two distinct sjis code points, or vice versa. This could mean that two strings would look "the same" when viewed by a casual human reader, despite the use of one or another distinct code point.
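One way to test for this kind of look-alike mismatch is to dump the code points of both strings and find where they first diverge. The following is a rough sketch; `codepoints` and `first_difference` are hypothetical helpers, and the two sample strings (one ending in an ideographic space U+3000, the other in an ASCII space) are invented for illustration.

```perl
use strict;
use warnings;
use utf8;    # the string literals below are UTF-8 source text

# Hypothetical helper: render each character of a decoded string as U+XXXX.
sub codepoints {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $str;
}

# Hypothetical helper: index of the first position where the two strings
# differ (including a length mismatch), or -1 if they are identical.
sub first_difference {
    my ($left, $right) = @_;
    my $max = length($left) > length($right) ? length($left) : length($right);
    for my $i (0 .. $max - 1) {
        my $l = $i < length($left)  ? substr($left,  $i, 1) : '';
        my $r = $i < length($right) ? substr($right, $i, 1) : '';
        return $i if $l ne $r;
    }
    return -1;
}

# Illustrative data: the strings look alike on screen but end in different spaces.
my $from_file = "言葉\x{3000}";    # ideographic space
my $lookup    = "言葉 ";           # ASCII space

print "file:   ", codepoints($from_file), "\n";
print "lookup: ", codepoints($lookup),    "\n";
print "first difference at index ", first_difference($from_file, $lookup), "\n";
```

Both strings must already be decoded character strings for the dump to be meaningful; run on raw bytes, it would show the bytes of the encoding instead.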
Re: Unicode problem by dakkar (Hermit) on Jun 19, 2005 at 09:55 UTC
Works for me, Perl 5.8.2 on Linux. I wrote:

    #!/usr/bin/perl
    use utf8;
    open my $fh,'<:encoding(sjis)','j.sjis.txt';
    my $string='言葉';
    print index(<$fh>,$string);

saving it as `j.pl` as utf-8, and

    日本語の言葉

saving it as `j.sjis.txt` as shift-jis. Running it:

    $ perl j.pl j.sjis.txt
    4

(yes, I used <pre> instead of <code>... but <code> would not let me use character entities, and without the proper characters this answer would be useless)

-- dakkar - Mobilis in mobile
Most of my code is tested... Perl is strongly typed, it just has very few types (Dan)
Re^2: Unicode problem by edis (Acolyte) on Jun 20, 2005 at 01:38 UTC
Sorry, it was my fault: the short lookup string was incorrect (I can't read Japanese, but need to work with it). Everything works fine now. Thanks all for the help.

Edvinas