japanese characters

alexiskb has asked for the wisdom of the Perl Monks concerning the following question:

Re: japanese characters by staunch (Pilgrim) on Sep 03, 2002 at 14:05 UTC
I've had success using the Jcode module when dealing with Japanese characters. Or, if you're already using Perl 5.8, you can take advantage of its Unicode support. Good luck, Staunch
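
For the Perl 5.8 route, one possibility is compatibility normalization. This is only a minimal sketch and not something staunch spelled out: it assumes the text has already been decoded into Perl characters, and it uses the Unicode::Normalize module that ships with Perl 5.8. The three sample code points are taken from the string discussed later in this thread.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKC);

# Three fullwidth Latin letters (U+FF53, U+FF49, U+FF41), built from
# their decimal code points.
my $wide = join '', map { chr } 65363, 65353, 65345;

# NFKC replaces compatibility characters, fullwidth forms included,
# with their ordinary equivalents.
my $narrow = NFKC($wide);

print "$narrow\n";
```

Note that NFKC folds many other compatibility characters as well (halfwidth katakana, ligatures and so on), so it may change more than you want if only the fullwidth ASCII range matters.
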
Re: japanese characters by BrowserUk (Patriarch) on Sep 03, 2002 at 19:27 UTC
Viewing that string using a full Unicode font (code2000.ttf from somewhere on the web, all 2MB of it!), the characters don't appear to be Japanese characters at all. Just regular ANSI characters encoded in their UTF-16 form. I've posted a .gif of what it looks like on my home node here in case that is of any help. Well It's better than the Abottoire, but Yorkshire!
Re: japanese characters by theorbtwo (Prior) on Sep 03, 2002 at 23:12 UTC
The first thing I tried was flattening out those HTML-escapes: `s/&(\d+);/chr $1/ge`. I then realized that printing that to the WinME console wouldn't do much good, so I put it into an HTML file, viewed it with Mozilla, and set the character encoding to UTF-8. I got ｓｉａｏｈｔａｌ.ｏm@blahblah.co.jp. And if you can view that, you could have read the original, so that approach isn't much good, but it gives us a starting point.

From there, I looked up what unicode.org has to say about the characters. The code charts are organized by hex codes, so convert one: the first, 65363, is U+FF53. So I went from there to Code Charts, and found that the greatest starting point less than FF53 was FF00, Halfwidth and Fullwidth Forms, and that FF53 is "FULLWIDTH LATIN SMALL LETTER S", which maps to U+0073, "s". In fact, FF01-FF5E are annotated "see ASCII 0020-007E", and they all map out nicely. Thus, we can take each of the high characters and map it to its non-wide low equivalent by subtracting 0xFF00 and adding 0x20.

The final code is: `$_='&65363;&65353;&65345;&65359;&65352;&65364;&65345;&65356;.&65359;m@blahblah.co.jp'; s/&(\d+);/chr($1-0xFF00+0x20)/ge; print $_;`

Oh, but we're forgetting something: these aren't the only cases where you might get `&\d+;` things, so anything outside the fullwidth range should be left as it was: `$_='&65363;&65353;&65345;&65359;&65352;&65364;&65345;&65356;.&65359;m@blahblah.co.jp'; s/&(\d+);/($1>0xFF00 && $1<0xFF5F) ? chr($1-0xFF00+0x20) : "&$1;"/ge; print $_;`

Update: This has a bug: it eats some of the ASCII at the end of your input string. I'll fix it after dinner.

Found it. Oldest in the book: I used double-quotes instead of single-quotes around an email address, so it tried to interpolate @blahblah... Had I run this with warnings, I would have gotten a nice warning telling me exactly what I was doing wrong. Fixed in the above.

Confession: It does an Immortal Body good.
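
The mapping described above can be wrapped up as a small subroutine. This is only a sketch under a few assumptions: the name `fold_fullwidth_refs` is made up for illustration, the optional `#` is accepted because HTML numeric references are normally written `&#NNNN;` (the sample string in this thread has none), and any reference outside U+FF01..U+FF5E is put back exactly as it was found.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is not from the thread): replace numeric
# references to fullwidth forms U+FF01..U+FF5E with the matching ASCII
# character, and leave every other reference untouched.
sub fold_fullwidth_refs {
    my ($text) = @_;
    $text =~ s{&(#?)(\d+);}{
        ($2 >= 0xFF01 && $2 <= 0xFF5E)
            ? chr($2 - 0xFF00 + 0x20)   # e.g. U+FF53 -> U+0073, "s"
            : "&$1$2;"                  # not a fullwidth form: keep as-is
    }ge;
    return $text;
}

# Single quotes, so that @blahblah is not interpolated as an array.
my $input  = '&65363;&65353;&65345;&65359;&65352;&65364;&65345;&65356;.&65359;m@blahblah.co.jp';
my $output = fold_fullwidth_refs($input);

print "$output\n";   # siaohtal.om@blahblah.co.jp
```

Running it under `use warnings` with a single-quoted input avoids the interpolation mistake confessed to in the update.
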
Re: Re: japanese characters by BrowserUk (Patriarch) on Sep 03, 2002 at 23:36 UTC
"And if you can view that, you could have read the original, so that approach isn't much good." That's why I posted a .gif of it on my home node. As for attempting to process the string, I was rather concerned that the last character of the email name was a standard ASCII character, when all the rest were UTF-16 encoded. You can see he has munged the main part of the domain name (for obvious reasons), but that made me wonder about the integrity of the rest of the string. Well It's better than the Abottoire, but Yorkshire!