How to support Unicode for Embeded Perl

nagamohan_p has asked for the wisdom of the Perl Monks concerning the following question:

Re: How to support Unicode for Embeded Perl by Juerd (Abbot) on Oct 09, 2007 at 11:23 UTC
I have no experience with embedding Perl, but if it is anything like normal Perl, it has Unicode strings that you can encode to UTF-8 on output. `bytes_to_utf8` converts "ASCII" (read: ISO-8859-1) to a Perl Unicode string. If you are absolutely sure that your string is valid UTF-8, you can just set its UTF8 flag. If you're not absolutely sure, decode the string in Perl: `$foo = decode("UTF-8", $foo)`. To learn about Unicode in Perl, please read perlunitut, perlunifaq and perhaps perluniadvice. Good luck!

Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }
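The decode-on-input, encode-on-output pattern described above might look like the following in plain Perl. This is a minimal sketch, not code from the thread: the byte string is an assumed sample (the UTF-8 bytes of the five-character greeting discussed later), and the `FB_CROAK` check is one way of handling the "not absolutely sure" case, since it makes invalid input die instead of passing through silently.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed sample input: the raw UTF-8 bytes of U+3053 U+3093 U+306B U+3061 U+306F.
my $bytes = "\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf";

# Decode bytes into a Perl character string. FB_CROAK dies on malformed UTF-8;
# LEAVE_SRC keeps $bytes intact while checking.
my $text = decode("UTF-8", $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);

printf "%d bytes in, %d characters out\n", length($bytes), length($text);

# Encode explicitly on the way out, writing raw bytes to the handle.
binmode STDOUT, ":raw";
print encode("UTF-8", $text), "\n";
```

The first line printed is `15 bytes in, 5 characters out`: after decoding, `length` counts characters rather than bytes.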
Re: How to support Unicode for Embeded Perl by graff (Chancellor) on Oct 09, 2007 at 12:55 UTC
What character encoding are you using when you write your "samp.pl" file? I don't know Japanese, but the only way I could get a seemingly correct display of your 5-character Japanese string was by setting my browser to use Shift-JIS, which is quite different from utf8.

I'm puzzled about the output result that you are reporting; it may be that some of the output bytes were not rendered as visible characters, and when you posted that string, some bytes might have been left out. In any case, for a 10-byte (5 shiftjis character) string to become a 28-byte (or longer?) string would probably require more than just the one call to `bytes_to_utf8()`. There may be more problems elsewhere in your code, involving more misunderstandings about character encodings.

I also don't know anything about Embedded Perl, so I wouldn't know whether you can `use Encode;` in that environment. If you can, then probably what you need to do is something like: `use Encode; binmode STDOUT, ":utf8"; $_ = decode( "shiftjis", "‚±‚с‚Й‚ї‚Н " ); print;`

Note that `Encode::decode( "shiftjis", "..." )` does something very different from what `bytes_to_utf8()` does. The latter (I expect) assumes that the string being passed as input is actually iso-8859-1, and converts it to utf8 accordingly. If the string is actually shiftjis, then the bytes are all being misinterpreted and the result will not be Japanese.

You should also make sure that the device you are printing to supports the display of utf8 characters. If it handles shiftjis, maybe you just want to skip the `bytes_to_utf8()` thing. There are very good reasons for converting to utf8, especially when dealing with Asian text data (e.g. it's much better to do regex matches, substitutions, `index()`, `substr()`, `length()` etc. with character semantics rather than byte semantics), and in general, switching to unicode is just a good idea anyway. But if your display device gives you a choice, and you are just pushing strings to a display, maybe you don't need utf8.

It's good to have some diagnostic tools when working with unicode data, to make sure the data really is unicode, and to know what's in it. I've posted a couple of tools here at the Monastery that might be helpful for you: tlu -- TransLiterate Unicode and unichist -- count/summarize characters in data.
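To make the difference between the two conversions concrete, here is a small sketch in plain Perl. It assumes the posted string really is the ten Shift-JIS bytes written out below, and it uses `decode("ISO-8859-1", ...)` as a stand-in for what a Latin-1-to-UTF-8 upgrade effectively does; it does not call the C-level `bytes_to_utf8()` itself.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the script's string is these ten Shift-JIS bytes (five characters).
my $sjis = "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd";

# Misreading: every byte is treated as one ISO-8859-1 character.
my $wrong = decode("ISO-8859-1", $sjis);

# Intended reading: byte pairs are decoded as Shift-JIS characters.
my $right = decode("shiftjis", $sjis);

printf "as Latin-1:   %2d chars, %2d UTF-8 bytes\n",
    length($wrong), length(encode("UTF-8", $wrong));
printf "as Shift-JIS: %2d chars, %2d UTF-8 bytes\n",
    length($right), length(encode("UTF-8", $right));
```

Under these assumptions the Latin-1 reading yields 10 characters and 20 UTF-8 bytes, none of them Japanese, while the Shift-JIS reading yields 5 characters and 15 UTF-8 bytes. The `length` calls on the decoded strings also show the character semantics mentioned above.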
Re^2: How to support Unicode for Embeded Perl by Anonymous Monk on Oct 10, 2007 at 12:28 UTC
Thank you for your prompt response. The characters you saw are not exact (they are the Japanese characters for "Hello"); this browser doesn't show exactly what I typed/copied. To get the right characters, please use an online translator (http://babelfish.altavista.com/).

For "Embedding Perl interpreter in C program", please refer to the following: http://72.14.235.104/search?q=cache:ycFwwiAguTgJ:search.cpan.org/perldoc%3Fperlembed+embeded+Perl&hl=en&ct=clnk&cd=3&gl=in

We define some APIs for our application. For print in Perl, we have "DispText". sample.pl contains the following: `ShowText("‚±‚с‚Й‚ї‚Н"); # print "Hello";`

After the .pl file is parsed, that string should be stored in a `wchar*` variable, which will be displayed by my VC++ application. When it is parsed, I'm not getting the exact Unicode characters through either of the following calls: `bytes_to_utf8((U8*)SvPV(ST(0), Len), &Len);` or `sv_utf8_upgrade(ST(0));`

Regards, nag
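One possible way to get from the script's bytes to something a `wchar*` consumer could use is to do the conversion on the Perl side before the string reaches the C API. The sketch below rests on several assumptions not stated in the thread: that the script file is saved as Shift-JIS, that the target is Windows, where `wchar_t` is a 16-bit little-endian unit, and that a helper like the hypothetical `to_wide` would be called before handing the buffer to the application.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not part of the poster's API): convert Shift-JIS bytes
# into a UTF-16LE buffer ending in a 16-bit NUL terminator.
sub to_wide {
    my ($sjis_bytes) = @_;
    my $text = decode("shiftjis", $sjis_bytes);   # bytes -> characters
    return encode("UTF-16LE", $text . "\0");      # characters -> 16-bit units
}

# Assumed input: the ten Shift-JIS bytes of the greeting.
my $wide = to_wide("\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd");

# Show the 16-bit units the C side would see.
print join(" ", map { sprintf "%04x", $_ } unpack("v*", $wide)), "\n";
```

This prints `3053 3093 306b 3061 306f 0000`. Whether the C side can take such a buffer directly depends on how the application's API reads its argument, which the thread does not show.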
Re: How to support Unicode for Embeded Perl by graff (Chancellor) on Oct 10, 2007 at 13:18 UTC
If the AnonymousMonk replying to my previous post was you, please pay attention to the following points, to make your use of PerlMonks more effective for everyone concerned:

- Don't forget to log in on your user account before posting. When someone replies to a node posted by nagamohan_p, you will get notification that you have a reply. If I reply to a post by AnonymousMonk, you don't get notified.
- Please read Writeup Formatting Tips and What shortcuts can I use for linking to other information? -- in general, browse the sections of the PerlMonks FAQ, especially the sections about Posting and Linking. You need to start using `<code>` tags for code and data, as well as square brackets for links.
- When responding to advice, please show some evidence that you have actually tried to follow the advice, and if things are still not working right, show some new details about the problem (a different data sample, results from testing things a different way, etc).
- If you don't understand something in the advice you have been given, please try to be clear about what it is you don't understand, and ask for clarification. Don't just repeat the original question.

As it is, the anonymous post does not contain any information that moves the discussion forward to a solution. Whatever the problem is with your browser in terms of showing you the Japanese word for "Hello", my browser (firefox on macosx) has no problem with it -- it's just that the data, as you posted it, is in Shift-JIS encoding, not utf8. (BTW, the Japanese word in question, when translated to English by babelfish.altavista.com, comes out as "today"; but when they translate English "hello" to Japanese, it comes out as that same word. Go figure.)

In case you have not tried using the two diagnostic tools that I cited in my earlier reply, you will probably need to do that before you can make further progress. If the data in your script (or in some data file) really is encoded as utf8, the "tlu" script will show you the unicode hex code points for each character, and you can post the string in that form if you are still having problems. For example, translating English "hello" to Japanese (and piping the result through tlu) yields the following string of unicode characters: `\x{3053}\x{3093}\x{306b}\x{3061}\x{306f}`

You can look those up at http://www.unicode.org/charts/ and know for sure that we are talking about the right string. If you don't get those code points when you run your code/data through tlu, it means you are not using utf8 encoding, and the `bytes_to_utf8()` function will not help you with that.

Update: is this the link you meant to point to, regarding Embedded Perl? ~~Embperl::Intro -- or~~ maybe this one: perlembed
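For readers without the tlu script to hand, a rough stand-in for the same kind of check can be written in a few lines of Perl. This is not the tlu script itself: `codepoints` and `is_utf8_bytes` are hypothetical helper names, and both byte strings are assumed samples of the greeting in UTF-8 and in Shift-JIS.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: render each character of a decoded string as \x{....}.
sub codepoints {
    my ($text) = @_;
    return join "", map { sprintf '\x{%04x}', ord($_) } split //, $text;
}

# Hypothetical helper: true if the bytes decode cleanly as UTF-8.
sub is_utf8_bytes {
    my ($bytes) = @_;
    my $ok = eval {
        decode("UTF-8", $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 1 : 0;
}

# Assumed samples: the same greeting as UTF-8 bytes and as Shift-JIS bytes.
my $utf8_bytes = "\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf";
my $sjis_bytes = "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd";

print codepoints(decode("UTF-8", $utf8_bytes)), "\n";
printf "UTF-8 sample valid: %d, Shift-JIS sample valid: %d\n",
    is_utf8_bytes($utf8_bytes), is_utf8_bytes($sjis_bytes);
```

The first line of output is `\x{3053}\x{3093}\x{306b}\x{3061}\x{306f}`, matching the code points quoted above. The second line reports 1 for the UTF-8 sample and 0 for the Shift-JIS one, because the Shift-JIS bytes are not well-formed UTF-8.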