What character encoding are you using when you write your "samp.pl" file? I don't know Japanese, but the only way I could get a seemingly correct display of your 5-character Japanese string was by setting my browser to use Shift-JIS, which is quite different from utf8.

I'm puzzled by the output that you are reporting; it may be that some of the output bytes were not rendered as visible characters, so some bytes might have been left out when you posted that string. In any case, for a 10-byte (5 shiftjis character) string to become a 28-byte (or longer?) string would probably require more than just the one call to bytes_to_utf8(). There may be more problems elsewhere in your code, involving more misunderstandings about character encodings.

I also don't know anything about Embedded Perl, so I can't say whether "use Encode;" works in that environment. If it does, then what you probably need is something like:

    use Encode;
    # send decoded characters to STDOUT as utf8
    binmode STDOUT, ":utf8";
    # assumes the script file itself is saved as shiftjis
    $_ = decode( "shiftjis", "こんにちは " );
    print;

Note that Encode::decode( "shiftjis", "..." ) does something very different from what bytes_to_utf8() does. The latter (I expect) assumes that the string being passed as input is actually iso-8859-1, and converts it to utf8 accordingly. If the string is actually shiftjis, then the bytes are all being misinterpreted and the result will not be Japanese.
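
To make that difference concrete, here is a minimal sketch. I don't have bytes_to_utf8() handy, so decode( "iso-8859-1", ... ) stands in for what I assume it does; that stand-in is my guess, not something I've checked against Embedded Perl. The Japanese string is spelled out as hex escapes for its shiftjis bytes, so the encoding of the script file doesn't matter here.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "こんにちは" as raw shiftjis bytes: 5 characters, 2 bytes each.
my $sjis = "\x82\xB1\x82\xF1\x82\xC9\x82\xBF\x82\xCD";

# Correct: interpret the bytes as shiftjis.
my $right = decode( "shiftjis", $sjis );

# Assumed stand-in for bytes_to_utf8(): treat every byte as one
# iso-8859-1 character (hypothetical; the real function may differ).
my $wrong = decode( "iso-8859-1", $sjis );

for my $case ( [ right => $right ], [ wrong => $wrong ] ) {
    my ( $label, $str ) = @$case;
    printf "%s: %d characters, %d bytes as utf8, starts with U+%04X\n",
        $label, length($str), length( encode( "UTF-8", $str ) ),
        ord($str);
}
```

The shiftjis decode gives 5 characters starting with U+3053 (HIRAGANA LETTER KO). The iso-8859-1 reading gives 10 characters starting with U+0082, a control character, which is not Japanese at all.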

You should also make sure that the device you are printing to can display utf8 characters. If it handles shiftjis, maybe you just want to skip the "bytes_to_utf8()" step. There are very good reasons for converting to utf8, especially when dealing with Asian text data: regex matches, substitutions, index(), substr(), length() etc. work much better with character semantics than with byte semantics. In general, switching to unicode is a good idea anyway. But if your display device gives you a choice, and you are just pushing strings to a display, maybe you don't need utf8.
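
Here is a small sketch of what "character semantics rather than byte semantics" buys you. The strings are again written as shiftjis hex escapes; the second one ("表") is my own example, picked because its second shiftjis byte is 0x5C, the same value as an ASCII backslash.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "こんにちは" and "表" as raw shiftjis bytes.
my $hello_bytes = "\x82\xB1\x82\xF1\x82\xC9\x82\xBF\x82\xCD";
my $hyou_bytes  = "\x95\x5C";    # trailing byte 0x5C equals an ASCII backslash

# The same text as decoded (character) strings.
my $hello = decode( "shiftjis", $hello_bytes );
my $hyou  = decode( "shiftjis", $hyou_bytes );

# length() counts bytes in one case and characters in the other.
printf "length: %d (bytes) vs %d (characters)\n",
    length($hello_bytes), length($hello);

# substr() on bytes returns half of a character.
printf "first unit: 0x%02X (byte) vs U+%04X (character)\n",
    ord( substr( $hello_bytes, 0, 1 ) ), ord( substr( $hello, 0, 1 ) );

# index() on bytes finds a backslash that is not really in the text.
printf "index of backslash: %d (bytes) vs %d (characters)\n",
    index( $hyou_bytes, "\\" ), index( $hyou, "\\" );
```

With byte strings, length() reports 10, substr() hands back a lone 0x82, and index() "finds" a backslash at offset 1. With decoded strings you get 5, U+3053, and -1, which is what you would expect from the text itself.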

It's good to have some diagnostic tools when working with unicode data, to make sure the data really is unicode, and to know what's in it. I've posted a couple of tools here at the Monastery that might be helpful for you: tlu -- TransLiterate Unicode and unichist -- count/summarize characters in data.
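
If you just want a quick check without downloading anything, something along these lines may do. To be clear, this is not the code of tlu or unichist, only a rough sketch in the same spirit; the helper names is_valid_utf8() and describe_chars() are made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use charnames ();

# Return 1 if the byte string is well-formed utf8, 0 otherwise.
sub is_valid_utf8 {
    my ($octets) = @_;    # work on a copy: decode may modify its argument when CHECK is set
    return eval { decode( "UTF-8", $octets, Encode::FB_CROAK() ); 1 } ? 1 : 0;
}

# Return one "U+XXXX NAME" line per character of a decoded string.
sub describe_chars {
    my ($string) = @_;
    return map {
        sprintf "U+%04X %s", ord($_), charnames::viacode( ord($_) ) // "(no name)"
    } split //, $string;
}

# "こんにちは" as raw shiftjis bytes, then decoded to characters.
my $sjis = "\x82\xB1\x82\xF1\x82\xC9\x82\xBF\x82\xCD";
my $text = decode( "shiftjis", $sjis );

printf "shiftjis bytes valid as utf8?   %s\n",
    is_valid_utf8($sjis) ? "yes" : "no";
printf "re-encoded bytes valid as utf8? %s\n",
    is_valid_utf8( encode( "UTF-8", $text ) ) ? "yes" : "no";
print "$_\n" for describe_chars($text);
```

The shiftjis bytes fail the utf8 check, the re-encoded string passes, and the listing shows the five hiragana code points by name (starting with U+3053 HIRAGANA LETTER KO).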


In reply to Re: How to support Unicode for Embeded Perl by graff
in thread How to support Unicode for Embeded Perl by nagamohan_p
