The main problem in handling Unicode is giving each piece of code the type of input it expects. In other words, you need to keep track of whether you have a string of bytes or a string of Unicode characters.

Perl's MySQL driver is smart enough to know whether you are reading a column of characters or a column of bytes. In your case, you are breaking that assumption by having UTF-8 bytes stored in a column that is declared as Latin-1.

The most correct fix is to change the character set declared in the database so that it matches the data, but as you already know, this can be hard and may have lots of other consequences.

The most correct workaround for the database being declared as the wrong type is to convert the string on the line right after the data enters your program. So, instead of fixing it inside DBI, fix it immediately after DBI.

    ...
    my $data = $dbh->selectall_arrayref($sql, { Slice => {} });
    for (@$data) {
        # This is a workaround for the incorrect
        # character set declared on the table.
        utf8::decode($_->{my_textcol});
    }
    ...

Leave a comment for your future self or the one who comes after you. Note that I used utf8::decode. This one is built into perl, modifies its argument in place, and returns false (rather than dying) if it fails. That way, if someone does fix the database in the future, this code won't start crashing on you.
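To make that behaviour concrete, here is a small sketch of what utf8::decode does with three kinds of input. The sample strings are made up for illustration and stand in for whatever DBI would hand back; no database is involved.

```perl
use strict;
use warnings;

# Hypothetical sample values, standing in for column data returned by DBI.
my $utf8_bytes = "\xF0\x9F\x98\x80";   # U+1F600 as four UTF-8 bytes
my $latin1     = "caf\xE9";            # Latin-1 bytes, not well-formed UTF-8
my $already    = "\x{1F600}";          # already a character string, as if
                                       # the database had been fixed

# utf8::decode works in place and reports success with its return value.
my $ok_bytes   = utf8::decode($utf8_bytes);
my $ok_latin1  = utf8::decode($latin1);
my $ok_already = utf8::decode($already);

printf "utf8 bytes: ok=%d length=%d\n", ($ok_bytes   ? 1 : 0), length $utf8_bytes;
printf "latin-1:    ok=%d length=%d\n", ($ok_latin1  ? 1 : 0), length $latin1;
printf "characters: ok=%d length=%d\n", ($ok_already ? 1 : 0), length $already;
```

Only the first call changes its argument. The other two return false and leave the string alone, which is why the workaround stays harmless once the column is declared correctly.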

Now you will have true Unicode characters in ->{my_textcol}, which you can pass to anything that expects Unicode strings. This does NOT include any of the system print/write functions. All system input/output must be bytes, and those functions will complain or die if you give them wide characters. I presume that 'send_utf8' is a method that expects Unicode characters as input and encodes them as UTF-8 before writing to the websocket.
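As a rough sketch of the "bytes at the I/O boundary" rule, here are two common ways to turn characters back into bytes before writing: encode explicitly, or put an encoding layer on the filehandle. The sample string and the in-memory filehandle are only there for illustration; in real code the handle would be a file, socket, or STDOUT.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical decoded value, as it would look after utf8::decode succeeded.
my $chars = "smile \x{1F600}";

# Option 1: encode explicitly, then write the bytes to a raw handle.
my $bytes = encode('UTF-8', $chars);

# Option 2: let an encoding layer do the conversion on output.
# An in-memory handle is used here so the result can be inspected.
open(my $fh, '>:encoding(UTF-8)', \my $buffer) or die "open failed: $!";
print {$fh} $chars;
close($fh) or die "close failed: $!";

printf "characters: %d, encoded bytes: %d, layer wrote: %d bytes\n",
    length $chars, length $bytes, length $buffer;
```

Either way works; the point is to pick one per handle. Doing both on the same handle encodes the data twice.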

Also, note that your original diagnosis of "it works in the terminal" came about because you read the UTF-8 bytes from the database and then wrote them to a terminal that expected UTF-8 bytes. If you fix your database encoding, the strings will come in as characters and then generate warnings when you print them, because you would be writing wide characters to STDOUT. Long story short, you always need to keep track of whether your variables hold characters or bytes throughout the flow of the program, and knowingly convert them to the correct form before I/O or text processing.
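Here is a sketch of how the same four bytes can look fine in one place and broken in another. It assumes, as presumed above, that the sending side encodes its input as UTF-8. The variable names and the sample value are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical value read from the mis-declared column: UTF-8 bytes for U+1F600.
my $from_db = "\xF0\x9F\x98\x80";

# Printed unchanged to a UTF-8 terminal, these four bytes render as the emoji,
# which hides the fact that Perl still sees four separate "characters".

# An encoder treats each byte as a character in the range U+0080..U+00FF
# and encodes it again, producing two bytes per original byte.
my $double_encoded = encode('UTF-8', $from_db);

# Decoding first gives one real character, and encoding that character
# yields the original four bytes again.
my $chars   = decode('UTF-8', $from_db);
my $correct = encode('UTF-8', $chars);

printf "from db: %d bytes, double-encoded: %d bytes, decoded: %d character\n",
    length $from_db, length $double_encoded, length $chars;
```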


In reply to Re^3: Encoding of emoji character by NERDVANA
in thread Encoding of emoji character by dcunningham
