Re: Character encoding in emails

If we add an emoji character to the body text, the queue sends it correctly. But the immediate sending script produces �� instead of the emoji.

There are a few tricks to check if your data is properly encoded. You can find some to test your database and your database connection in the tests t/40UnicodeRoundTrip.t and t/41Unicode.t of DBD::ODBC. They are not specific to DBD::ODBC; they should basically work with every database connection. Both use a little helper module, t/UChelp.pm. dumpstr() might be helpful: it shows the length of a string and whether its Unicode flag is set.

How the length() trick works: length() always counts characters. If you have the three bytes 0x41, 0x42, 0x43 in RAM for a string, length() is 3, no matter if the Unicode flag is set or not. It's literally just "ABC". If those three bytes are 0xE2, 0x9C, 0x88, and the Unicode flag is not set, length() is 3, the string is "\xE2\x9C\x88" (some three random garbage characters). If, for the same three bytes, the Unicode flag is set, length() is 1, the string is "\x{2708}", or just "✈", a single character encoded in UTF-8.
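The length() trick can be tried in isolation. The following is only a minimal sketch using core modules; describe() is a hypothetical stand-in written for this post, not the actual UChelp::dumpstr() code.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same three bytes, first as a plain byte string ...
my $bytes = "\xE2\x9C\x88";
# ... then decoded from UTF-8 into a Perl character string.
my $chars = decode('UTF-8', $bytes);

# Hypothetical helper in the spirit of dumpstr(): reports length, the
# Unicode flag, and the content with non-printable-ASCII hex-escaped.
sub describe {
    my ($s) = @_;
    my $escaped = join '', map {
        my $o = ord;
        ($o >= 0x20 && $o < 0x7F) ? chr($o) : sprintf('\\x{%X}', $o);
    } split //, $s;
    return sprintf 'length=%d utf8=%d "%s"',
        length($s), (utf8::is_utf8($s) ? 1 : 0), $escaped;
}

print describe($bytes), "\n";   # three separate "characters", flag off
print describe($chars), "\n";   # one character U+2708, flag on
```

If the string coming out of your database looks like the first line, it was never decoded; if it is longer than three and full of \x{C3}, \x{A2} and friends, it was encoded twice.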

UChelp::dumpstr() dumps the characters of the string, with anything outside the printable ASCII range hex-encoded, plus the length and the Unicode flag. That reliably shows you whether you have a proper Unicode string, random bytes that just happen to be valid UTF-8 but don't have the Unicode flag set, or just Mojibake. Trust me, while developing the initial Unicode patch for DBD::ODBC, I got a lot of random bytes and Mojibake and very few proper Unicode strings.

Some trivial things:

Dump %ENV from CGI, from cron, and from the command line, and search for differences in relevant variables (at least everything containing "PERL", "DBI", or "DBD" in the variable name). Use Data::Dumper and set $Data::Dumper::Sortkeys=1, then just print Dumper(\%ENV). I would expect the cron environment to be very minimal, the CGI environment about the same but littered with CGI variables, and the command line environment to be filled with tons of stuff you will never need.
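A rough sketch of that environment dump, using only core modules. relevant_env() is a hypothetical helper; the extra match on LANG and LC_* is my own assumption, added because locale settings often differ between cron and an interactive shell.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;   # stable key order, so the dumps can be diffed

# Hypothetical helper: keep only variables whose names suggest they
# influence Perl, DBI, a DBD driver, or (assumption) the locale.
sub relevant_env {
    my ($env) = @_;
    my %hit = map  { $_ => $env->{$_} }
              grep { /PERL|DBI|DBD|^LANG$|^LC_/ }
              keys %$env;
    return \%hit;
}

# Run this once from CGI, once from cron, once from the shell,
# save each output to a file, and diff the files.
print Dumper(relevant_env(\%ENV));
```

To see everything rather than the filtered subset, print Dumper(\%ENV) instead.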

In "Test code (was: Re: Character encoding in emails)", I don't see you setting a proper charset ("encoding" in Perl) for the mail. That means some or all of the software handling your mail will resort to guessing the charset or just insisting on some default (probably US-ASCII or ISO-8859-1), which is wrong for your content. Search the MIME::Lite documentation for the word "charset" to see how to properly communicate that your mail content is encoded in UTF-8. Especially look at "Working with UTF-8 and other character sets". I don't see any of the required code in your example.
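For illustration, here is a sketch of what declaring the charset could look like. It assumes MIME::Lite is installed; build_utf8_mail() is a hypothetical wrapper, and the addresses, subject, and body are placeholders. Check the MIME::Lite documentation before copying it, this is not taken from your code.

```perl
use strict;
use warnings;
use Encode qw(encode);
use MIME::Lite;

# Hypothetical wrapper: build a plain-text mail that announces UTF-8.
sub build_utf8_mail {
    my (%arg) = @_;
    my $msg = MIME::Lite->new(
        From     => $arg{from},
        To       => $arg{to},
        # Non-ASCII header text has to be MIME-encoded.
        Subject  => encode('MIME-Header', $arg{subject}),
        Type     => 'text/plain',
        Encoding => 'base64',
        # MIME::Lite does not encode for you: hand it UTF-8 bytes.
        Data     => encode('UTF-8', $arg{body}),
    );
    # Tell every mail handler which charset those bytes are in.
    $msg->attr('content-type.charset' => 'UTF-8');
    return $msg;
}

my $msg = build_utf8_mail(
    from    => 'sender@example.com',
    to      => 'recipient@example.com',
    subject => "Flight \x{2708} update",
    body    => "Boarding now \x{2708}\n",
);
print $msg->as_string;   # inspect the headers; $msg->send would deliver it
```

The point is the pair of steps: encode the character string to bytes yourself, and state the charset in the Content-Type header so nobody has to guess.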

Your example lacks the DBI->connect() call. MariaDB can use either DBD::mysql (not recommended) or DBD::MariaDB. DBD::mysql requires extra work to enable either MySQL's limited UTF-8 ("utf8", at most three bytes per character, so no emoji) or full UTF-8 ("utf8mb4"), whereas DBD::MariaDB tries to be smart and tries to differentiate between strings and binary data. RTFM (search for UTF), and show your DBI->connect() call.
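As a starting point for comparison with your own call, a sketch of the connect arguments for both drivers. connect_args() is a hypothetical helper, and the database name, host, and credentials are placeholders. I am assuming a DBD::mysql version that knows mysql_enable_utf8mb4; verify that against the documentation of the version you have installed.

```perl
use strict;
use warnings;

# Hypothetical helper: build the argument list for DBI->connect().
sub connect_args {
    my ($driver, %arg) = @_;
    my %attr = (RaiseError => 1, PrintError => 0, AutoCommit => 1);
    if ($driver eq 'mysql') {
        # DBD::mysql needs an explicit opt-in. mysql_enable_utf8 would
        # only cover the limited "utf8"; emoji need utf8mb4.
        $attr{mysql_enable_utf8mb4} = 1;
    }
    # For DBD::MariaDB no UTF-8 attribute is added here.
    my $dsn = "DBI:$driver:database=$arg{database};host=$arg{host}";
    return ($dsn, $arg{user}, $arg{password}, \%attr);
}

my @args = connect_args('MariaDB',
    database => 'test', host => 'localhost',
    user => 'someuser', password => 'secret');
print "$args[0]\n";

# With DBI, the driver, and a reachable server available:
# require DBI;
# my $dbh = DBI->connect(@args);
```

The table and column character sets on the server must also be utf8mb4, otherwise the emoji is lost before Perl ever sees it.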

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
