Character encoding in emails

Bod has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Character encoding in emails by hippo (Archbishop) on Dec 05, 2023 at 22:55 UTC
The two obvious questions: are you encoding the message somehow when you save it to the database and are you decoding the message somehow when retrieving it again? Can you keep round-tripping it without corruption? Then consider sending two identical messages to yourself via the different methods and comparing them, headers included. If they differ in any way other than in timestamps and message IDs then you can analyse that and dig deeper into what's going on. Don't forget the value of an SSCCE. Strip out everything you don't need and see if and when the problem is resolved. If it isn't, you can post the resulting SSCCE here for someone to look at. 🦛
Re: Character encoding in emails by jeffenstein (Hermit) on Dec 06, 2023 at 09:00 UTC
`use utf8;` only applies to the source code. Are you (or is the framework) decoding the text from the textarea? Are you encoding the body and setting the character set with MIME::Lite?
Re: Character encoding in emails by kcott (Archbishop) on Dec 06, 2023 at 23:03 UTC
G'day Bod, This sounds somewhat similar to what was discussed in "Re: uparse - Parse Unicode strings", and subsequently in "Decoding @ARGV [Was: uparse - Parse Unicode strings]". Without any code, that's pure guesswork on my part; but it may be worth a look. It's possible that your cron environment differs from the other one. There was some discussion of locale in the subthreads linked above. That could be another avenue for troubleshooting. -- Ken
Re: Character encoding in emails by afoken (Chancellor) on Dec 09, 2023 at 11:22 UTC
"If we add an emoji character to the body text, the queue sends it correctly. But the immediate sending script produces �� instead of the emoji."
There are a few tricks to check whether your data is properly encoded. You can find some to test your database and your database connection in the tests t/40UnicodeRoundTrip.t and t/41Unicode.t of DBD::ODBC. They are not specific to DBD::ODBC; they should basically work with every database connection. Both use a little helper module, t/UChelp.pm. `dumpstr()` might be helpful: it shows the length of a string and whether its Unicode flag is set.
How the `length()` trick works: `length()` always counts characters. If you have the three bytes 0x41, 0x42, 0x43 in RAM for a string, `length()` is 3, no matter whether the Unicode flag is set or not. It's literally just `"ABC"`. If those three bytes are 0xE2, 0x9C, 0x88, and the Unicode flag is not set, `length()` is 3 and the string is `"\xE2\x9C\x88"` (three random garbage characters). If, for the same three bytes, the Unicode flag is set, `length()` is 1 and the string is `"\x{2708}"`, or just "✈", a single character encoded in UTF-8.
`UChelp::dumpstr()` dumps the characters of the string, with anything outside the printable ASCII range hex-encoded, plus the length and the Unicode flag. That shows you reliably whether you have a proper Unicode string, random bytes that just happen to be UTF-8 but don't have the Unicode flag set, or just Mojibake. Trust me, while developing the initial Unicode patch for DBD::ODBC, I got a lot of random bytes and Mojibake and very few proper Unicode strings.
Some trivial things: Dump %ENV from CGI, from cron, and from the command line. Search for differences in relevant variables (at least everything containing "PERL", "DBI", or "DBD" in the variable name). Use Data::Dumper and set `$Data::Dumper::Sortkeys=1`, then just `print Dumper(\%ENV);`. I would expect the cron environment to be very minimal, CGI the same but littered with CGI environment variables, and the command line to be filled with tons of stuff you will never need.
In Test code (was: Re: Character encoding in emails), I don't see you setting a proper charset ("encoding" in Perl) for the mail. That means some or all of the software handling your mail will resort to guessing the charset or just insisting on some default (probably US-ASCII or ISO-8859-1), which is wrong for your content. Search the MIME::Lite documentation for the word "charset" to see how to properly communicate that your mail content is encoded in UTF-8. Especially look at "Working with UTF-8 and other character sets". I don't see any of the required code in your example.
Your example lacks the `DBI->connect()` call. MariaDB can use either DBD::mysql (not recommended) or DBD::MariaDB. DBD::mysql requires extra work to enable MySQL-limited UTF-8 and full UTF-8, whereas DBD::MariaDB tries to be smart and tries to differentiate between strings and binary data. RTFM (search for UTF), and show your `DBI->connect()` call.
Alexander -- Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
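The `length()` trick described above can be tried without any database. The following is a minimal sketch using only core modules; `dump_str()` is a hypothetical stand-in written for this example, not the actual `UChelp::dumpstr()` from DBD::ODBC, and its output format is an assumption.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (NOT UChelp::dumpstr): shows length, the internal
# UTF-8 flag, and every character outside printable ASCII as a hex escape.
sub dump_str {
    my ($s) = @_;
    my $visible = join '', map {
        my $o = ord;
        ($o >= 0x20 && $o <= 0x7E) ? chr($o) : sprintf('\x{%X}', $o);
    } split //, $s;
    return sprintf 'len=%d utf8=%d "%s"',
        length($s), (utf8::is_utf8($s) ? 1 : 0), $visible;
}

my $ascii = "ABC";                     # three bytes, three characters
my $bytes = "\xE2\x9C\x88";            # three raw octets, not decoded
my $chars = decode('UTF-8', $bytes);   # the same octets decoded to U+2708

print dump_str($ascii), "\n";
print dump_str($bytes), "\n";
print dump_str($chars), "\n";
```

Running the same helper on a value just before it is handed to the mailer, in both sending paths, should show whether one path carries decoded characters and the other raw bytes.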
Re: Character encoding in emails by soonix (Chancellor) on Dec 07, 2023 at 09:26 UTC
You could put some debug information, like that shown in "Re: Perl script works differently from Apache then CMD line", into the mail, and perhaps also show @INC, to spot differences there.
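A rough sketch of such a debug report, combining the %ENV and @INC suggestions from this thread. The function name `debug_report()` is made up for this example, and the filter pattern is an assumption: "PERL", "DBI" and "DBD" come from the advice above, while the locale variables (`LANG`, `LC_*`) are added here because locale was raised as a possible difference.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;   # stable ordering makes two dumps easy to diff

# Hypothetical helper: returns a text report that can be printed,
# logged, or appended to a test mail.
sub debug_report {
    my %relevant = map  { $_ => $ENV{$_} }
                   grep { /PERL|DBI|DBD|^LANG$|^LC_/ } keys %ENV;
    return join "\n",
        sprintf('Perl %s running %s', $], $0),
        'Relevant environment: ' . Dumper(\%relevant),
        '@INC: ' . Dumper(\@INC);
}

print debug_report();
```

Run it once from the web server, once from cron and once from the command line, then compare the three outputs.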
Re: Character encoding in emails by harangzsolt33 (Deacon) on Dec 06, 2023 at 01:54 UTC
So, let me ask this. Did you create a multi-part email message with HTML code and insert something like ☕ ☕ and it came out like a question mark instead? That may be just something on the recipient's end that is preventing the icon from showing up, OR you're using a font in the HTML email that does not have such characters on the recipient's end.
Re^2: Character encoding in emails by Bod (Parson) on Dec 07, 2023 at 00:08 UTC
In this case, the code came from ChatGPT, which inserted an emoji into its output. The lines of text from ChatGPT were wrapped in HTML tags and passed to the email sending processes. Identical input text with identical emojis was passed to both processes.
"That may be just something on the recipient's end"
For testing purposes, I am the recipient. On my mobile email client and on Outlook for desktop, emails sent through the queue display the emoji and ones sent directly don't.
Re^3: Character encoding in emails by NERDVANA (Priest) on Dec 07, 2023 at 23:09 UTC
So, then your questions to answer are:
- What was the encoding of the data you received from ChatGPT?
- Did you store it in a Unicode-aware database column type and with Unicode enabled on the DBI connection?
- Did you read it back out with Unicode enabled on the DBI connection?
- What was the encoding of the strings containing your HTML?
- What encoding did your MIME construction code expect?
My standard suggestion is to decode bytes into Unicode the moment they enter your program and keep them as Unicode until they leave it (or, in this case, until you construct the MIME, because MIME should always be 7-bit ASCII).
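The "decode on entry, encode on exit" advice can be sketched with the core Encode module. The sample data and variable names are invented for illustration; the last step deliberately shows the common mistake of encoding data that is already UTF-8 bytes, which is one way Mojibake is produced.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a form, an API, or a database driver that
# does no decoding: the UTF-8 encoding of U+1F680 followed by " launch".
my $input_bytes = "\xF0\x9F\x9A\x80 launch";

# 1. Decode once, at the point of entry. FB_CROAK dies on malformed input;
#    LEAVE_SRC keeps $input_bytes unmodified.
my $text = decode('UTF-8', $input_bytes,
                  Encode::FB_CROAK() | Encode::LEAVE_SRC());

# 2. Work on characters: substitutions, length checks and so on.
(my $greeting = "%FNAME%: $text") =~ s/%FNAME%/Andrew/g;

# 3. Encode once, at the point of exit.
my $output_bytes = encode('UTF-8', $greeting);

# The mistake to avoid: encoding a string that still holds UTF-8 bytes.
# Each high byte is treated as a Latin-1 character and grows to two bytes.
my $double_encoded = encode('UTF-8', $input_bytes);

printf "characters after decode: %d\n", length $text;
printf "bytes on input:          %d\n", length $input_bytes;
printf "bytes if encoded twice:  %d\n", length $double_encoded;
```

If the byte count of a message grows between two stages of the pipeline, a second encode somewhere in between is a likely suspect.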
Test code (was: Re: Character encoding in emails) by Bod (Parson) on Dec 09, 2023 at 01:09 UTC
To try and replicate the issue away from the full production scripts, I've written a test with the sending code copied directly from the production code. A few variables have changed names and a couple of other tweaks to make it run but the sending code is as faithful to production as I can get it. The sending mechanism is extremely similar, so it's unlikely to be that. The sending method is toggled in the code below by setting `$METHOD` true or false. Method true stores the message in a database table. The data types and DB engine are the same as the production DB. There's a bit of substitution and then it's sent. Method false bypasses storing and retrieving from the database. It does a similar substitution and sends. Running the script in a web environment sends email, in both cases, without the emojis being properly reproduced. So, I set up CRON to run the script...again, both methods produce emails with corrupted emojis. It appears that it's not the DB that's affecting it, and it's not the CRON vs web environment that's affecting it... I'm finding this really tricky to debug as I cannot think of a way of inspecting what is happening between the data being passed to MIME::Lite and it appearing in my email client. 
```perl
#!/usr/bin/perl
use CGI::Carp qw(fatalsToBrowser);
use strict;
use warnings;
use lib "/home/..../..../public_test/cgi-bin";

# Toggle sending method
my $METHOD = 0;

use MIME::Lite;
use incl::HTML;

if ($ENV{'GATEWAY_INTERFACE'}) {
    print "Content-type: text/plain\n\n";
}

$/ = undef;
my $subject = 'Testing email';
my $message = <DATA>;
my $fname   = 'Andrew';
my $sname   = 'Test';
my $tomail  = 'ian@l******t.co.uk';

if ($METHOD) {
    $dbh->do("CREATE TEMPORARY TABLE Email_Test (
        `from` VARCHAR(100) NULL DEFAULT NULL,
        `email` VARCHAR(100) NULL DEFAULT NULL,
        `subject` VARCHAR(100) NULL DEFAULT NULL,
        `message` BLOB NULL DEFAULT NULL
    ) ENGINE = MyISAM");

    $dbh->do("INSERT INTO Email_Test SET `from` = 'Bod', email = 'ian\@boddison.com', subject = ?, message = ?", undef, $subject, $message);

    my ($from, $email, $subject, $message) = $dbh->selectrow_array("SELECT `from`, email, subject, message FROM Email_Test LIMIT 1");

    $subject =~ s/%FNAME%/$fname/g;    # There are
    $subject =~ s/%SNAME%/$sname/g;    # more of
    $message =~ s/%FNAME%/$fname/g;    # these in
    $message =~ s/%SNAME%/$sname/g;    # production

    my $mail = MIME::Lite->new(
        To      => "\"$fname\" <$tomail>",
        From    => "\"$from\" <$email>",
        Subject => "$subject",
        Type    => 'text/html',
        Data    => $message,
    );

    my $check = $mail->send;
    print "Failed to send to $fname $sname ($tomail)\n" unless ($check);

} else {
    my $name = "$fname $sname";

    $subject=~s/%FNAME%/$fname/g;
    $subject=~s/%SNAME%/$sname/g;
    $message=~s/%FNAME%/$fname/g;
    $message=~s/%SNAME%/$sname/g;

    my $mime = MIME::Lite->new(
        From    => "\"Ian Boddison\"<ian\@boddison.com>",
        To      => "\"$name\"<$tomail>",
        Subject => $subject,
#        Type    => 'multipart/alternative',
#    );
#    $mime->attach(
#        Type    => 'text/plain',
#        Data    => $data{'plaintext'},
#    );
#    $mime->attach(
        Type    => 'text/html',
        Data    => $message,
    );

    my $chk;
    eval { $chk = $mime->send() };
    if ($@) {
        print "Error - $@\n";
    }
}

print "\nComplete...\n";

__DATA__
<html>
<body>
<p>🥁 <b>Introducing On Radar ‑ where quirk meets impact!</b> 🚀</p>
</body>
</html>
```
Re: Test code (was: Re: Character encoding in emails) by afoken (Chancellor) on Dec 09, 2023 at 10:41 UTC
You use $dbh without ever defining or initializing it. So that code can't compile, unless incl::HTML exports $dbh. I can't find incl::HTML on CPAN, so this example is not self-contained. Alexander
Re^2: Test code (was: Re: Character encoding in emails) by Bod (Parson) on Dec 09, 2023 at 14:37 UTC
Sorry - you are correct, of course... `incl::HTML` is an internal module that does some of the standard initialisation, including making a connection to the appropriate DB schema and setting `$dbh` as the DBI database handle.
Re: Test code (was: Re: Character encoding in emails) by NERDVANA (Priest) on Dec 09, 2023 at 10:53 UTC
So... for this example you don't actually have any Unicode characters at all, right? Your __DATA__ is using HTML entities written in ASCII, and all the literals are also ASCII. And the emoji still doesn't come through?
For debugging, you can always "view source" in your mail client (well, usually, or maybe you need to install Thunderbird). The source of the email should reveal if something broke your HTML.
When you get past this first hurdle though, and go back to using Unicode characters, you should:
- Decode the CGI parameters into Unicode strings.
- Make sure your DBI was connected with option `{ mysql_enable_utf8 => 1 }`.
- Declare your content types as `text/html; charset=UTF-8` or `text/plain; charset=UTF-8`.
- Put `use utf8;` at the top of the script if any of your literals have a non-ASCII character.
- Encode the headers (if there's any chance they contain non-ASCII).
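The content-type and header items in the checklist above might look like the following with MIME::Lite. This is a sketch, not the production fix: the addresses and text are placeholders, the choice of quoted-printable is an assumption made to keep the body 7-bit safe, and the message is printed instead of sent so the headers can be inspected.

```perl
use strict;
use warnings;
use Encode qw(encode);
use MIME::Lite;

# Character strings (already decoded); \x{...} escapes avoid needing "use utf8".
my $subject = "Testing email \x{1F680}";
my $html    = "<html><body><p>\x{1F941} <b>Hello</b></p></body></html>";

my $msg = MIME::Lite->new(
    From     => 'sender@example.com',                 # placeholder address
    To       => 'recipient@example.com',              # placeholder address
    Subject  => encode('MIME-Header', $subject),      # RFC 2047 encoded-word
    Type     => 'text/html',
    Encoding => 'quoted-printable',                   # 7-bit safe transfer encoding
    Data     => encode('UTF-8', $html),               # bytes, not characters
);

# Tell the recipient's software which charset those bytes are in.
$msg->attr('content-type.charset' => 'UTF-8');

# Inspect the result; call $msg->send once the headers look right.
print $msg->as_string;
```

Comparing this output with the "view source" of a corrupted mail should show whether the charset parameter or the transfer encoding is what differs.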
Re^2: Test code (was: Re: Character encoding in emails) by afoken (Chancellor) on Dec 09, 2023 at 11:41 UTC
"Make sure your DBI was connected with option { mysql_enable_utf8 => 1 }"
That won't help with DBD::MariaDB, see "UNICODE SUPPORT" in its fine manual. Alexander
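To make the driver difference concrete, here is a small sketch that only builds the arguments for `DBI->connect()` and does not open a connection. `unicode_connect_args()` is a hypothetical helper, and the database and host names are placeholders. It assumes DBD::mysql's `mysql_enable_utf8mb4` attribute for full 4-byte UTF-8 and passes no Unicode attribute for DBD::MariaDB; check each driver's documentation for the version you have installed.

```perl
use strict;
use warnings;

# Hypothetical helper: returns ($dsn, \%attr) suitable for DBI->connect().
sub unicode_connect_args {
    my ($driver, $database, $host) = @_;
    my %attr = (RaiseError => 1, PrintError => 0);

    if ($driver eq 'mysql') {
        # DBD::mysql needs an explicit opt-in; the mb4 variant covers emoji.
        $attr{mysql_enable_utf8mb4} = 1;
    }
    elsif ($driver ne 'MariaDB') {
        die "Unsupported driver: $driver\n";
    }
    # DBD::MariaDB: no Unicode attribute is set here; see "UNICODE SUPPORT"
    # in its manual for how it treats strings and binary data.

    return ("DBI:$driver:database=$database;host=$host", \%attr);
}

for my $driver (qw(mysql MariaDB)) {
    my ($dsn, $attr) = unicode_connect_args($driver, 'testdb', 'localhost');
    print "$dsn => ", join(', ', map { "$_=$attr->{$_}" } sort keys %$attr), "\n";
    # my $dbh = DBI->connect($dsn, $user, $password, $attr);  # needs DBI and credentials
}
```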
Re^3: Test code (was: Re: Character encoding in emails) by Bod (Parson) on Dec 09, 2023 at 14:41 UTC
Re: Character encoding in emails by Polyglot (Chaplain) on Dec 06, 2023 at 03:03 UTC
I assume you are sending to each method from separate webpages. Have you checked that both of the forms are set to accept utf8 encoding, something like this?
```html
<form name="myform" id="myform" accept-charset="utf-8" method="post" action="myperlscript.pl">
```
Assuming the forms, the HTML pages themselves, etc. all have identical configuration with respect to their meta tags' encoding, etc., you might then look at the difference between where the information is stored in the database, perhaps with the utf8mb4 encoding, and where the text gets passed for processing to be immediately sent. What encodings are involved in the hand-off?
One possible "gotcha": Watch out for setting the print handle to "hot", i.e. `$| = 1;`. This can sometimes have unexpected results. If in doubt, avoid using it.
But I would expect that the issue lies closer to what Perl itself is doing with your data. If it sends through the database, it is not outputting the data via STDOUT. The encodings between what goes to the DB versus what goes to STDOUT can, and often do, differ, unless you have explicitly set them the same. As you say the database route is working, it may be already set up correctly (or else it is set up so that the error gets corrected upon retrieval from the DB).
Note that `use utf8;` does NOT mean your Perl script will be set for working with utf8 in its input/output; it only tells Perl that the script's own source code is saved in utf8, and says nothing about the encoding of external sources. For other sources, you may wish to use one of the following:
```perl
use feature 'unicode_strings';
use open ':encoding(utf8)';          # DEAL WITH ALL FILES IN A UTF8 WAY
binmode STDIN,  ':utf8';             # MORE LIBERAL
binmode STDOUT, ':utf8';             # MORE LIBERAL
# - OR -
binmode STDIN,  ":encoding(UTF-8)";  # MORE SECURE, ESP. ON READ
binmode STDOUT, ":encoding(UTF-8)";  # MORE SECURE, ESP. ON READ
```
Blessings, ~Polyglot~
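What an `:encoding(UTF-8)` layer does can be observed without touching STDIN or STDOUT by using an in-memory filehandle. This is only an illustration with made-up variable names: it writes one character through the layer, looks at the bytes that came out, and reads them back.

```perl
use strict;
use warnings;

my $text   = "\x{2708}";   # one character: U+2708 AIRPLANE
my $buffer = '';           # will receive the encoded bytes

# Writing through the layer turns characters into UTF-8 bytes.
open my $out, '>:encoding(UTF-8)', \$buffer or die "open for write: $!";
print {$out} $text;
close $out;

printf "characters written: %d, bytes produced: %d (%s)\n",
    length $text, length $buffer,
    join ' ', map { sprintf '%02X', ord } split //, $buffer;

# Reading through the same layer turns the bytes back into characters.
open my $in, '<:encoding(UTF-8)', \$buffer or die "open for read: $!";
my $read_back = <$in>;
close $in;

print "round trip ", ($read_back eq $text ? "ok" : "FAILED"), "\n";
```

A handle without such a layer would pass the bytes through unchanged, which is why the two sending paths can behave differently even with identical input.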