sapphirecat has asked for the wisdom of the Perl Monks concerning the following question:

Update #2, solved: It turns out that Email::MIME is encoding-aware; I just missed it in my haste to use it like MIME::Lite. Also, I stumbled upon Email::MIME::CreateHTML. I don't need it myself, but it looks like another interesting option for constructing email.

There are a bunch of method pairs of the form foo and foo_str, which do the same thing, except that foo takes an encoded string as-is; in contrast, foo_str takes a decoded string, and encodes it for you (if you have a charset and encoding chosen).

Now, a code demonstration of where I've come to:

#!/usr/bin/perl -w
# For correct results, use on a terminal expecting utf-8
use strict;
use Email::MIME;

sub new_part ($$) {
    my ($type, $body) = @_;
    my ($part) = Email::MIME->create(
        attributes => {
            content_type => $type,
            charset      => 'utf-8',
            encoding     => '8bit',
        },
        body_str => $body,
    );
    # header_set with no value removes that header from the subpart
    do { $part->header_set($_) } foreach qw/Date MIME-Version/;
    return $part;
}

# Normally, this would be actual utf-8 under "use utf8"
my $text = "Not latin: \x{30ab}\x{30bf}\x{30ab}\x{30ca}";
my $html = "<p><i>$text</i></p>";

my $m = Email::MIME->create(
    header_str => [
        To      => 'a@example.com',
        From    => 'b@example.com',
        Subject => 'Test',
    ],
    attributes => {
        content_type => 'multipart/alternative',
    },
    parts => [
        new_part('text/html', $html),
        new_part('text/plain', $text),
    ],
);

print $m->as_string;

Original question follows:

O Monks, I want to generate MIME email. I want to hand Unicode strings into the generator, and I want to get a byte string (an encoded string) back when I call as_string() or equivalent. I believe this is the only sensible thing to do, since the Content-Type defines a byte encoding of the data, and this byte encoding MAY vary by part, according to each part's individual Content-Type specification. Thus, encoding a Unicode string returned from as_string() after the fact would apply a single encoding to every part and produce the wrong result.
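To make the per-part concern concrete, here is a minimal sketch using only the core Encode module. The two-part layout and the variable names are made up for illustration; it only compares octet counts and does not build a real message.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical two-part message: each entry is [declared charset, decoded text].
my @parts = (
    [ 'iso-8859-1', "caf\x{e9}" ],
    [ 'utf-8',      "\x{30ab}\x{30bf}\x{30ab}\x{30ca}" ],
);

# Encode every body with its own declared charset, then join the octets.
my $per_part = join '', map { encode($_->[0], $_->[1]) } @parts;

# The tempting shortcut: join the characters and encode once at the end.
my $whole = encode('utf-8', join '', map { $_->[1] } @parts);

printf "per-part: %d octets, whole-message: %d octets\n",
    length $per_part, length $whole;
```

The two results differ because the single final encode turns the Latin-1 part's U+00E9 into two octets, which no longer matches that part's declared charset.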

Neither MIME::Lite (basically deprecated now) nor Email::MIME seems to fit this desire. They issue a "wide char in print" warning on a :raw filehandle for my test with embedded katakana, which means they returned a non-encoded Unicode string, as I understand it. The documentation for MIME::Entity does not look promising either. Is there something else to try, or shall I give up and have callers lovingly byte-encode everything (and set its charset) on the way into Email::MIME? (I haven't even started looking at properly encoding headers like Subject yet; advice there would also be appreciated.)

Some broader context about what I'm trying to achieve, in case I'm doing it beyond wrong: I want to either pass the email to encode_base64 for packing into an Amazon SES API call, or I want to give it to /usr/sbin/sendmail -oi, most likely via :raw filehandle, if SES is over quota. encode_base64 is only defined over byte strings, so I need a correctly-encoded byte string regardless. (I want to use the SES API over their SMTP support so that I can get better errors, and check the quota/rate limit in advance.)

Update: some code follows, per request by anonymous.

#!/usr/bin/perl -W
# vim:fileencoding=utf-8
# PuTTY option: Remote character set = UTF-8
# my locale: en_US.UTF-8 (LANG and all LC_* except LC_ALL="")
use warnings;
use strict;
use utf8;
use MIME::Lite;
use MIME::Base64;

my $m = MIME::Lite->new(
    To      => 'a@example.com',
    From    => 'b@example.net',
    Subject => 'Test',
    Type    => 'TEXT',
    # Perlmonks safe encoding with same result
    Data    => "Not latin: \x{30ab}\x{30bf}\x{30ab}\x{30ca}\n",
);

my $s = $m->as_string;
print "UTF-8 flag: ", utf8::is_utf8($s), "\n";

binmode(STDOUT, ':raw');
print $s;                 # warns: wide char in print
print encode_base64($s);  # dies: wide char in sub entry

If I set stdout to ':utf8', then perl knows how to encode the Unicode string $s for printing, and the warning goes away. If I use Encode; print encode_base64(encode_utf8($s)); then that prevents encode_base64 from dying. However, this would improperly encode any text/* part that had a non-utf8 charset defined, including those parts which have the charset undefined, which is the default.
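Both halves of that paragraph can be reproduced without MIME::Lite. The sketch below uses only the core Encode and MIME::Base64 modules; the Latin-1 body is an invented stand-in for a part that was already encoded under a different charset.

```perl
use strict;
use warnings;
use Encode qw(encode encode_utf8);
use MIME::Base64 qw(encode_base64 decode_base64);

my $s = "Not latin: \x{30ab}\x{30bf}\x{30ab}\x{30ca}\n";

# encode_base64 is defined over octets only; a wide-character string croaks.
my $ok = eval { encode_base64($s); 1 };
print "character string: ", ($ok ? "encoded" : "croaked"), "\n";

# Encoding to UTF-8 octets first makes the call well-defined.
my $b64 = encode_base64(encode_utf8($s));
print $b64;

# The flaw: octets already encoded for another charset get encoded again.
my $latin1_body = encode('iso-8859-1', "caf\x{e9}");  # 4 octets, labelled iso-8859-1
my $reencoded   = encode_utf8($latin1_body);           # 5 octets, wrong for that label
printf "latin-1 body: %d octets, after encode_utf8: %d octets\n",
    length $latin1_body, length $reencoded;
```

The last two lines show why a blanket encode_utf8 over the finished message is unsafe: the E9 octet of the Latin-1 part becomes C3 A9, while the part's header still claims iso-8859-1.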

One last thing: when MIME::Lite talks about "this module will encode your message data for you" it means Content-Transfer-Encoding, binary/7bit/8bit etc. Nothing to do with character encoding.

"Basically, displaying invisible data is not maintainable."

Re: Encoding/charset Aware MIME Email Generation
by Anonymous Monk on Jan 18, 2012 at 03:40 UTC
Re: Encoding/charset Aware MIME Email Generation
by Corion (Patriarch) on Jan 18, 2012 at 13:50 UTC

    I would assume that you will have to manually construct each MIME part and encode its body to the encoding you want. Encoding to utf8 (via encode('UTF-8', $payload)) and setting the matching charset in the part's Content-Type header (plus a suitable Content-Transfer-Encoding) should work, because afterwards you only have an octet stream to deal with.
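A rough sketch of that manual approach, using only core modules rather than Email::MIME. The helper name text_part_octets is hypothetical and not part of any library, and base64 is just one possible transfer encoding; the point is that the character encoding step and the transfer encoding step are separate.

```perl
use strict;
use warnings;
use Encode qw(encode);
use MIME::Base64 qw(encode_base64);

# Hypothetical helper (not from any module): returns the octets of one text
# part, with charset and transfer encoding declared in its headers.
sub text_part_octets {
    my ($subtype, $charset, $text) = @_;

    # Character encoding step: characters -> octets in the declared charset.
    # FB_CROAK dies on unmappable characters; LEAVE_SRC keeps $text intact.
    my $octets = encode($charset, $text,
        Encode::FB_CROAK() | Encode::LEAVE_SRC());

    # Transfer encoding step: octets -> 7-bit-safe ASCII lines.
    return join "\r\n",
        qq{Content-Type: text/$subtype; charset="$charset"},
        "Content-Transfer-Encoding: base64",
        "",
        encode_base64($octets, "\r\n");
}

my $part = text_part_octets('plain', 'UTF-8',
    "Not latin: \x{30ab}\x{30bf}\x{30ab}\x{30ca}\n");
print $part;
```

Because the result contains only ASCII octets, it can be printed to a :raw filehandle or passed on to encode_base64 without a wide-character warning.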

      Thanks. I will do that.

      Upon further reflection, I don't think I can get what I want. Since strings are always characters, no module can tell for certain what it should do with one containing sequences of code points that would be valid utf-8: those sequences can be either multiple latin-1 characters, or an individual character encoded as utf-8. Doing the right thing for knowledgeable users will probably mangle the data of everyone else even more.
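The ambiguity described here fits in a few lines. This is only an illustration with core Encode; the two-code-point string is an arbitrary example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two code points, both below 0x100, with no record of where they came from.
my $s = "\x{c3}\x{a9}";

# Reading 1: the string already holds characters (two Latin-1 letters).
my @as_chars = map { sprintf 'U+%04X', ord } split //, $s;

# Reading 2: the string holds UTF-8 octets that still need decoding.
my @as_utf8 = map { sprintf 'U+%04X', ord } split //, decode('UTF-8', $s);

print "as characters:   @as_chars\n";
print "as UTF-8 octets: @as_utf8\n";
```

Both readings are valid for the same Perl string, so a module cannot pick one safely; the caller has to say which it is, which is what a foo versus foo_str style of API does.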

      Does that sound like enlightenment, or confusion?

      "Basically, displaying invisible data is not maintainable."