Wide characters in e-mail

lghansen has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am trying to send a plain text e-mail in perl that has utf-8 characters in it. I don't have any problems with the characters, they are as expected so I know that the utf-8 string is ok, but I can't figure out how to signal the e-mail module so that it works properly with utf-8. If I use Mail::Sendmail, the program halts with the following error: Wide character in subroutine entry at C:/Perl/site/lib/Mail/Sendmail.pm line 237. Here is the test code I have written:

use HTML::Entities;
use strict;

my $input = '&#1053;&#1077; &#1087;&#1088;&#1086;&#1089;&#1090;&#1086;';
$input = decode_entities($input);

# Sendmail stuff
my %mail;
$mail{'Content-type'} = 'text/plain; charset="utf-8"';
$mail{smtp} = 'smtp.mycompany.com';
$mail{From} = 'server@mycompany.com';
$mail{To} = 'me@mycompany.com';
$mail{Subject} = 'Sendmail Test';
$mail{Message} = $input;

use Mail::Sendmail;
Mail::Sendmail::sendmail(%mail) || print STDERR $Mail::Sendmail::error;

If I replace the sendmail code with Mail::Sender::Easy, I only get a warning, and the e-mail is fine, but I'd prefer not to have the warning: Wide character in print at C:/Perl/site/lib/Mail/Sender.pm line 1767, <GEN0> line 14.

# Sender setup 
use Mail::Sender::Easy qw(email);
email({
    smtp => 'smtp.mycompany.com',
    from => 'server@mycompany.com',
    to => 'me@mycompany.com',
    subject => 'Sender Test',
    charset => 'utf-8',
    _text => $input,
}) || print STDERR "email() failed: $@";

Any ideas how to get either module to work without an error or warning? Thanks, Lisa


Re: Wide characters in e-mail by pc88mxer (Vicar) on May 01, 2008 at 17:19 UTC
I think simply encoding `$input` will do the trick: `use Encode; my $bytes = encode('utf8', $input); email( { ... charset => 'utf8', _text => $bytes, }) ...`
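To make the encoding step concrete, here is a minimal sketch using only core modules. It builds the same Cyrillic text as the question with `\x{...}` escapes instead of `decode_entities` (an assumption, so that it runs without HTML::Entities or an SMTP server), then compares the character string with its UTF-8 byte form. The variable names are illustrative, and it uses the strict encoding name 'UTF-8' rather than the lax 'utf8'; both give the same bytes for this text.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Same text as the question's entities (&#1053;&#1077; ...), written as
# code point escapes so that no non-core module is needed.
my $text = "\x{41d}\x{435} \x{43f}\x{440}\x{43e}\x{441}\x{442}\x{43e}";

# Serialize the characters into UTF-8 octets.
my $bytes = encode('UTF-8', $text);

# Largest element value in each string.
my ($max_char) = sort { $b <=> $a } map { ord } split //, $text;
my ($max_byte) = sort { $b <=> $a } map { ord } split //, $bytes;

printf "characters: %d, largest code point: %d\n", length $text,  $max_char;
printf "bytes:      %d, largest value:      %d\n", length $bytes, $max_byte;
```

The character string has elements above 255 while the byte string does not, and `$bytes` is what would be passed as `_text`. Presumably the same applies to `$mail{Message}` with Mail::Sendmail, though that is not tested here.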
Re^2: Wide characters in e-mail by lghansen (Initiate) on May 02, 2008 at 07:13 UTC
To be honest, this works, but I am confused as to why. According to the documentation for `decode_entities`, the characters are in Unicode:

"decode_entities( $string, ... ) This routine replaces HTML entities found in the $string with the corresponding Unicode character."

And when I print them as utf-8 to a file or web page they look fine, so maybe `decode_entities` is not doing something properly?
Re^3: Wide characters in e-mail by almut (Canon) on May 02, 2008 at 09:52 UTC
"...but I am confused as to why"

The thing is that the socket which Mail::Sender is printing to is not set up to handle Perl unicode (UTF-8) strings (as you get back from `decode_entities`). Whenever you print a unicode string (i.e. one that is Perl-internally flagged as unicode, with the "utf8" flag on) to a filehandle/socket which is not opened for UTF-8, you'll get the "Wide character in print" warning if the string contains 'wide' characters (i.e. codepoint > 255).

`Encode::encode('utf8', ...)` essentially removes that utf8 flag, i.e. it encodes the string from the Perl-internal unicode representation into a byte string, which in this case holds the data in its proper UTF-8 encoding, but without the utf8 flag set. That's why you're no longer getting the warning from `Mail::Sender`: in a byte string, no value is > 255. (As already implied, `Mail::Sender` hasn't been written to accept unicode strings, even if you declare `charset` to be 'utf8'.)

If you want to explore this further, you could look into `Mail::Sender`'s `Connect` routine (line 934), where the socket is being opened. If you add (for testing purposes) `binmode($s, ":utf8");` before the `return $s;` (`$s` is the socket), my prediction would be that you'd no longer need to `Encode::encode` your `$input`. Just in case you feel like playing around... :)
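The warning can be reproduced without a mail server. The sketch below is an illustration only, not Mail::Sender's code: in-memory filehandles stand in for the socket (an assumption), a `$SIG{__WARN__}` handler records warnings, and the `:encoding(UTF-8)` layer is used as the stricter relative of `binmode($s, ":utf8")`.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "\x{41d}\x{435}";    # two characters, both with code point > 255

# Collect warnings instead of printing them to STDERR.
my @warnings;
$SIG{__WARN__} = sub { push @warnings, $_[0] };

# Case 1: character string to a handle with no encoding layer.
my $buf1 = '';
open my $fh1, '>', \$buf1 or die "open failed: $!";
print {$fh1} $text;
close $fh1;
my $after_plain = @warnings;

# Case 2: pre-encoded bytes to the same kind of handle.
my $buf2 = '';
open my $fh2, '>', \$buf2 or die "open failed: $!";
print {$fh2} encode('UTF-8', $text);
close $fh2;
my $after_bytes = @warnings;

# Case 3: character string to a handle that encodes on output.
my $buf3 = '';
open my $fh3, '>:encoding(UTF-8)', \$buf3 or die "open failed: $!";
print {$fh3} $text;
close $fh3;
my $after_layer = @warnings;

print "warnings after case 1: $after_plain\n";
print "warnings after case 2: $after_bytes\n";
print "warnings after case 3: $after_layer\n";
print "first warning: $warnings[0]" if @warnings;
```

Only the first case should add a warning. Cases 2 and 3 differ in where the encoding happens: in the caller, or in the handle.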
Re^4: Wide characters in e-mail by Jenda (Abbot) on May 03, 2008 at 23:54 UTC
Re^5: Wide characters in e-mail by ikegami (Patriarch) on May 04, 2008 at 00:24 UTC
Re^3: Wide characters in e-mail by ikegami (Patriarch) on May 04, 2008 at 00:42 UTC
You seem to think UNICODE and UTF-8 are the same thing. UTF-8 is a way of storing (encoding) the UNICODE characters.

"To be honest, this works, but I am confused as to why."

`decode_entities` returns a string of UNICODE characters. Internally, it can be stored as either iso-latin-1 (`decode_entities('&eacute;')`) or UTF-8 (`decode_entities('&#1234;')`).

File handles (such as the socket over which the message will be sent) only understand bytes. Characters are turned into bytes by encoding them. That's why `encode` is needed.

The warning is telling you that you were accessing Perl's internal format of the string (which happened to be UTF-8), and that it doesn't like you doing that. Explicitly convert the string of UNICODE characters to a string of UTF-8 bytes.

"And when I print them as utf-8 to a file or web page they look fine, so maybe decode_entities is not doing something properly?"

What does "print them as UTF-8" mean?
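The point about internal storage can be checked with core functions. In this rough sketch, `utf8::upgrade` is used to force the UTF-8 internal form (an assumption standing in for whatever `decode_entities` happens to return), and the variable names are illustrative. It shows that the same character string can be stored either way, and that `encode` gives the same bytes regardless.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $latin    = "\xE9";     # e-acute, stored internally as a single byte
my $upgraded = $latin;
utf8::upgrade($upgraded);  # same character, now stored internally as UTF-8

# As strings of characters, the two are identical.
my $same_string = $latin eq $upgraded ? 1 : 0;

# Only the internal storage flag differs.
my $flag_latin    = utf8::is_utf8($latin)    ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;

# Explicit encoding does not depend on the internal storage.
my $bytes_a = encode('UTF-8', $latin);
my $bytes_b = encode('UTF-8', $upgraded);

print "equal as strings: $same_string\n";
print "utf8 flag: $flag_latin vs $flag_upgraded\n";
printf "encoded lengths: %d and %d\n", length $bytes_a, length $bytes_b;
```

Since the flag is an implementation detail, code that writes to a socket should not rely on it and should encode explicitly.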
Re^2: Wide characters in e-mail by lghansen (Initiate) on May 02, 2008 at 11:12 UTC
Thanks, I didn't think that I had to convert the string, because I thought it was already converted, based on this line in the HTML::Entities::decode_entities documentation: "This routine replaces HTML entities found in the $string with the corresponding Unicode character." I obviously misread it; I get it now.