lghansen has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am trying to send a plain-text e-mail in Perl that has UTF-8 characters in it. The characters themselves are as expected, so I know the UTF-8 string is OK, but I can't figure out how to tell the e-mail module to handle UTF-8 properly. If I use Mail::Sendmail, the program halts with the following error: Wide character in subroutine entry at C:/Perl/site/lib/Mail/Sendmail.pm line 237. Here is the test code I have written:
    use HTML::Entities;
    use strict;

    my $input = 'Не просто';
    $input = decode_entities($input);

    # Sendmail stuff
    my %mail;
    $mail{'Content-type'} = 'text/plain; charset="utf-8"';
    $mail{smtp}    = 'smtp.mycompany.com';
    $mail{From}    = 'server@mycompany.com';
    $mail{To}      = 'me@mycompany.com';
    $mail{Subject} = 'Sendmail Test';
    $mail{Message} = $input;

    use Mail::Sendmail;
    Mail::Sendmail::sendmail(%mail) || print STDERR $Mail::Sendmail::error;
If I replace the sendmail code with Mail::Sender::Easy, I only get a warning, and the e-mail is fine, but I'd prefer not to have the warning: Wide character in print at C:/Perl/site/lib/Mail/Sender.pm line 1767, <GEN0> line 14.
    # Sender setup
    use Mail::Sender::Easy qw(email);

    email({
        smtp    => 'smtp.mycompany.com',
        from    => 'server@mycompany.com',
        to      => 'me@mycompany.com',
        subject => 'Sender Test',
        charset => 'utf-8',
        _text   => $input,
    }) || print STDERR "email() failed: $@";
Any ideas how to get either module to work without an error or warning? Thanks, Lisa

Re: Wide characters in e-mail
by pc88mxer (Vicar) on May 01, 2008 at 17:19 UTC
    I think simply encoding $input will do the trick:
    use Encode;
    my $bytes = encode('utf8', $input);
    email({
        ...
        charset => 'utf8',
        _text   => $bytes,
    }) ...
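The same idea can be sketched for the Mail::Sendmail variant from the question. This is only a sketch under assumptions: the Cyrillic text is written with `\x{...}` escapes as a stand-in for the decode_entities output, the host and addresses are the placeholders from the original post, and the actual send is left commented out because it needs Mail::Sendmail and a reachable SMTP server.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Stand-in for the decoded text from the question ("Не просто").
my $input = "\x{41D}\x{435} \x{43F}\x{440}\x{43E}\x{441}\x{442}\x{43E}";

my %mail = (
    'Content-type' => 'text/plain; charset="utf-8"',
    smtp           => 'smtp.mycompany.com',
    From           => 'server@mycompany.com',
    To             => 'me@mycompany.com',
    Subject        => 'Sendmail Test',
    Message        => encode('UTF-8', $input),   # bytes, not characters
);

# Count values above 255 left in the body; a byte string has none.
my $wide_left = () = $mail{Message} =~ /[^\x00-\xFF]/g;
print "wide characters left in body: $wide_left\n";

# Uncomment to actually send (needs Mail::Sendmail and an SMTP host):
# use Mail::Sendmail;
# sendmail(%mail) or print STDERR $Mail::Sendmail::error;
```

The point of the check is that the module only ever sees octets, so nothing in the message body can trigger a "Wide character" complaint.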
      To be honest, this works, but I am confused as to why. According to the documentation for decode_entities, the characters are in Unicode:

      decode_entities( $string, ... )

      This routine replaces HTML entities found in the $string with the corresponding Unicode character.

      And when I print them as utf-8 to a file or web page they look fine, so maybe decode_entities is not doing something properly?
        ...but I am confused as to why

        Thing is that the socket which Mail::Sender is printing to, is not set up to handle Perl unicode (UTF-8) strings (as you get back from decode_entities). Whenever you print a unicode string (i.e. one that is Perl-internally flagged as unicode with the "utf8" flag on) to a filehandle/socket which is not opened for UTF-8, you'll get the "Wide character in print" warning, if the string does contain 'wide' characters (i.e. codepoint > 255).

        Encode::encode('utf8', ...) essentially removes that utf8 flag, i.e. it encodes the string from the Perl-internal unicode representation into a byte string, which in this case holds the data in its proper UTF-8 encoding, but without the utf8 flag set. That's why you're no longer getting the warning from Mail::Sender: in a byte string, no value is > 255. (As already implied, Mail::Sender hasn't been written to accept unicode strings, even if you declare charset to be 'utf8'.)
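A small sketch of what encode changes, using only core modules. The two-character sample string and the variable names are made up for illustration; the string is written with escapes so the source file stays ASCII.

```perl
use strict;
use warnings;
use Encode qw(encode);
use List::Util qw(max);

my $chars = "\x{41D}\x{435}";           # two Cyrillic characters
my $bytes = encode('UTF-8', $chars);    # the same text as UTF-8 octets

my $char_count  = length $chars;        # length counts characters here
my $byte_count  = length $bytes;        # and octets here
my $flag_before = utf8::is_utf8($chars) ? 1 : 0;
my $flag_after  = utf8::is_utf8($bytes) ? 1 : 0;
my $max_ord     = max map { ord $_ } split //, $bytes;

print "characters: $char_count, bytes: $byte_count\n";
print "utf8 flag before: $flag_before, after: $flag_after\n";
print "largest value in the byte string: $max_ord\n";
```

After encoding, every element of the string fits in one octet, which is why print has nothing left to warn about.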

        If you want to explore this further, you could look into Mail::Sender's Connect routine (line 934), where the socket is being opened. If you'd add (for testing purposes)

        binmode($s, ":utf8");

        before the return $s; ($s is the socket), my prediction would be that you'd no longer need to Encode::encode your $input. Just in case you feel like playing around... :)
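The effect of such a layer can be tried without touching Mail::Sender at all. The sketch below is an assumption-laden stand-in: it uses in-memory filehandles instead of the SMTP socket, and `:encoding(UTF-8)` (the stricter relative of `:utf8`) instead of the exact layer suggested above. Warnings are collected through `$SIG{__WARN__}` so they can be counted.

```perl
use strict;
use warnings;

my $text = "\x{41D}\x{435}";   # character string with code points > 255

my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# 1. Handle with no encoding layer: Perl warns about the wide characters.
my $raw = '';
open my $plain, '>', \$raw or die "open failed: $!";
print {$plain} $text;
close $plain;
my $warned_without_layer = scalar @warnings;

# 2. Same print, but the handle itself encodes characters to UTF-8.
@warnings = ();
my $encoded = '';
open my $layered, '>', \$encoded or die "open failed: $!";
binmode $layered, ':encoding(UTF-8)';
print {$layered} $text;
close $layered;
my $warned_with_layer = scalar @warnings;

print STDOUT "warnings without layer: $warned_without_layer\n";
print STDOUT "warnings with layer:    $warned_with_layer\n";
```

Whether patching the module's socket this way is a good idea in production is a separate question; encoding the text yourself keeps the module untouched.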

        You seem to think UNICODE and UTF-8 are the same thing. UTF-8 is a way of storing (encoding) the UNICODE characters.

        To be honest, this works, but I am confused as to why.

        decode_entities returns a string of UNICODE characters. Internally, it can be stored as either iso-latin-1 (decode_entities('&eacute;')) or UTF-8 (decode_entities('&#1234;')).
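To illustrate the two internal storage formats without depending on HTML::Entities, here is a sketch that builds the same two characters with chr (U+00E9 for `&eacute;` and U+04D2 for `&#1234;`). The variable names are illustrative, and the flag values shown are what a typical Perl build reports; the internal format is an implementation detail you should not rely on.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $e_acute = chr(0xE9);    # U+00E9, can be stored as a single Latin-1 byte
my $wide    = chr(0x4D2);   # U+04D2, must be stored as UTF-8 internally

my $e_flag = utf8::is_utf8($e_acute) ? 1 : 0;
my $w_flag = utf8::is_utf8($wide)    ? 1 : 0;

# Switching the internal storage does not change the string's value.
my $upgraded = $e_acute;
utf8::upgrade($upgraded);
my $same_string = ($upgraded eq $e_acute) ? 1 : 0;

# Encoding gives the same octets regardless of internal storage.
my $same_bytes =
    (encode('UTF-8', $e_acute) eq encode('UTF-8', $upgraded)) ? 1 : 0;

print "flag for U+00E9: $e_flag, flag for U+04D2: $w_flag\n";
print "same string after upgrade: $same_string\n";
print "same UTF-8 bytes either way: $same_bytes\n";
```

Either way, both values are one-character strings; only an explicit encode turns them into a defined sequence of bytes.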

        File handles (such as the socket over which the message will be sent) only understand bytes. Characters are turned into bytes by encoding them. That's why encode is needed.

        The warning was telling you that you were accessing Perl's internal format of the string (which happened to be UTF-8) and that it doesn't like you doing that. Explicitly convert the string of UNICODE characters to a string of UTF-8 bytes.

        And when I print them as utf-8 to a file or web page they look fine, so maybe decode_entities is not doing something properly?

        What does "print them as UTF-8" mean?

      Thanks. I didn't think that I had to convert the string, because I thought it was already converted, based on this line in the HTML::Entities::decode_entities documentation:
      This routine replaces HTML entities found in the $string with the corresponding Unicode character.
      I obviously misread it, I get it now.