in reply to Re: Wide characters in e-mail
in thread Wide characters in e-mail

To be honest, this works, but I am confused as to why. According to the documentation for decode_entities, the characters are in Unicode:

decode_entities( $string, ... )

This routine replaces HTML entities found in the $string with the corresponding Unicode character.

And when I print them as utf-8 to a file or web page they look fine, so maybe decode_entities is not doing something properly?

Re^3: Wide characters in e-mail
by almut (Canon) on May 02, 2008 at 09:52 UTC
    ...but I am confused as to why

    Thing is that the socket which Mail::Sender is printing to, is not set up to handle Perl unicode (UTF-8) strings (as you get back from decode_entities). Whenever you print a unicode string (i.e. one that is Perl-internally flagged as unicode with the "utf8" flag on) to a filehandle/socket which is not opened for UTF-8, you'll get the "Wide character in print" warning, if the string does contain 'wide' characters (i.e. codepoint > 255).
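A minimal sketch of how that warning arises, using only core Perl. The in-memory filehandle here is an assumption standing in for Mail::Sender's socket; neither has an encoding layer, so printing a character above 255 should trigger the same warning.

```perl
use strict;
use warnings;

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# A character string containing U+263A, a codepoint above 255.
my $text = "smiley: \x{263A}";

# In-memory handle with no encoding layer (stand-in for the socket).
open my $fh, '>', \my $buf or die "cannot open in-memory handle: $!";
print {$fh} $text;
close $fh;

print "warnings caught: ", scalar(@warnings), "\n";
print $warnings[0] if @warnings;
```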

    Encode::encode('utf8', ...) essentially removes that utf8 flag, i.e. it encodes the string from the Perl-internal unicode representation into a byte string, which in this case holds the data in its proper UTF-8 encoding, but without the utf8 flag set. That's why you're no longer getting the warning from Mail::Sender — because in a byte string, no value is > 255.   (As already implied, Mail::Sender hasn't been written to accept unicode strings, even if you declare charset to be 'utf8'.)
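To make the character string versus byte string distinction concrete, here is a small sketch. It uses the strict 'UTF-8' encoding name rather than the 'utf8' spelling from the post, and utf8::is_utf8 to peek at the internal flag; the euro sign is just an arbitrary example character.

```perl
use strict;
use warnings;
use Encode ();

# One character: U+20AC (euro sign), a codepoint above 255.
my $chars = "\x{20AC}";

# Encode the character string into a string of UTF-8 bytes.
my $bytes = Encode::encode('UTF-8', $chars);

printf "chars: length=%d flag=%s\n",
    length($chars), utf8::is_utf8($chars) ? 'on' : 'off';
printf "bytes: length=%d flag=%s hex=%s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'on' : 'off', unpack('H*', $bytes);
# chars: length=1 flag=on
# bytes: length=3 flag=off hex=e282ac
```

Every element of $bytes is in the range 0..255, which is why printing it to a plain handle no longer warns.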

    If you want to explore this further, you could look into Mail::Sender's Connect routine (line 934), where the socket is being opened. If you'd add (for testing purposes)

    binmode($s, ":utf8");

    before the return $s; ($s is the socket), my prediction would be that you'd no longer need to Encode::encode your $input. Just in case you feel like playing around... :)
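The prediction can be tried without touching Mail::Sender. In this sketch an in-memory filehandle is an assumed stand-in for the socket $s; with the :utf8 layer applied, the unencoded character string should go through without a warning and produce the same bytes as an explicit encode.

```perl
use strict;
use warnings;
use Encode ();

my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# Character string with one Latin-1 range and one wide character.
my $text = "caf\x{E9} \x{263A}";

# Stand-in for the socket, switched to UTF-8 output after opening.
open my $fh, '>', \my $buf or die "cannot open in-memory handle: $!";
binmode($fh, ':utf8');
print {$fh} $text;          # no Encode::encode needed here
close $fh;

my $expected = Encode::encode('UTF-8', $text);
print "same bytes as explicit encode: ", ($buf eq $expected ? "yes" : "no"), "\n";
print "warnings: ", scalar(@warnings), "\n";
```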

      The problem with this solution is that it breaks emails that are not UTF-8. If I send a message with some non-ASCII Latin1 (well, Windows-1252) characters, with this binmode() I receive them converted to UTF-8.

      So I guess I should binmode($s, ":utf8"); only for the UTF-8 body of the message or the UTF-8 message part, and turn it back with binmode($s); afterwards. Though I'm afraid of what it would do if someone did the encode('utf8', ...) on the text before turning it to Mail::Sender :-(

      So I'm afraid of making that change.

        If I send a message with some non ASCII Latin1 (well, windows1252) characters,

        You can't do that given $mail{'Content-type'} = 'text/plain; charset="utf-8"';. It would be like Verizon quoting a price of 0.002 *cents* per kilobyte but charging you 0.002 *dollars* per kilobyte. You can't tell the client you're using one encoding but actually use another.

        If you use $mail{'Content-type'} = 'text/plain; charset="UTF-8"';, then you'd use binmode($s, ":encoding(UTF-8)"); or encode("UTF-8", $text).

        If you use $mail{'Content-type'} = 'text/plain; charset="cp1252"';, then you'd use binmode($s, ":encoding(cp1252)"); or encode("cp1252", $text).
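A quick sketch of why the declared charset and the actual encoding have to agree: the same character comes out as different bytes under each encoding, so a client decoding with the wrong one sees the wrong text. The euro sign is an arbitrary example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character string, encoded two different ways.
my $text = "\x{20AC}";      # euro sign

my %hex_for;
for my $charset (qw(UTF-8 cp1252)) {
    $hex_for{$charset} = unpack('H*', encode($charset, $text));
    printf "%-6s => %s\n", $charset, $hex_for{$charset};
}
# UTF-8  => e282ac
# cp1252 => 80
```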

        Though I'm afraid of what it would do if someone did the encode('utf8', ...) on the text before turning it to Mail::Sender :-(

        It would produce junk. You can't use both encode($encoding, ...) and binmode(..., ":encoding($encoding)") on the same text, because the text would then be encoded twice.
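Here is a small sketch of that junk. Again an in-memory filehandle is assumed in place of the socket: text that was already run through encode gets encoded a second time by the handle's layer.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "\x{E9}";                    # e with acute accent, one character
my $once = encode('UTF-8', $text);      # two bytes: C3 A9

# The handle's layer treats each of those bytes as a character
# and encodes it again.
open my $fh, '>:encoding(UTF-8)', \my $buf
    or die "cannot open in-memory handle: $!";
print {$fh} $once;
close $fh;

printf "encoded once:  %s\n", unpack('H*', $once);
printf "encoded twice: %s\n", unpack('H*', $buf);
# encoded once:  c3a9
# encoded twice: c383c2a9
```

A client decoding the second result as UTF-8 would display two characters (A with tilde, then a copyright sign) instead of the single accented e.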

Re^3: Wide characters in e-mail
by ikegami (Patriarch) on May 04, 2008 at 00:42 UTC

    You seem to think UNICODE and UTF-8 are the same thing. UTF-8 is a way of storing (encoding) the UNICODE characters.

    To be honest, this works, but I am confused as to why.

    decode_entities returns a string of UNICODE characters. Internally, it can be stored as either iso-latin-1 (decode_entities('&eacute;')) or UTF-8 (decode_entities('&#1234;')).
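The two internal storage formats can be observed with core Perl alone. This sketch uses chr() in place of decode_entities (an assumption made to avoid depending on HTML::Entities) and utf8::is_utf8 to inspect the internal flag; the point is that the storage format does not change which characters the string holds.

```perl
use strict;
use warnings;

my $latin = chr(233);     # U+00E9, fits in a single byte internally
my $wide  = chr(1234);    # U+04D2, has to be stored as UTF-8 internally

printf "chr(233):  flag=%s length=%d\n",
    utf8::is_utf8($latin) ? 'on' : 'off', length($latin);
printf "chr(1234): flag=%s length=%d\n",
    utf8::is_utf8($wide) ? 'on' : 'off', length($wide);

# Force the first string into the UTF-8 internal format.
my $upgraded = $latin;
utf8::upgrade($upgraded);

# Different storage, same string of characters.
print "still equal after upgrade: ", ($upgraded eq $latin ? "yes" : "no"), "\n";
```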

    File handles (such as the socket over which the message will be sent) only understand bytes. Characters are turned into bytes by encoding them. That's why encode is needed.

    The warning tells you that you were accessing Perl's internal format of the string (which happened to be UTF-8), and that Perl doesn't like you doing that. The fix is to explicitly convert the string of UNICODE characters to a string of UTF-8 bytes.

    And when I print them as utf-8 to a file or web page they look fine, so maybe decode_entities is not doing something properly?

    What does "print them as UTF-8" mean?