rethaew has asked for the wisdom of the Perl Monks concerning the following question:

Good day, I have been given a large dump of emails (that were originally SMS messages) and charged with making them display nicely in HTML. I am using MIME::Parser to parse out the messages, which is working well, except that the raw email messages contain "emojis" encoded as quoted-printable. I need to be able to decode these emojis and map them to images that will display in a browser. I have a list of emojis and their images in Unicode form, e.g. U+1F601, but I am having trouble turning the quoted-printable code into this Unicode format. For example, in the raw email message I have this smiley face in quoted-printable:
=F0=9F=98
I need this to translate into U+1F601 so I can map it to the correct image to display. Right now MIME::Parser is converting it into unrecognizable characters. I am admittedly a novice in terms of character sets and translations. Any advice you can give would be appreciated.

Replies are listed 'Best First'.
Re: Quoted Printable to Unicode or something
by Anonymous Monk on Oct 25, 2013 at 17:46 UTC
Re: Quoted Printable to Unicode or something
by aitap (Curate) on Oct 25, 2013 at 18:36 UTC

    Is a charset specified for the message text? For example, Content-Type: text/plain; charset="utf-8". The problem is that you have arbitrary bytes encoded as text, while what you really need is Unicode characters, not bytes.

    If the encoding is specified, use decode to decode these bytes into characters after decoding them from quoted-printable encoding. If not, you'll have to guess it somehow (I tried utf-8, utf-16(le|be), shift-jis and failed to obtain any sense from the resulting characters).

      Yes, Content-Type: text/plain; charset=utf-8 is specified in the message header. Also, I apologize, I made a typo in the OP; the code I meant to state was:
      =F0=9F=98=B3
      Which would be part of the message body, e.g.
      So I saw Kevin today and he is sooo cute =F0=9F=98=B3
      In the original message, this would be a smiley face emoji. I am a little unclear on using decode. Are you saying to decode just the '=F0=9F=98=B3', or the entire message? Can you give an example?

        I tried to decode a MIME-encoded message with MIME::Tools, and got the body decoded from quoted-printable to bytes and accessible via MIME::Body methods. To get the Unicode characters I needed one more decoding step: decoding the message body from bytes to characters using the Encode module.

        Approaching your example,

        use MIME::Decoder;
        use Encode 'decode';

        # only for this particular case I will decode QP manually
        my $d = new MIME::Decoder 'quoted-printable';
        # usual way of obtaining bytes decoded from
        # QP/Base64/7bit/other content-transfer-encodings
        # is to use MIME::Body methods

        # encode unicode characters to UTF-8 on printing
        binmode STDOUT, ":utf8";

        # open an in-memory filehandle
        # since MIME::Decoder only supports filehandles
        open my $fh, ">", \(my $bytes);

        # decode the quoted-printable
        $d->decode(\*DATA, $fh);

        # decode the bytes
        my $characters = decode 'utf-8' => $bytes;

        # prove having 1 character, not 4 bytes
        while ($characters =~ /(.)/g) {
            printf "%s is unicode character %x\n", $1, (unpack "W", $1);
        }

        __DATA__
        =F0=9F=98=B3
        � is unicode character 1f633
        my terminal font doesn't have emoji, so it showed � instead

        More info at perlopen, Encode, perlunitut, perluniintro, perlunifaq.
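
For the mapping step the original question asks about (going from decoded characters to keys of the form U+1F633 and then to an image), a minimal sketch might look like the following. It uses only the core modules MIME::QuotedPrint and Encode so it can run on its own; in the real program the bytes would presumably come from the MIME::Parser body handle rather than from decode_qp. The %emoji_image table, the image paths, and the helper names codepoint_label, escape_html and qp_to_html are all hypothetical, made up for illustration.

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(decode_qp);   # core module
use Encode qw(decode);                 # core module

# Hypothetical lookup table: "U+XXXX" key => image path (both invented here).
my %emoji_image = (
    'U+1F633' => 'img/emoji/1f633.png',
    'U+1F601' => 'img/emoji/1f601.png',
);

# Characters that are special in HTML text and their entities.
my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');

# Turn one character into the "U+1F633" form used as the table key.
sub codepoint_label {
    my ($char) = @_;
    return sprintf 'U+%04X', ord $char;
}

# Escape HTML-special characters in a piece of text.
sub escape_html {
    my ($text) = @_;
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

# Quoted-printable text -> HTML, with known emoji replaced by <img> tags.
sub qp_to_html {
    my ($qp, $charset) = @_;
    my $bytes = decode_qp($qp);            # quoted-printable -> raw bytes
    my $chars = decode($charset, $bytes);  # bytes -> Perl characters
    my $html  = '';
    for my $char (split //, $chars) {
        my $label = codepoint_label($char);
        if (exists $emoji_image{$label}) {
            $html .= sprintf '<img src="%s" alt="%s">',
                $emoji_image{$label}, $label;
        }
        else {
            $html .= escape_html($char);
        }
    }
    return $html;
}

binmode STDOUT, ':encoding(UTF-8)';
print qp_to_html('So I saw Kevin today and he is sooo cute =F0=9F=98=B3', 'UTF-8'), "\n";
```

The important design point is the order of operations: the table lookup only works on characters, so the charset decode has to happen before anything is compared against the U+ keys. Characters that are not in the table are passed through with HTML escaping, so the page itself should be served as UTF-8.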