Errto has asked for the wisdom of the Perl Monks concerning the following question:

Folks,

I'm writing mail-processing programs that need to take incoming emails and store them for later use. The part I'm trying to nail down right now is interpreting Unicode strings correctly. Here's a little snippet of a program to read an incoming message in the usual (RFC822) format:

use strict;
use warnings;
use Mail::Audit;
use MIME::Words qw(decode_mimewords);
use Encode;

open my $out, '>:utf8', "/some/file.txt";
my $mail = new Mail::Audit;
my $subj = $mail->get('Subject');
$subj = decode_utf8 decode_mimewords($subj);
print $out "$subj\n";

The part that seems odd to me is the decode_utf8 call. I wasn't expecting to need it, but without it I get the wrong output. For example, here is the raw version of the relevant field (the word in question is the Russian word for "Russian", i.e. Русский):

Subject: =?UTF-8?B?0KDRg9GB0YHQutC40Lk=?=
If I run my program on it without the decode_utf8 part, I get
Π ΡΡΡΠΊΠΈΠΉ
whereas with it, I get the result I want. Am I doing something wrong here, or do I just misunderstand how these modules work?

Replies are listed 'Best First'.
Re: reading Base64 encoded Unicode email returns mangled text
by ikegami (Patriarch) on Mar 19, 2007 at 04:45 UTC

    The documentation warns against using decode_mimewords in scalar context. In fact, it recommends using unmime from MIME::WordDecoder instead of decode_mimewords.

    If you were to use decode_mimewords, seems to me the proper usage would be:

    my $subj = '';
    foreach (decode_mimewords($mail->get('Subject'))) {
        my ($data, $charset) = @$_;
        if (defined($charset)) {
            $subj .= decode($charset, $data);
        } else {
            $subj .= $data;
        }
    }

      I guess that makes sense. I was kind of assuming that there would be some method that would take a MIME-encoded ASCII string, with possibly different components in different encodings, and return me a normal Perl string, with whatever conversions done as needed. Based on earlier experiments it didn't look like MIME::WordDecoder actually handles UTF-8 correctly though I suppose I could try again. Update: Indeed it does not: calling unmime on the string from the OP returns a bunch of question marks.
        The following should work, no matter what encoding was used by the source string.
        MIME::WordDecoder->default(MIME::WordDecoder->supported("UTF-8"));
        $subj = decode('UTF-8', unmime($subj));

        Update: Ugh! Don't use MIME::WordDecoder. After looking at its guts, I'd recommend the snippet I presented earlier (packaged as a reusable function below). I think this toolkit predates the addition of Unicode support to Perl. That would explain the weird and convoluted interface.

        use MIME::Words qw( decode_mimewords );
        use Encode      qw( decode );

        sub mime_decode {
            my $decoded = '';
            foreach (decode_mimewords($_[0])) {
                my ($data, $charset) = @$_;
                if (defined($charset)) {
                    $decoded .= decode($charset, $data);
                } else {
                    $decoded .= $data;
                }
            }
            return $decoded;
        }

        $subj = mime_decode($subj);

        Untested

Re: reading Base64 encoded Unicode email returns mangled text
by GrandFather (Saint) on Mar 19, 2007 at 02:51 UTC

    My reading of the documentation for MIME::Words' decode_mimewords makes me think that all it is doing is the equivalent of entity decoding: it takes some encoded stuff and returns the raw version. There is no implication that it understands UTF-8 so the decode_utf8 that you needed to add makes sense - decode_mimewords is returning the raw UTF-8 byte stream decoded from base64.
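
    The difference between the raw octets and the decoded characters can be sketched as follows. This is only an illustration: it uses MIME::Base64 directly as a stand-in for the base64 step (an assumption, not something taken from the MIME::Words source), and the variable names are made up.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);
use Encode qw(decode encode);

# Payload of the encoded word from the original post.
my $bytes = decode_base64('0KDRg9GB0YHQutC40Lk=');

# The base64 step only yields octets: 14 of them for this subject.
printf "octets: %d\n", length $bytes;

# Decoding those octets as UTF-8 yields the 7-character string.
my $chars = decode('UTF-8', $bytes);
printf "characters: %d\n", length $chars;

# Printing the undecoded octets through a :utf8 layer treats each octet
# as a Latin-1 character and encodes it again: 28 octets of mojibake.
my $double = encode('UTF-8', $bytes);
printf "double-encoded octets: %d\n", length $double;
```

    The last step is roughly what happens to the Subject in the original program when decode_utf8 is left out.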


    DWIM is Perl's answer to Gödel

Re: reading Base64 encoded Unicode email returns mangled text
by Juerd (Abbot) on Jun 13, 2007 at 19:28 UTC

    Just use the Encode module that comes with Perl. It does the RFC 2047 decoding in one step:

    use Encode;
    my $russkiy = decode("mime-header", "=?UTF-8?B?0KDRg9GB0YHQutC40Lk=?=");
    $russkiy is a proper 7-character Perl Unicode string.
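
    For the case raised earlier in the thread, where one header mixes parts in different encodings, a rough sketch along the same lines might look like this. The header value is hypothetical (only the UTF-8 word comes from the original post), and the sketch assumes the installed Encode knows both charsets.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical header: plain ASCII, an ISO-8859-1 Q-encoded word,
# and the UTF-8 base64 word from the original post.
my $raw = 'Re: =?ISO-8859-1?Q?caf=E9?= and =?UTF-8?B?0KDRg9GB0YHQutC40Lk=?=';

# One call decodes every encoded word according to its own charset.
my $subject = decode('MIME-Header', $raw);

# Encode explicitly on output (an :encoding(UTF-8) layer would also do).
print encode('UTF-8', "$subject\n");
```

    Either way, the result is a character string, so it still has to be encoded once when it is written out.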

    Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

    PS- You can use Russian variable names too, if you wish:
    
    use utf8;
    use Encode;
    
    my $русский = ...;
    
    PPS- Perl Monks sucks for not allowing Unicode characters in <code>-blocks.