reading Base64 encoded Unicode email returns mangled text

Errto has asked for the wisdom of the Perl Monks concerning the following question:

Folks,

I'm writing mail-processing programs that need to take incoming emails and store them for later use. The part I'm trying to nail down right now is interpreting Unicode strings correctly. Here's a little snippet of a program to read an incoming message in the usual (RFC822) format:

use strict;
use warnings;
use Mail::Audit;
use MIME::Words qw(decode_mimewords);
use Encode;

open my $out, '>:utf8', "/some/file.txt" or die "Cannot open /some/file.txt: $!";
my $mail = new Mail::Audit;
my $subj = $mail->get('Subject');
$subj = decode_utf8 decode_mimewords($subj);
print $out "$subj\n";

The part that seems odd to me is the decode_utf8 call. I wasn't expecting to need it, but without it I get the wrong output. For example, here is the raw version of the relevant field (the word in question is the Russian word for "Russian", i.e. Русский):

Subject: =?UTF-8?B?0KDRg9GB0YHQutC40Lk=?=

If I run my program on it without the decode_utf8 part, I get

а бббаКаИаЙ

whereas with it, I get the result I want. Am I doing something wrong here, or do I just misunderstand how these modules work?


Replies are listed 'Best First'.
Re: reading Base64 encoded Unicode email returns mangled text by ikegami (Patriarch) on Mar 19, 2007 at 04:45 UTC
The documentation warns against using `decode_mimewords` in scalar context. In fact, it recommends `unmime` from MIME::WordDecoder over `decode_mimewords`. If you do use `decode_mimewords`, it seems to me the proper usage would be:

my $subj = '';
foreach (decode_mimewords($mail->get('Subject'))) {
    my ($data, $charset) = @$_;
    if (defined($charset)) {
        $subj .= decode($charset, $data);
    } else {
        $subj .= $data;
    }
}
Re^2: reading Base64 encoded Unicode email returns mangled text by Errto (Vicar) on Mar 19, 2007 at 15:51 UTC
I guess that makes sense. I was kind of assuming that there would be some method that would take a MIME-encoded ASCII string, with possibly different components in different encodings, and return a normal Perl string, with whatever conversions done as needed. Based on earlier experiments, it didn't look like MIME::WordDecoder actually handles UTF-8 correctly, though I suppose I could try again.

Update: Indeed it does not: calling `unmime` on the string from the OP returns a bunch of question marks.
Re^3: reading Base64 encoded Unicode email returns mangled text by ikegami (Patriarch) on Mar 19, 2007 at 17:13 UTC
The following should work, no matter what encoding was used by the source string.

MIME::WordDecoder->default(MIME::WordDecoder->supported("UTF-8"));
$subj = decode('UTF-8', unmime($subj));

Update: Ugh! Don't use MIME::WordDecoder. After looking at its guts, I'd recommend the snippet I presented earlier (packaged as a reusable function below). I think this toolkit predates the addition of Unicode support to Perl. That would explain the weird and convoluted interface.

use MIME::Words qw( decode_mimewords );
use Encode qw( decode );

sub mime_decode {
    my $decoded = '';
    foreach (decode_mimewords($_[0])) {
        my ($data, $charset) = @$_;
        if (defined($charset)) {
            $decoded .= decode($charset, $data);
        } else {
            $decoded .= $data;
        }
    }
    return $decoded;
}

$subj = mime_decode($subj);

Untested.
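To see why the charset has to be applied chunk by chunk, here is a small sketch that uses only Encode. The `@chunks` list is a hand-built, hypothetical stand-in for what `decode_mimewords` hands back in list context (pairs of raw bytes and charset, as the loop above assumes); it is not output captured from MIME::Words.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for the list-context result of decode_mimewords:
# each element is [raw bytes, charset]; charset is undef for plain ASCII text.
my @chunks = (
    [ 'Re: ',    undef        ],
    [ "caf\xE9", 'ISO-8859-1' ],   # "café" as Latin-1 bytes
    [ ' ',       undef        ],
    [ "\xD0\xA0\xD1\x83\xD1\x81\xD1\x81\xD0\xBA\xD0\xB8\xD0\xB9", 'UTF-8' ],  # "Русский" as UTF-8 bytes
);

# Decode each chunk with its own charset, then concatenate the characters.
my $subject = '';
for my $chunk (@chunks) {
    my ($data, $charset) = @$chunk;
    $subject .= defined $charset ? decode($charset, $data) : $data;
}

printf "%d characters\n", length $subject;
```

A single decode_utf8 over the joined bytes would not handle this case, because the Latin-1 byte \xE9 is not valid UTF-8 on its own.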
Re: reading Base64 encoded Unicode email returns mangled text by GrandFather (Saint) on Mar 19, 2007 at 02:51 UTC
My reading of the documentation for MIME::Words' decode_mimewords makes me think that all it is doing is the equivalent of entity decoding: it takes some encoded stuff and returns the raw version. There is no implication that it understands UTF-8, so the decode_utf8 that you needed to add makes sense: decode_mimewords is returning the raw UTF-8 byte stream decoded from base64.

DWIM is Perl's answer to Gödel
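The bytes-versus-characters point can be checked with core modules alone. This is only a sketch: it decodes the base64 payload from the OP's Subject header by hand with MIME::Base64 rather than going through MIME::Words, and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);
use Encode qw(decode encode);

# The base64 payload from the Subject header in the original post.
my $bytes = decode_base64('0KDRg9GB0YHQutC40Lk=');

# Still a byte string here: 14 octets of UTF-8, two per Cyrillic letter.
my $byte_count = length $bytes;

# Decoding the octets as UTF-8 yields the 7-character string.
my $chars      = decode('UTF-8', $bytes);
my $char_count = length $chars;

# Writing the undecoded bytes through a :utf8 layer treats every byte as a
# Latin-1 character and encodes it again, which is where the mangling comes from.
my $double_encoded = encode('UTF-8', $bytes);
my $double_count   = length $double_encoded;

printf "%d bytes, %d characters, %d bytes when double-encoded\n",
    $byte_count, $char_count, $double_count;
# prints: 14 bytes, 7 characters, 28 bytes when double-encoded
```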
Re: reading Base64 encoded Unicode email returns mangled text by Juerd (Abbot) on Jun 13, 2007 at 19:28 UTC
Just use the Encode module that comes with Perl. It does the RFC 2047 decoding in one step:

use Encode;
my $russkiy = decode("mime-header", "=?UTF-8?B?0KDRg9GB0YHQutC40Lk=?=");

$russkiy is a proper 7-character Perl Unicode string.

Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

PS- You can use Russian variable names too, if you wish:

use utf8;
use Encode;
my $русский = ...;

PPS- Perl Monks sucks for not allowing Unicode characters in <code>-blocks.
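For anyone who wants to verify the one-step decode and tie it back to the OP's file output, here is a minimal sketch. It writes to an in-memory filehandle instead of /some/file.txt so that the encoded result can be inspected; `$buffer` and the other variable names are assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode the RFC 2047 encoded-word from the OP straight into characters.
my $header_value = '=?UTF-8?B?0KDRg9GB0YHQutC40Lk=?=';
my $subject      = decode('MIME-Header', $header_value);
my $char_count   = length $subject;

# Write the character string through an encoding layer. The in-memory
# handle stands in for the OP's output file.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer
    or die "Cannot open in-memory handle: $!";
print {$out} $subject;
close $out or die "Cannot close in-memory handle: $!";

# The layer encodes on output, so the buffer holds UTF-8 octets.
my $byte_count = length $buffer;

printf "%d characters in, %d bytes out\n", $char_count, $byte_count;
```

No extra decode_utf8 is needed here, since the string is already characters by the time it reaches the output layer.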