tsr2 has asked for the wisdom of the Perl Monks concerning the following question:

I've got Perl 5.6.1, but could probably upgrade if needed.

What I'm trying to do is take an incoming email and break it down into components, which are placed as CDATA sections in an XML tree. I want to be able to decode RFC1522 headers to UTF-8 and reconstitute the original data later.

AFAICT the MIME::Words and MIME::WordDecoder modules won't give me a UTF-8 representation of the original text. The best I can get is some ISO-8859 text that has probably thrown away half of the original characters.

Is there any way in Perl of getting a UTF-8 representation of RFC1522 text, which can be reconstituted to produce the original character set later?

Replies are listed 'Best First'.
Re: MIME nasties
by bgreenlee (Friar) on Aug 03, 2004 at 15:45 UTC

    I think MIME::WordDecoder should work:

    decode STRING
    Instance method. Decode a STRING which might contain MIME-encoded components into a local representation (e.g., UTF-8, etc.).

    Brad

      It might just do what I need in Perl 6, but not in Perl 5.

      I have experimented with MIME::WordDecoder. AFAICT in Perl 5 it only allows a destructive mapping onto an 8-bit character set. It loses most of the information and leaves you no way to recreate the original data.

      #!/usr/local/bin/perl5.8.5

      use strict;
      use MIME::WordDecoder;

      my $field='From: =?koi8-r?B?7s/Xb2PUdSDT1NLBeG/Xwc7J0Q==?= <m2z19uyn1b@rrpa.com>';
      my $decoded = unmime( $field );
      print $decoded
      produces
      ./decode.pl
      ignoring text in character set `KOI8-R' at ./decode.pl line 7
      From: 1 <m2z19uyn1b@rrpa.com>

      Essentially it's just thrown away the KOI8-R characters because they don't map onto ISO-8859, or possibly because it just doesn't understand KOI8-R.

      Of course if you can prove me wrong, I would be happy to be corrected. Maybe I just need to invoke it slightly differently?

      I've been experimenting with Python 2.3 and it seems much more capable in this area, so I think I'll have to learn another scripting language :-(

      $ python2.3
      Python 2.3.4 (#1, Aug 3 2004, 16:01:36)
      [GCC 2.96 20000731 (Red Hat Linux 7.3 2.96-110)] on linux2
      Type "help", "copyright", "credits" or "license" for more information.
      >>> from email.Header import Header
      >>> from email.Header import decode_header
      >>> h='From: =?koi8-r?B?7s/Xb2PUdSDT1NLBeG/Xwc7J0Q==?= <m2z19uyn1b@rrpa.com>'
      >>> decode_header( h )
      [('From:', None), ('\xee\xcf\xd7oc\xd4u \xd3\xd4\xd2\xc1xo\xd7\xc1\xce\xc9\xd1', 'koi8-r'), ('<m2z19uyn1b@rrpa.com>', None)]
      >>> h = Header('\xee\xcf\xd7oc\xd4u \xd3\xd4\xd2\xc1xo\xd7\xc1\xce\xc9\xd1', 'koi8-r')
      >>> print h
      =?koi8-r?b?7s/Xb2PUdSDT1NLBeG/Xwc7J0Q==?=
      >>>
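
      For what it's worth, here is a rough sketch of the full round trip (decode to UTF-8, remember the original charset, re-encode later). It assumes Python 3's email.header module rather than the 2.3 email.Header used in this thread, and the helper names header_to_utf8 and utf8_to_header are made up for illustration. It also assumes the unencoded parts of the header are plain ASCII.

```python
from email.header import Header, decode_header


def header_to_utf8(raw):
    """Split a raw header into (utf8_bytes, charset) pairs.

    charset is None for parts that were not encoded words.
    Assumption: unencoded parts are plain ASCII.
    """
    pairs = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, str):
            # decode_header hands back str when nothing was encoded
            text = chunk
        else:
            text = chunk.decode(charset or "ascii")
        pairs.append((text.encode("utf-8"), charset))
    return pairs


def utf8_to_header(pairs):
    """Rebuild an encoded header from pairs kept by header_to_utf8."""
    header = Header()
    for utf8_bytes, charset in pairs:
        # Header re-encodes each part in the charset recorded earlier
        header.append(utf8_bytes.decode("utf-8"), charset)
    return header.encode()


field = "From: =?koi8-r?B?7s/Xb2PUdSDT1NLBeG/Xwc7J0Q==?= <m2z19uyn1b@rrpa.com>"
pairs = header_to_utf8(field)

# Keep only the KOI8-R part and turn it back into an encoded word
koi8_pairs = [p for p in pairs if p[1] == "koi8-r"]
rebuilt = utf8_to_header(koi8_pairs)
print(rebuilt)
```

      The point is that the charset name has to be stored next to the UTF-8 text (for example as an attribute beside the CDATA section), since UTF-8 alone does not record which character set the sender used. Whitespace around the encoded word may not come back byte for byte when the whole header is rebuilt, so treat this as a starting point only.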

        I see. Sounds like an opportunity for a new (or improved) CPAN module. Then again, I've been looking at Python lately and it seems pretty nice.... ;-)