PerlMonks  

Decoding UTF-8 charset in an .EML file

by japhy (Canon)
on Nov 15, 2006 at 16:56 UTC ( [id://584212]=perlquestion )

japhy has asked for the wisdom of the Perl Monks concerning the following question:

A co-worker of mine has 600,000+ .EML files which contain some UTF-8 encoded text. Here's a sample:
    To: =?utf-8?B?Q2FuY2VsbGF0aW9uIDxDYW5jZWxsYXRpb25Abm9ydmVyZ2VuY2UuY29tPg==?=
    CC: =?utf-8?B?VGltIEhpbGtlIChSVlApIDx0aW0uaGlsa2VAbm9ydmVyZ2VuY2UuY29tPiwgQmVybmFyZCBDb3lsZSBNVlAgb2
    YgREMgPGJlcm5hcmQuY295bGVAbm9ydmVyZ2VuY2UuY29tPiwgRGF2aWQgQnV0bGVyIChBVlApIDxkYXZpZC5idXRsZXJAbm9ydm
    VyZ2VuY2UuY29tPiwgVGltb3RoeSBDYXNleSA8dGltb3RoeS5jYXNleUBub3J2ZXJnZW5jZS5jb20+?=
    Microsoft Mail Internet Headers Version 2.0
    Received: from SERVER.REMOVED ([IP.RE.MO.VED]) by SERVER.REMOVED with Microsoft SMTPSVC(5.0.2195.6713); Thu, 24 Jun 2004 13:47:45 -0400
    X-MimeOLE: Produced By Microsoft Exchange V6.0.6487.1
    content-class: urn:content-classes:message
    MIME-Version: 1.0
    Subject: SOME EMAIL SUBJECT
    Date: Thu, 24 Jun 2004 13:47:44 -0400
    Message-ID: <F237EF3B2E3F5E47BAA4F33B79DB0AB502764A95@SERVER.REMOVED>
    X-MS-Has-Attach:
    X-MS-TNEF-Correlator: <F237EF3B2E3F5E47BAA4F33B79DB0AB502764A95@SERVER.REMOVED>
    Thread-Topic: SOME THREAD TOPIC
    Thread-Index: AcRaE1qu8p4j/zoYQ4WJdDsPJbcodg==
    From: "NAME REMOVED" <NAME.REMOVED@EMAIL.REMOVED
    Return-Path: NAME.REMOVED@EMAIL.REMOVED
He'd like to decode the To: and Cc: lines. What tools are necessary to do this?
Update: Upon examining MIME::Parser, I've tracked down the necessary code in MIME::Words. My solution is:
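    #!/usr/bin/perl
    use MIME::Words 'decode_mimewords';
    use strict;
    use warnings;

    my @files = @ARGV;
    for my $file (@files) {
        local @ARGV = ($file);   # let <> read just this one file
        local $^I = ".bak";      # edit in place, keeping FILE.bak as a backup
        local $/;                # slurp mode: read the whole file at once
        # print goes back into the file being edited
        print scalar decode_mimewords($_) while <>;
    }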
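For a quick check on a single header before touching 600,000 files, a sketch along these lines may help. Note that it does not use MIME::Words: it swaps in the core Encode module's 'MIME-Header' encoding, which decodes both the encoded-words and the charset in one step. The helper name decode_header_line is made up for illustration, and $to is the To: header from the sample with the line-wrap marker removed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn the RFC 2047 encoded-words in one header
# line into a string of Perl characters.
sub decode_header_line {
    my ($line) = @_;
    return decode('MIME-Header', $line);
}

# To: header taken from the sample data (wrap marker removed).
my $to = 'To: =?utf-8?B?Q2FuY2VsbGF0aW9uIDxDYW5jZWxsYXRpb25Abm9ydmVyZ2VuY2UuY29tPg==?=';

# Encode back to UTF-8 bytes for output so non-ASCII names print cleanly.
print encode('UTF-8', decode_header_line($to)), "\n";
```

Because the result is a character string rather than raw bytes, it has to be encoded again on output, which is the main practical difference from printing the scalar result of decode_mimewords directly.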

Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart

Replies are listed 'Best First'.
Re: Decoding UTF-8 charset in an .EML file
by Melly (Chaplain) on Nov 15, 2006 at 19:33 UTC

    Encoding questions seem to be coming up more and more lately - does anyone know of a module or method that can quickly tell you what coding a particular string (probably) uses?

    Tom Melly, tom@tomandlu.co.uk
Re: Decoding UTF-8 charset in an .EML file (RTFBP)
by tye (Sage) on Nov 15, 2006 at 17:08 UTC

    Please attach the code that you've tried so far. Please see these links to "how to not ask a question", "how to ask questions somewhat properly in case you are an idiot", "how to format posts" (for when you post this code), and "guide to the monastery". Thanks.

    If this is homework, please read these nodes about how stupid you are for posting a homework question here:

    Update: Oh, I forgot to make those links. I don't have those handy because I don't post these types of nodes over and over again. Oh well, someone else will surely reply with those links shortly. Please be patient. (:

    Sarcasm. It's not just for breakfast anymore. - tye

      Have I been absent from the Monastery for too long?
      • I have tried no code, because I don't know what modules are appropriate for the conversion task. If you'll re-read my post, that's the question I asked: "What tools are necessary to do this?"
      • I have supplied sample data and stated my goal.
      • This is not homework, this is a task for a co-worker. He is not a Perl programmer, but Perl seems to be an appropriate tool to get this done. However, I have never had to deal with this sort of process before and need a starting point.
      I'm not sure if the "Sarcasm..." is your signature or if it was attached specifically to your reply.

      Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
      How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart

        So you're not even going to try? I mean, I got it on the first guess.

        MIME::Base64 is most of the solution, it appears. But that really was just a guess (but it produces 'Cancellation <Cancellation@norvergence.com>' from your example).
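        A minimal sketch of that guess, assuming the header value is a single =?charset?B?...?= encoded-word. The helper name decode_b_word is hypothetical; it strips the wrapper and hands the payload to decode_base64 from the core MIME::Base64 module. It returns raw bytes and ignores the charset, so it is only "most of the solution" in the same sense as above.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);

# Hypothetical helper: decode one B-encoded word and return its raw bytes.
sub decode_b_word {
    my ($word) = @_;
    # Capture the base64 payload between "=?charset?B?" and "?=".
    my ($payload) = $word =~ /^=\?[^?]+\?[Bb]\?([^?]*)\?=$/
        or return $word;    # not a B-encoded word: hand it back unchanged
    return decode_base64($payload);
}

# Value of the To: header from the sample (wrap marker removed).
my $to = '=?utf-8?B?Q2FuY2VsbGF0aW9uIDxDYW5jZWxsYXRpb25Abm9ydmVyZ2VuY2UuY29tPg==?=';

print decode_b_word($to), "\n";   # Cancellation <Cancellation@norvergence.com>
```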

        Update: Note that the CC list is even more "revealing", especially considering all of the "RE.MO.VI.NG" you did. (:

        (note the (unmodified) sig) I thought your "lack of effort" deserved some ribbing, as did (even more so) the all-too-common response to "lack of effort" around here. Enjoy.

        - tye        

        Frankly, I appreciate questions like this. They spark my curiosity, so I go learn a little something. There seems to be a lot of sarcasm aimed at folks who ask questions here recently. Some of it is a little more pointed than required, it seems to me.

        Anyway, thanks for the question, I learned something from it. :-)

        ...the majority is always wrong, and always the last to know about it...
