tertullian01 has asked for the wisdom of the Perl Monks concerning the following question:

I need to convert email bodies to UTF-8, and the best module for that seems to be Unicode::MapUTF8. The bigger problem, however, is determining what the original encoding is. I sent an email with the same text from three different accounts, using two different browsers and Microsoft Outlook. Only one of the emails, sent from a university email account through Firefox, had the charset in the header. Is there any other way to determine the charset with Perl, or does anyone have any ideas?

Replies are listed 'Best First'.
Re: Detecting charset in email
by jhourcle (Prior) on Jun 22, 2005 at 23:37 UTC

    If there is no character set in the header, you should typically assume that it is ASCII (see section 3.1 of RFC822, or sections 2.1 and 2.3 of RFC2822).

    update: 822, not 8221 (the link was good, though)

      What are the chances that email servers use ASCII if it is not defined in the header? I have tried sending emails from several email accounts including Hotmail, Gmail, operamail and a university account. Some of them use ASCII and some of them have Unicode encoding but none of them specify what they are using. Do you have any more ideas?

        The mail servers should pass through anything without significant inspection. (see RFC821 and RFC2821).

        The problem could be either a misconfigured mail client that generated the message, or an encoding that is declared somewhere you haven't looked -- for instance, with MIME (RFC2045, http://www.faqs.org/rfcs/rfc2045.html), there's an additional set of headers (RFC2047) that you may need to inspect, particularly if it's a multipart message, as each part may have a separate encoding.
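
        As a rough illustration of inspecting those headers, here is a minimal sketch that pulls the charset parameter out of a Content-Type header value and falls back to us-ascii, the default MIME specifies when no charset is given. The function name charset_from_content_type is made up for this example, and it assumes you have already unfolded the header and isolated its value (for a multipart message you would repeat this for each part's own Content-Type). It uses only core Perl, so it should also run on 5.6.

        ```perl
        use strict;
        use warnings;

        # Return the charset named in a Content-Type header value, lower-cased,
        # or 'us-ascii' (the MIME default) when no charset parameter is present.
        # Hypothetical helper: expects the unfolded header value, not the raw message.
        sub charset_from_content_type {
            my ($value) = @_;
            return 'us-ascii' unless defined $value;
            if ($value =~ /;\s*charset\s*=\s*"?([^\s";]+)"?/i) {
                return lc $1;
            }
            return 'us-ascii';
        }

        print charset_from_content_type('text/plain; charset="ISO-8859-1"'), "\n";
        print charset_from_content_type('text/plain'), "\n";
        ```

        A real MIME parser from CPAN will handle folding, comments and nested parts far more robustly; this only shows where the information lives.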

        If there's still no encoding specified, and the content is not US-ASCII, then you're dealing with a BROKEN mail client, as it's not conforming to the protocols for generating internet e-mail. You can either inform the manufacturer of the mistake, or take a guess at what they intended the message to be. (It need not be a completely random guess -- you can look to see if the message has the headers 'X-Mailer' or 'User-Agent' and try to infer from that, or look through the content for patterns indicating what it might be.)
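
        For the "look through the content for patterns" route, one cheap check is whether the body is pure 7-bit ASCII, and if not, whether it is at least well-formed UTF-8. The sketch below does that with a regex in core Perl. The function name classify_bytes and the label 'unknown-8bit' are invented for this example, and anything that lands in that last bucket still needs a real guess (ISO-8859-1, Windows-1252, and so on).

        ```perl
        use strict;
        use warnings;

        # Classify a raw byte string as 'us-ascii', 'utf-8', or 'unknown-8bit'.
        # Hypothetical helper: it only tests well-formedness, it does not prove
        # that the sender actually meant UTF-8.
        sub classify_bytes {
            my ($bytes) = @_;

            # No byte above 0x7F: plain 7-bit ASCII.
            return 'us-ascii' if $bytes !~ /[^\x00-\x7F]/;

            # Every byte sequence must be a well-formed UTF-8 character.
            return 'utf-8' if $bytes =~ /\A(?:
                  [\x00-\x7F]
                | [\xC2-\xDF][\x80-\xBF]
                | \xE0[\xA0-\xBF][\x80-\xBF]
                | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
                | \xED[\x80-\x9F][\x80-\xBF]
                | \xF0[\x90-\xBF][\x80-\xBF]{2}
                | [\xF1-\xF3][\x80-\xBF]{3}
                | \xF4[\x80-\x8F][\x80-\xBF]{2}
            )*\z/x;

            return 'unknown-8bit';
        }

        print classify_bytes("caf\xC3\xA9"), "\n";   # e-acute as two UTF-8 bytes
        print classify_bytes("caf\xE9"), "\n";       # e-acute as one Latin-1 byte
        ```

        Text in a single-byte charset rarely happens to be valid UTF-8, so a 'utf-8' result is a fairly strong hint, but it is still a heuristic.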

        There are some older encodings that don't advertise what they are in the headers, because they map within 7-bit space (BinHex, uuencode, vCard, PGP, etc.).

Re: Detecting charset in email
by rlucas (Scribe) on Jun 24, 2005 at 00:48 UTC
    You can use Encode::Guess, I believe, which guesses the encoding of a string of bytes. For the guess to be any good, you have to prime it with the most likely charsets (e.g. if you are expecting English, probably ASCII, ISO-8859-1, etc., while if you are expecting Japanese...).

    However, here is a shameless plug for a project I'm tangentially working with (though not leading): HEBCI, or HTML Entity Based Codepage Inference, though I often forget that and think it's Heuristic Estimation of Bytes' Charset, Idiomatically, or such.

    HEBCI is the way to figure out a charset by sending some stuff that comes back differently in different charsets, and checking the differences. HEBCI is an HTML way to do this, but I imagine that you could try the same principle with email, in the event that Encode::Guess doesn't suffice.
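
    To make the Encode::Guess suggestion concrete, here is a minimal sketch. guess_encoding is the function the module exports; the wrapper guess_charset is just a name made up for this example. With no extra suspects the module only tries ASCII, UTF-8 and BOM-marked UTF-16/32, so you pass the charsets you expect as additional arguments. Note that Encode::Guess needs a newer Perl than 5.6.

    ```perl
    use strict;
    use warnings;
    use Encode::Guess;

    # Return the name of the guessed encoding, or undef when Encode::Guess
    # cannot settle on exactly one candidate (it then returns an error string
    # instead of an encoding object).
    # Hypothetical wrapper; @suspects are extra charsets to prime the guess with.
    sub guess_charset {
        my ($bytes, @suspects) = @_;
        my $enc = guess_encoding($bytes, @suspects);
        return ref $enc ? $enc->name : undef;
    }

    my $name = guess_charset("caf\xC3\xA9");
    print defined $name ? $name : 'no single guess', "\n";

    # Priming for Japanese mail, as in the module's own synopsis.
    $name = guess_charset("plain text", qw/euc-jp shiftjis 7bit-jis/);
    print defined $name ? $name : 'no single guess', "\n";
    ```

    One caveat, as far as I understand it: single-byte charsets such as ISO-8859-1 accept nearly any byte sequence, so listing them as suspects alongside UTF-8 tends to produce an "ambiguous" answer rather than a clean one.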

      I have tried to use Encode::Guess, but the server I am using has Perl 5.6.1 and the module requires v5.7.3. This is the error it returned when I tried to install it:
      Perl v5.7.3 required--this is only v5.6.1, stopped at Makefile.PL line 1. BEGIN failed--compilation aborted at Makefile.PL line 1.