Detecting charset in email

tertullian01 has asked for the wisdom of the Perl Monks concerning the following question:

Re: Detecting charset in email by jhourcle (Prior) on Jun 22, 2005 at 23:37 UTC
If there is no character set in the header, you should typically assume that it is ASCII (see section 3.1 of RFC822, or sections 2.1 and 2.3 of RFC2822). Update: 822, not 8221 (the link was good, though).
Re^2: Detecting charset in email by tertullian01 (Acolyte) on Jun 27, 2005 at 14:15 UTC
What are the chances that email servers use ASCII if it is not defined in the header? I have tried sending emails from several email accounts, including Hotmail, Gmail, Operamail and a university account. Some of them use ASCII and some of them use a Unicode encoding, but none of them specify what they are using. Do you have any more ideas?
Re^3: Detecting charset in email by jhourcle (Prior) on Jun 27, 2005 at 16:02 UTC
The mail servers should pass through anything without significant inspection (see RFC821 and RFC2821). The problem could be either a misconfigured mail client that generated the message, or that they've hidden the encoding in some other location. With MIME (RFC2045, http://www.faqs.org/rfcs/rfc2045.html), for instance, there's an additional set of headers that you may need to inspect, such as Content-Type with its charset parameter, and header text itself may be encoded (RFC2047). This matters particularly if it's a multipart message, as each part may have a separate encoding. If there's still no encoding specified, and it's not US-ASCII, then you're dealing with a BROKEN mail client, as it is not conforming to the protocols for generating internet e-mail. You can either inform the manufacturer of the mistake, or take a guess at what they intended the message to be. (It need not be a completely random guess: you can look to see if they have the headers 'X-Mailer' or 'User-Agent' and try to infer from that, or look through the content for patterns indicating what it might be.) There are some older encodings that don't advertise what they are in the headers, because they map within 7bit space (BinHex, uuencode, vCard, PGP, etc.).
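As a rough illustration of the header check described above, here is a minimal sketch that pulls the charset parameter out of a Content-Type header value and falls back to US-ASCII when none is declared. The helper name charset_from_content_type is made up for this example, and the regex only covers the common quoted and unquoted forms, not every syntax MIME allows. For a multipart message you would run it against the Content-Type of each part.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): extract the charset
# parameter from a Content-Type header value. Returns 'us-ascii'
# when the header is missing or declares no charset.
sub charset_from_content_type {
    my ($value) = @_;
    return 'us-ascii' unless defined $value;

    # Accept both charset="ISO-8859-1" and charset=utf-8
    if ($value =~ /;\s*charset\s*=\s*(?:"([^"]*)"|([^\s;"]+))/i) {
        my $charset = defined $1 ? $1 : $2;
        return lc $charset if length $charset;
    }
    return 'us-ascii';
}

print charset_from_content_type('text/plain; charset="ISO-8859-1"'), "\n";  # iso-8859-1
print charset_from_content_type('text/plain'), "\n";                        # us-ascii
```

This uses no modules beyond the strict and warnings pragmas, so it should also run on an older Perl such as 5.6.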
Re: Detecting charset in email by rlucas (Scribe) on Jun 24, 2005 at 00:48 UTC
I believe you can use Encode::Guess, which guesses the encoding of a byte string. To get a good guess, you have to prime it with the most likely charsets (e.g. if you are expecting English, probably ASCII, ISO-8859-1, etc., while if you are expecting Japanese...). However, here is a shameless plug for a project I'm tangentially working with (though not leading): HEBCI, or HTML Entity Based Codepage Inference, though I often forget that and think it's Heuristic Estimation of Bytes' Charset, Idiomatically, or such. HEBCI figures out a charset by sending some content that comes back differently in different charsets, and checking the differences. HEBCI is an HTML way to do this, but I imagine that you could try the same principle with email, in the event that Encode::Guess doesn't suffice.
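For reference, a minimal sketch of the Encode::Guess approach might look like the following. It assumes a Perl recent enough to ship the Encode distribution (5.8 or later), and the helper name guess_charset is made up for this example. guess_encoding returns an encoding object on success and a plain error string when the guess fails or is ambiguous, so the sketch checks for a reference before asking for the name.

```perl
use strict;
use warnings;
use Encode::Guess;   # part of the Encode distribution

# Hypothetical helper: return the name of the guessed encoding,
# or undef when the guess fails or is too ambiguous.
sub guess_charset {
    my ($bytes, @suspects) = @_;

    # With no extra suspects, only ASCII, UTF-8 and BOM-marked
    # UTF-16/32 are considered.
    my $decoder = guess_encoding($bytes, @suspects);

    # An object means success; anything else is an error message.
    return ref $decoder ? $decoder->name : undef;
}

my $name = guess_charset("caf\xc3\xa9 au lait");   # UTF-8 encoded bytes
print defined $name ? "guessed: $name\n" : "could not guess\n";
```

To prime the guess for Japanese mail, you could pass suspects such as qw/euc-jp shiftjis 7bit-jis/ as the extra arguments. Single-byte encodings such as the ISO-8859 family are hard to tell apart this way, so listing several of them at once tends to produce an ambiguous result.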
Re^2: Detecting charset in email by tertullian01 (Acolyte) on Jun 27, 2005 at 13:28 UTC
I have tried to use Encode::Guess, but the server I am using has Perl 5.6.1 and the module requires v5.7.3. This is the error it returned when I tried to install it: `Perl v5.7.3 required--this is only v5.6.1, stopped at Makefile.PL line 1. BEGIN failed--compilation aborted at Makefile.PL line 1.`