in reply to Detecting charset in email

If there is no character set in the header, you should typically assume that it is US-ASCII (see section 3.1 of RFC822, or sections 2.1 and 2.3 of RFC2822).
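As a rough illustration of that default, here is a minimal Python sketch using the standard library's email package. The helper name declared_charset and the sample messages are made up for this example; the only point is that a missing charset parameter falls back to US-ASCII.

```python
from email import message_from_bytes

def declared_charset(raw: bytes, default: str = "us-ascii") -> str:
    """Return the charset named in Content-Type, or the default if none is given."""
    msg = message_from_bytes(raw)
    # get_content_charset() returns None when there is no charset parameter
    # (or no Content-Type header at all), so fall back to the default.
    return msg.get_content_charset() or default

# Hypothetical sample messages, one untagged and one tagged.
plain = b"From: a@example.com\r\nSubject: hi\r\n\r\nhello\r\n"
tagged = b"Content-Type: text/plain; charset=ISO-8859-1\r\n\r\nhello\r\n"

print(declared_charset(plain))
print(declared_charset(tagged))
```

Note that the charset name comes back lowercased, so compare it case-insensitively against whatever names you expect.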

update: 822, not 8221 (the link was good, though)

Re^2: Detecting charset in email
by tertullian01 (Acolyte) on Jun 27, 2005 at 14:15 UTC
    What are the chances that email servers use ASCII if it is not defined in the header? I have tried sending emails from several email accounts including Hotmail, Gmail, operamail and a university account. Some of them use ASCII and some of them have Unicode encoding but none of them specify what they are using. Do you have any more ideas?

      The mail servers should pass through anything without significant inspection (see RFC821 and RFC2821).

      The problem could be either a misconfigured mail client that generated the message, or an encoding that is declared in some other location -- for instance, with MIME (RFC2045, http://www.faqs.org/rfcs/rfc2045.html), there's an additional set of headers (RFC2047) that you may need to inspect, particularly if it's a multipart message, as each part may have a separate encoding.
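      To make the "other locations" concrete, here is a hedged Python sketch that lists the charset declared on each MIME part and decodes an RFC2047 encoded word in the Subject. The function names part_charsets and decoded_subject and the sample message are invented for illustration; a real message may be structured differently.

```python
from email import message_from_bytes
from email.header import decode_header, make_header

def part_charsets(raw: bytes):
    """List (content type, declared charset) for every non-container part."""
    msg = message_from_bytes(raw)
    found = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no text of their own
        # The charset is None when the part does not declare one.
        found.append((part.get_content_type(), part.get_content_charset()))
    return found

def decoded_subject(raw: bytes) -> str:
    """Decode RFC2047 encoded words (=?charset?q?...?=) in the Subject."""
    msg = message_from_bytes(raw)
    return str(make_header(decode_header(msg.get("Subject", ""))))

# Hypothetical two-part message; each part names its own charset.
sample = (
    b"Subject: =?iso-8859-1?q?caf=E9?=\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="XX"\r\n'
    b"\r\n"
    b"--XX\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"hello\r\n"
    b"--XX\r\n"
    b"Content-Type: text/html; charset=iso-8859-1\r\n"
    b"\r\n"
    b"<p>hello</p>\r\n"
    b"--XX--\r\n"
)

print(part_charsets(sample))
print(decoded_subject(sample))
```

      The top-level header of this sample says nothing about a charset; only the individual parts do, which is why checking the outer header alone can look like "no encoding specified".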

      If there's still no encoding specified, and it's not US-ASCII, then you're dealing with a BROKEN mail client, as it's not conforming to the protocols for generating internet e-mail. You can either inform the manufacturer of the mistake, or take a guess at what they intended the message to be. (Well, it might not be a completely random guess -- you can look to see if they have the headers 'X-Mailer' or 'User-Agent', and try to infer from that, or look through the content for patterns indicating what it might be.)
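      One way such an educated guess might look, sketched in Python: try a few strict decodings in order and take the first that succeeds, and pull out the client header as a hint. The candidate list and its order are assumptions chosen for this example, not a recommendation from any RFC, and guess_decode and client_hint are hypothetical helper names.

```python
from email import message_from_bytes

# Assumed candidate order: strictest encodings first, so a loose one
# does not "succeed" on bytes that were really UTF-8.
CANDIDATES = ("us-ascii", "utf-8", "windows-1252")

def guess_decode(body: bytes, candidates=CANDIDATES):
    """Return (encoding name, text) for the first candidate that decodes cleanly."""
    for name in candidates:
        try:
            return name, body.decode(name)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so it never fails;
    # the result may still be wrong.
    return "latin-1", body.decode("latin-1")

def client_hint(raw: bytes):
    """Return the X-Mailer or User-Agent header value, or None if neither is set."""
    msg = message_from_bytes(raw)
    return msg.get("X-Mailer") or msg.get("User-Agent")

print(guess_decode(b"hello"))
print(guess_decode("caf\u00e9".encode("utf-8")))
print(guess_decode(b"caf\xe9"))
```

      A clean decode only shows that the bytes are valid in that encoding, not that the sender meant it, so treat the result as a guess and keep the original bytes around.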

      There are some older encodings that don't advertise what they are in the headers, because they map within 7-bit space (BinHex, uuencode, vCard, PGP, etc.).
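      Since these formats announce themselves in the body rather than in the headers, one rough approach is to scan the body for their usual opening lines. The Python sketch below does that; the marker patterns are my assumptions about the common forms of each format and will not catch every variant, and inline_formats is a hypothetical helper name.

```python
import re

# Assumed opening lines for each format; real-world variants exist.
MARKERS = {
    "uuencode": re.compile(r"^begin [0-7]{3,4} \S", re.M),
    "binhex": re.compile(r"^\(This file must be converted with BinHex", re.M),
    "vcard": re.compile(r"^BEGIN:VCARD", re.M | re.I),
    "pgp": re.compile(r"^-----BEGIN PGP ", re.M),
}

def inline_formats(body: str):
    """Return the names of the formats whose marker line appears in the body."""
    return [name for name, pattern in MARKERS.items() if pattern.search(body)]

print(inline_formats("see attached\nbegin 644 notes.txt\n#0V%T\n`\nend\n"))
print(inline_formats("just plain text"))
```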