suggested:
Scan through all the emails, computing a checksum (I'd use MD5)
of the body and certain header elements (Subject, From, etc.).
If any messages have the same checksum, they are most likely duplicates. If
you checksum the entire header instead, you may miss some duplicates (for
example, if the duplicate had a different mail queue ID or came in at a
slightly different time). Storing checksums rather than entire messages also
means you don't have to keep the content of every message around (which could
be a problem if some of them are large).
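
Here is a rough sketch of that idea in Python, using only the standard
library. The function names (message_checksum, find_duplicates) and the
choice of Subject and From as the headers to hash are my own assumptions for
illustration; adjust the header list to whatever stays stable between copies
in your mail.

```python
import hashlib
from email import message_from_string


def message_checksum(raw, headers=("Subject", "From")):
    """Return an MD5 hex digest of selected headers plus the body.

    Headers such as Received or Date are deliberately left out, since
    they can differ between two copies of the same message.
    """
    msg = message_from_string(raw)
    digest = hashlib.md5()
    for name in headers:
        value = str(msg.get(name, ""))
        digest.update((name + ":" + value + "\n").encode("utf-8", "replace"))
    # Collect the payload of every non-multipart part; for a simple
    # message this is just the body text.
    for part in msg.walk():
        if not part.is_multipart():
            digest.update(str(part.get_payload()).encode("utf-8", "replace"))
    return digest.hexdigest()


def find_duplicates(raw_messages):
    """Return (duplicate_index, first_seen_index) pairs.

    Only the checksums are kept in memory, not the messages themselves.
    """
    seen = {}
    duplicates = []
    for index, raw in enumerate(raw_messages):
        checksum = message_checksum(raw)
        if checksum in seen:
            duplicates.append((index, seen[checksum]))
        else:
            seen[checksum] = index
    return duplicates
```

A matching checksum only means "most likely duplicate", so you may want to
compare the two messages directly before deleting anything. MD5 is fine for
spotting accidental duplicates, but it is not safe against someone
deliberately crafting collisions.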