I'd suggest something a little more robust than what Chady suggested:
Scan through all the emails, computing a checksum (I'd use MD5) of the body and certain header elements (Subject, From, etc.). If any messages have the same checksum, they are most likely duplicates. If you checksum the entire header instead, you may miss some duplicates (for example, if the duplicate had a different mail queue ID or came in at a slightly different time). Storing checksums rather than entire messages also saves you from having to store all of the content of all the messages (which could be a problem if some of them are large).
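The checksum idea above could be sketched roughly as follows. This is only an illustration in Python, not code from the thread: the function names `message_checksum` and `find_duplicates` are made up for the example, the header list is limited to Subject and From, and the body is assumed to be everything after the first blank line of the raw message.

```python
import hashlib
from email import message_from_string


def message_checksum(raw, headers=("Subject", "From")):
    """Return an MD5 hex digest of the selected headers plus the body.

    Hypothetical helper: headers not listed (Received, Date, queue IDs)
    are deliberately ignored so they cannot hide a duplicate.
    """
    text = raw.replace("\r\n", "\n")
    msg = message_from_string(text)
    digest = hashlib.md5()
    for name in headers:
        value = str(msg.get(name, ""))
        digest.update((name + ":" + value + "\n").encode("utf-8", "replace"))
    # Assumption: the body is whatever follows the first blank line.
    body = text.partition("\n\n")[2]
    digest.update(body.encode("utf-8", "replace"))
    return digest.hexdigest()


def find_duplicates(messages):
    """Return (first_index, duplicate_index) pairs for repeated checksums."""
    seen = {}  # checksum -> index of the first message with that checksum
    duplicates = []
    for index, raw in enumerate(messages):
        key = message_checksum(raw)
        if key in seen:
            duplicates.append((seen[key], index))
        else:
            seen[key] = index
    return duplicates
```

Only the 32-character digests are kept in `seen`, so memory use does not grow with message size.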
Well, sometimes mails are duplicates but have tiny differences between them, like an extra blank line or a .sig. I've seen this happen frequently with mailing list managers.
So perhaps it would be better to go by the headers alone...
Yeah, stuff the emails into the keys of a hash; that will remove the dupes. Be sure to lc the whole thing first.
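A rough sketch of the hash-key approach, written in Python with a dict standing in for the hash. The function name `unique_case_insensitive` is made up for the example, and treating each item as a plain string (such as an address) is an assumption; the post does not say exactly what gets stored.

```python
def unique_case_insensitive(items):
    """Drop entries that differ only by case, keeping the first spelling seen."""
    seen = {}  # lowercased key -> original item
    for item in items:
        # Lowercasing the key is the equivalent of lc-ing before hashing.
        seen.setdefault(item.lower(), item)
    return list(seen.values())
```

For example, `unique_case_insensitive(["Bob@Example.com", "bob@example.com"])` keeps only the first entry.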
He who asks will be a fool for five minutes, but he who doesn't ask will remain a fool for life.
Chady | http://chady.net/