deleting duplicate emails from a mail file

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: deleting duplicate emails from a mail file by lhoward (Vicar) on Jun 12, 2001 at 19:36 UTC
I'd suggest something a little more robust than what Chady suggested: scan through all the emails, computing a checksum (I'd use MD5) of the body and certain header elements (Subject, From, etc.). Any messages with the same checksum are most likely duplicates. If you checksum the entire header, you may miss some duplicates (for example, if the duplicate had a different mail queue ID or came in at a slightly different time). Storing checksums instead of entire messages also saves you from having to store all of the content of all the messages, which could be a problem if some of them are large.
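A minimal sketch of the checksum idea above, not a definitive implementation. It assumes an mbox-style file where every message begins with a `From ` separator line and where body lines starting with `From ` have already been escaped. The sub names (`split_mbox`, `message_digest`, `dedupe_by_digest`) are made up for illustration, and folded (multi-line) headers are not handled.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Split an mbox-style string into individual messages.
# Assumption: each message starts with a "From " separator line.
sub split_mbox {
    my ($text) = @_;
    return grep { length } split /^(?=From )/m, $text;
}

# Checksum the body plus a few stable headers (Subject, From),
# ignoring headers that vary between copies (Received, queue IDs).
sub message_digest {
    my ($msg) = @_;
    my ($head, $body) = split /\n\n/, $msg, 2;
    $head = '' unless defined $head;
    $body = '' unless defined $body;

    my @keep;
    for my $name (qw(Subject From)) {
        # "From:" needs the colon, so the "From " separator is not matched.
        push @keep, $1 if $head =~ /^\Q$name\E:[ \t]*(.*)$/mi;
    }
    return md5_hex(join("\n", @keep, $body));
}

# Keep only the first message seen for each checksum.
sub dedupe_by_digest {
    my (@messages) = @_;
    my %seen;
    return grep { !$seen{ message_digest($_) }++ } @messages;
}
```

Only the 32-character hex digests are kept in `%seen`, so memory use stays small even when individual messages are large.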
Re: Re: deleting duplicate emails from a mail file by asiufy (Monk) on Jun 12, 2001 at 19:51 UTC
Well, sometimes mails are duplicates but have tiny differences between them, like an extra blank line or a .sig. I've seen this happen frequently with mailing list managers. So perhaps it would be better to go by the headers alone...
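One way the headers-only idea could look, as a rough sketch. The choice of headers is an assumption, not something stated above: it uses `Message-ID` when the message has one and otherwise falls back to From, Subject and Date. The sub names `header_value`, `header_key` and `dedupe_by_headers` are hypothetical, and folded headers are not handled.

```perl
use strict;
use warnings;

# Return the value of the first header with the given name, or undef.
sub header_value {
    my ($head, $name) = @_;
    return $head =~ /^\Q$name\E:[ \t]*(.*?)[ \t]*$/mi ? $1 : undef;
}

# Build a comparison key from the headers only, so body differences
# (an extra blank line, an appended .sig) do not matter.
sub header_key {
    my ($msg) = @_;
    my ($head) = split /\n\n/, $msg, 2;
    $head = '' unless defined $head;

    # Assumption: copies of the same mail share a Message-ID.
    my $id = header_value($head, 'Message-ID');
    return "id:$id" if defined $id && length $id;

    # Fallback for messages without a Message-ID.
    my @parts;
    for my $name (qw(From Subject Date)) {
        my $value = header_value($head, $name);
        push @parts, defined $value ? lc $value : '';
    }
    return 'hdr:' . join("\0", @parts);
}

# Keep only the first message seen for each header key.
sub dedupe_by_headers {
    my (@messages) = @_;
    my %seen;
    return grep { !$seen{ header_key($_) }++ } @messages;
}
```

The trade-off is the reverse of the checksum approach: two genuinely different mails with identical key headers would be treated as duplicates.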
Re: deleting duplicate emails from a mail file by Chady (Priest) on Jun 12, 2001 at 19:18 UTC
Yeah, stuff the emails into the keys of a hash. This will remove the dupes, and be sure to `lc` the whole thing first. He who asks will be a fool for five minutes, but he who doesn't ask will remain a fool for life. Chady | http://chady.net/
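The hash-key trick can be sketched like this, assuming the messages are already split into a list of strings. The sub name `dedupe_exact` is made up for illustration.

```perl
use strict;
use warnings;

# Use the lowercased message text as a hash key; a message is kept
# only the first time its key is seen, so input order is preserved.
sub dedupe_exact {
    my (@messages) = @_;
    my %seen;
    return grep { !$seen{ lc $_ }++ } @messages;
}
```

Note that this only catches messages that are identical apart from letter case, and every full message text ends up stored as a hash key.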