Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

To tackle a problem with mail files that contain many duplicate emails, I am considering writing a mail file parser to delete the duplicates. To avoid reinventing the wheel, I wanted to check whether someone has already written such a script. Thanks, carloz

Replies are listed 'Best First'.
Re: deleting duplicate emails from a mail file
by lhoward (Vicar) on Jun 12, 2001 at 19:36 UTC
    I'd suggest something a little more robust than what Chady suggested: scan through all the emails, computing a checksum (I'd use MD5) of the body and certain header elements (Subject, From, etc.). Any messages with the same checksum are most likely duplicates. If you checksum the entire header, you may miss some duplicates (for example, if the duplicate has a different mail queue ID or came in at a slightly different time). Storing checksums instead of entire messages also means you don't have to store all of the content of all the messages, which could be a problem if some of them are large.
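    A minimal sketch of that checksum idea, written in Python purely for illustration since the thread has no code of its own (in Perl, Digest::MD5 could play the same role). The function names `fingerprint` and `drop_duplicates`, the choice of Subject and From, and the assumption that each message arrives as one raw string with a blank line between header and body are all hypothetical:

```python
import hashlib
from email import message_from_string


def fingerprint(raw, header_names=("Subject", "From")):
    """Return an MD5 hex digest over selected headers plus the body.

    Headers not listed (Received, queue IDs, and so on) are ignored,
    so copies that differ only there get the same digest.
    """
    msg = message_from_string(raw)
    # Assumption: the body is everything after the first blank line.
    _, _, body = raw.partition("\n\n")
    digest = hashlib.md5()
    for name in header_names:
        digest.update(str(msg.get(name, "")).encode("utf-8", "replace"))
        digest.update(b"\0")  # separator so adjacent fields cannot run together
    digest.update(body.encode("utf-8", "replace"))
    return digest.hexdigest()


def drop_duplicates(raw_messages):
    """Keep the first message seen for each digest, preserving order."""
    seen = set()
    unique = []
    for raw in raw_messages:
        key = fingerprint(raw)
        if key not in seen:
            seen.add(key)
            unique.append(raw)
    return unique
```

    Only the digests are kept in `seen`, so memory use stays small even when individual messages are large.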
      Well, sometimes mails are duplicates, but they have tiny differences between them, like an extra blank line, or a .sig. I've seen this happen frequently with mailing list managers.

      So, perhaps it would be better just to go by the headers...
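      A rough sketch of going by the headers alone, so that an extra blank line or an added .sig in the body no longer matters. Which headers to compare is an assumption here (From, Subject and Date are illustrative picks), and `header_key` and `dedupe_by_headers` are hypothetical names:

```python
from email import message_from_string


def header_key(raw, header_names=("From", "Subject", "Date")):
    """Build a comparison key from selected headers only.

    Whitespace runs are collapsed and case is folded so that trivial
    reformatting of a header does not hide a duplicate.
    """
    msg = message_from_string(raw)
    return tuple(
        " ".join(str(msg.get(name, "")).split()).lower()
        for name in header_names
    )


def dedupe_by_headers(raw_messages):
    """Keep the first message for each header key, preserving order."""
    first_seen = {}
    for raw in raw_messages:
        first_seen.setdefault(header_key(raw), raw)
    return list(first_seen.values())
```

      The trade-off is that two genuinely different messages sharing all of the chosen headers would be collapsed into one.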
Re: deleting duplicate emails from a mail file
by Chady (Priest) on Jun 12, 2001 at 19:18 UTC

    Yeah, stuff the emails into the keys of a hash; this will remove the dupes. Be sure to lc the whole thing first, so that differences in case don't count.
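    The hash-key trick might look like this, shown in Python with a dict standing in for the Perl hash and `str.lower()` for `lc`. The name `dedupe_exact` and the assumption that each message is already available as one string are hypothetical:

```python
def dedupe_exact(raw_messages):
    """Use the lowercased message text as a dict key to drop repeats.

    The original (non-lowercased) text of the first copy is kept as
    the value, and dict insertion order preserves message order.
    """
    unique = {}
    for raw in raw_messages:
        unique.setdefault(raw.lower(), raw)
    return list(unique.values())
```

    This only catches copies that are identical apart from case, and it keeps every distinct message in memory as a key.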


    He who asks will be a fool for five minutes, but he who doesn't ask will remain a fool for life.

    Chady | http://chady.net/