in reply to Parsing out "unique" messages from mbox files

I found a little script I used many years ago to merge some mailboxes; it uses formail and computes md5 of every message to check for duplicates:

# shell script
# note: $file keeps its inboxes/ prefix, so results go to output/inboxes/...
for file in inboxes/* ; do
    echo "removing digests"
    rm -f digests/*
    echo "removing output/$file"
    rm "output/$file"
    echo "digesting $file"
    # formail splits the mailbox and runs ./save once per message
    formail -ds ./save "$file" < "$file"
    echo
done

# Perl script 'save'
#!/usr/bin/perl
use strict;
use Digest::MD5 qw(md5_hex);
use vars qw($data $digest $mbox);

$mbox = shift @ARGV;

# read the whole message from STDIN
while (<>) {
    $data .= $_;
}

$digest = md5_hex($data);

if (-e "digests/$digest") {
    # duplicate: already saved
    print "!";
} else {
    # new message: append it to the output mailbox
    open(F, ">>output/$mbox") or die "outbox: $!";
    print F $data;
    close(F);
    # leave a marker file named after the digest
    open(G, ">digests/$digest") or die "digests: $!";
    print G "\n";
    close(G);
    print ".";
}

I used formail because I couldn't find a way to split a mailbox in Perl. The script will probably fail to detect a duplicate when two copies of the same message differ in status (new, read, answered, etc.), since the digest is computed over the whole message, headers included.
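One possible way around the status problem, as an untested sketch rather than part of the original script: strip the status headers before hashing. The helper name normalized_digest is made up for illustration, and I am assuming the status lives in the Status and X-Status headers, which may not hold for every mail client.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: digest a message while ignoring the headers that
# mail clients rewrite (assumed here to be Status and X-Status).
sub normalized_digest {
    my ($message) = @_;

    # Headers end at the first blank line; everything after is the body.
    my ($headers, $body) = split /\n\n/, $message, 2;
    $headers = '' unless defined $headers;
    $body    = '' unless defined $body;

    # Make sure the last header line ends with a newline, so removing a
    # trailing Status header leaves the same text as never having one.
    $headers .= "\n";

    # Drop Status / X-Status lines, including folded continuation lines.
    $headers =~ s/^(?:X-)?Status:.*\n(?:[ \t].*\n)*//mgi;

    return md5_hex("$headers\n$body");
}
```

In 'save' this would replace the md5_hex($data) call, i.e. $digest = normalized_digest($data); the message is still written to the output mailbox unchanged.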

HTH, Valerio