Parsing out "unique" messages from mbox files

hacker has asked for the wisdom of the Perl Monks concerning the following question:

I've got a large number of mbox-format mailboxes here, accumulated over years of mail backups upon mail backups. I'm trying to go through them all and remove any duplicate messages that may be in them, before finally archiving them to permanent offline storage. I've been filing them in more granular hierarchies over the years, so what was once ~mail/foo may later have become ~mail/Projects/foo, which contains mostly the same mail, but could be different.

What I need to do is roll through each one, and yank out the duplicate emails that may have been put there from concatenating them or testing/debugging broken procmail recipes, and store them each in a single mbox file that contains only unique mail.

I've looked into Mail::Box::Mbox and Email::Folder::Mbox, but decided to try to roll my own first. Here's what I've got, and it seems to work so far.

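use strict;
use warnings;

undef $/;                       # slurp mode: read whole files at once
my %seen;
# join '' so that every file named on the command line is read, not just
# the first; the capturing split keeps the blank-line separators.
my @para = split /(\n\n+)/, join '', <>;

while (defined($_ = shift @para)) {
    die "No From line!\n" unless /^From /;

    # Split the header block into logical lines (continuations start with
    # a space or tab) and grab the Message-ID, with or without a space
    # after the colon.
    my ($id) = map /^Message-ID:\s*(\S.*)/im,
               split /\n(?![ \t])/;

    warn "No Message-ID! [[$_]]\n" unless defined $id;
    # Append everything up to the next paragraph that starts with "From ".
    $_ .= shift @para while @para and $para[0] !~ /^From /;
    print unless defined $id and $seen{$id}++;
}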

Comments? Improvements?


Replies are listed 'Best First'.
Re: Parsing out "unique" messages from mbox files by meredith (Friar) on Jun 13, 2003 at 13:34 UTC
Message-IDs may not follow the format you are using; some clients and spambots send broken ones. My approach would be to eat one message at a time, MD5 the whole thing, and put it in a hash of md5 => message. (You can either check for previous hits or overwrite the old ones; no real diff ;) Then, after eating the source, you can dump everything out with a `print $uniques{$_} foreach (keys %uniques);` Once again, that is just how I would approach it, and it looks like your code will work if the MIDs are correct anyway. HTH, and ++ on the `use strict;` too! (Don't see it much in SoPW :) `mhoward - at - hattmoward.org`
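A minimal sketch of the MD5-per-message idea described above, not meredith's actual code. It assumes each message starts at a `From ` line that follows a blank line, and it ignores trailing blank lines when hashing; both are my assumptions. The names `unique_messages` and `$mbox` are hypothetical. Unlike dumping `keys %uniques`, it keeps the messages in their original order, because hash keys come back in no particular order.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Return the messages of an mbox string with exact duplicates removed,
# in their original order. (Hypothetical helper, for illustration only.)
sub unique_messages {
    my ($text) = @_;
    my (%seen, @unique);

    # A new message begins at a "From " line preceded by a blank line.
    for my $msg (split /(?<=\n\n)(?=From )/, $text) {
        (my $key = $msg) =~ s/\n+\z//;    # ignore trailing blank lines
        push @unique, $msg unless $seen{ md5_hex($key) }++;
    }
    return @unique;
}

# Tiny demonstration: the third message is a byte-for-byte copy of the first.
my $mbox = join '',
    "From a\@example.org Fri Jun 13 13:00:00 2003\nSubject: one\n\nhello\n\n",
    "From b\@example.org Fri Jun 13 13:05:00 2003\nSubject: two\n\nworld\n\n",
    "From a\@example.org Fri Jun 13 13:00:00 2003\nSubject: one\n\nhello\n\n";

my @unique = unique_messages($mbox);
printf "%d unique of 3\n", scalar @unique;    # prints "2 unique of 3"
```

To run it over real files, something like `local $/; print unique_messages(join '', <>);` should do, at the cost of holding everything in memory.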
Re: Re: Parsing out "unique" messages from mbox files by waswas-fng (Curate) on Jun 13, 2003 at 14:09 UTC
Aye, not only can Message-IDs be broken, you may find that they are not unique. The MD5 approach listed above will catch any unique message. -Waswas
Re: Parsing out "unique" messages from mbox files by valdez (Monsignor) on Jun 13, 2003 at 14:04 UTC
I found a little script I used many years ago to merge some mailboxes; it uses `formail` and computes md5 of every message to check for duplicates:

# shell script
for file in inboxes/* ; do
    echo "removing digests";
    rm -f digests/*;
    echo "removing output/$file";
    rm "output/$file";
    echo "digesting $file";
    cat $file | formail -ds ./save $file;
    echo ;
done

#!/usr/bin/perl
# Perl script 'save'
use strict;
use Digest::MD5 qw(md5_hex);
use vars qw($data $digest $mbox);

$mbox = shift @ARGV;
while (<>) {
    $data .= $_;
}
$digest = md5_hex($data);
if (-e "digests/$digest") {
    print "!";
}
else {
    open(F, ">>output/$mbox") or die "outbox: $!";
    print F $data;
    close(F);
    open(G, ">digests/$digest") or die "digests: $!";
    print G "\n";
    close(G);
    print ".";
}

`formail` was used because I didn't find a solution to split a mailbox using Perl. It will probably fail to detect a duplicate when the same message has different status (new, read, answered, etc). HTH, Valerio
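One way around the status problem mentioned above might be to drop the headers that mail clients rewrite before computing the digest. This is only a sketch and not part of valdez's script: the header list in `$volatile` is my assumption about what commonly changes, and `stable_digest` is a hypothetical name.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Headers that clients tend to rewrite when a message is read, answered
# or re-saved. This list is an assumption; adjust it for your own clients.
my $volatile = qr/^(?:Status|X-Status|X-Keywords|X-UID|Content-Length|Lines):/i;

# Digest of a single message that ignores the volatile headers above.
sub stable_digest {
    my ($msg) = @_;
    my ($head, $body) = split /\n\n/, $msg, 2;
    $head = '' unless defined $head;
    $body = '' unless defined $body;
    $body =~ s/\n+\z//;                       # ignore trailing blank lines

    # Split into logical header lines, keeping folded continuations attached.
    my @headers = grep { !/$volatile/ } split /\n(?![ \t])/, $head;
    return md5_hex(join("\n", @headers) . "\n\n" . $body);
}

# The same message, once unread and once marked as read.
my $new  = "From a\@example.org Fri Jun 13 13:00:00 2003\nSubject: hi\n\nbody\n";
my $read = "From a\@example.org Fri Jun 13 13:00:00 2003\nSubject: hi\nStatus: RO\n\nbody\n";

my $same = stable_digest($new) eq stable_digest($read);
print $same ? "same\n" : "different\n";       # prints "same"
```

Using `stable_digest($data)` in place of `md5_hex($data)` in the `save` script would be the obvious place to try it.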
Re: Parsing out "unique" messages from mbox files by blahblah (Friar) on Jun 13, 2003 at 16:33 UTC
I wouldn't roll my own mbox parser. I would use a CPAN module or, if I HAD to, hack OpenWebmail's mbox parsing to do my evil bidding. Parsing mbox is supposed to be fairly simple, but in my experience there are so many weird variances that I've had much better success using mature, well-maintained parsers.
Re: Parsing out "unique" messages from mbox files by Aristotle (Chancellor) on Jun 15, 2003 at 13:53 UTC
I'd merge any mboxes that I can comfortably `procmail` apart again afterwards. Then `formail`'s `-D` switch should get you at least 90% of the way there. Makeshifts last the longest.