#!/usr/local/bin/perl

$VERSION = '0.05';

use warnings; use strict;

#  undupe-mail-body - Print only the non-duplicate messages based on body.
#  MODIFIED:  Nov 17 2005

use Digest::MD5;
use Mail::Mbox::MessageParser;

use Pod::Usage;

pod2usage( { '-exitval' => 0 , '-verbose' => 2} )
 if scalar map
            { $_ =~ m/^ -+ (?: \? | h(?:elp)? ) $/ix
              ? 1 : ()
            } @ARGV;

pod2usage( { '-exitval' => 1 , '-verbose' => 0} )
  unless scalar @ARGV;

Mail::Mbox::MessageParser::SETUP_CACHE
( { 'file_name' => '/tmp/cache.mbox-parser' } );

my @mboxes = @ARGV;
foreach my $mb ( @mboxes )
{
  my $parser =
    Mail::Mbox::MessageParser->new
    ( { 'debug' => 1
      , 'enable_cache' => 1
      , 'enable_grep'  => 1
      , 'file_name' => $mb
      }
    );

  #  new() gives back an error string instead of an object on failure.
  unless ( ref $parser ) { warn $parser; next; }

  #  Save for later to print.
  my $prolog = $parser->prologue();
  my $ordered = 1;
  my $msgs = find_unique( $parser , $ordered);
  print_mail
  ( $ordered ? order($msgs) : [ map $_->[0] , @$msgs ] , $prolog );
}

exit;

#  Return an array reference of array references containing a message &
#  its (optional) order, given Mail::Mbox::MessageParser object & an
#  indicator for ordering request.
sub find_unique
{
  my ($parser , $ordered) = @_;
  my %seen;
  my $count = 0;
  while( ! $parser->end_of_file() )
  {
    my $mail = $parser->read_next_email;
    my $digest = get_body_digest($mail);

    $seen{$digest} = [ $mail , $ordered ? $count++ : () ]
      unless exists $seen{$digest} ;
  }
  return [values %seen];
}

#  Return array reference of ordered messages (as scalar references)
#  given in an array reference containing a message & order in another
#  array reference.
sub order
{
  my ($messages) = @_;
  return
    [ map $_->[0] , sort { $a->[1] <=> $b->[1] } @$messages ];
}

#  Print messages given as an array reference of scalar references, w/
#  optional prologue.
sub print_mail
{
  my ($mails , $prolog) = @_;
  print $prolog if defined $prolog;
  print $$_ , $/ for @$mails;
}

#  Return the hex MD5 digest of the body of an email message, given as
#  a scalar reference.  The body starts at the first empty line.
sub get_body_digest
{
  my ($text) = @_;

  my $md5 = Digest::MD5->new;

  #  Extract body from a message.
  my $start_body;
  while ( $$text =~ m/^(.*)$/mg )
  {
    #  Copy $1 now; any later successful match resets it.
    my $line = $1;
    $start_body = 1 if $line eq '';
    #  Add the line as is, so that a line of just "0" is not dropped.
    $md5->add( $line ) if $start_body;
  }

  #  For Debugging.
  #printf STDERR "==>> %s  %s\n" , $md5->clone->hexdigest , $text;
  #printf STDERR "==>> %s\n" ,  $text;

  return $md5->hexdigest;
}

__END__

=pod

=head1 NAME

undupe-mail-body - Print only the non-duplicate messages based on
body.

=head1 SYNOPSIS

  undupe-mail-body -help

  undupe-mail-body <mbox> [mbox2 , [mbox3 , ... ]]

  undupe-mail-body mbox-with-body-dups > mbox-without-body-dups

=head1 DESCRIPTION

I found at least two tools, L<procmail(1)> and L<mutt(1)>, which
can delete duplicate messages based on the C<Message-ID> header.
I failed to find anything which would delete messages based on
duplicate BODIES.

Given I<mbox>-format mailboxes, this program prints, on I<standard
out>, only those messages which are unique based on the body alone.
The original mailbox is opened only for reading.  Of a set of
duplicates, only the first instance encountered is retained.

=head2 Incorrect start of email found

For some of the messages in a mailbox that otherwise loads fine in
L<mutt(1)>, L<Mail::Mbox::MessageParser> reports C<Incorrect start
of email found>.  The message does not appear if C<enable_cache> (and
C<enable_grep>) are turned off on the first run, or on a rerun with
the C<enable_*> options turned on.

So, please do not be alarmed (as I was) if the above happens.

=head1 OPTIONS

=over 2

=item B<help>

Show help message.

=item B<ordered>

Keep the output in the same order as the input, minus any duplicates.

B<Currently, it is a hard-coded value.>

=back

=head1 TO DO

Allow I<ordered> option to be set on command line.
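
A minimal sketch of how that might look, assuming L<Getopt::Long> (a
core module) is acceptable here.  C<parse_args> is a hypothetical
helper, not part of this program; with it, C<--no-ordered> would turn
ordering off and everything else would be taken as a mailbox name.

  use strict; use warnings;
  use Getopt::Long qw( GetOptionsFromArray );

  #  Hypothetical helper: split an argument list into the "ordered"
  #  flag & the remaining mailbox names.
  sub parse_args
  {
    my @args = @_;
    my $ordered = 1;    #  Same default as the hard-coded value.
    GetOptionsFromArray( \@args , 'ordered!' => \$ordered )
      or die "Bad options\n";
    return ( $ordered , @args );
  }

  my ( $ordered , @mboxes ) = parse_args( @ARGV );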

I would like this program to work as a filter too, so that it also
reads its input from I<standard in>.  This can be achieved by passing
C<\*STDIN> as the C<file_handle> option of
C<Mail::Mbox::MessageParser-E<gt>new()>.

=head1 BUGS

=over 2

=item

After building a cache for the first time for a mailbox, the same
SCALAR reference is printed (via the commented-out
C<printf STDERR ...> in C<get_body_digest()>) for all the messages.
Subsequent runs produce the expected output.

=back

=head1 AUTHOR, LICENSE, DISTRIBUTION, ETC.

Parv, parv_@yahoo.com

Modified:  Nov 17 2005

This software is free to be used in any form only if proper credit is
given.  I am not responsible for any kind of damage or loss.  Use it
at your own risk.

=cut

In reply to Remove messages w/ duplicate bodies from mbox(es) by parv
