in reply to Parsing email files, what are the best modules ?

You could have a look at the Mail::Box module from CPAN. I assume that the mail folders are in the format of unix mboxes, ascii-mode, line-by-line.

I have started on a simple perl app to do what you have described...
   #!C:\Perl\bin\Perl.exe -w
   use strict;
   use IO::File;
   use Data::Dumper;
   use Mail::Box;

   # Load mail list
   my $MailList = load_mail_list('./list25B6.txt');
   print Dumper($MailList);

   # Load folder list
   my $MailFolder = load_mail_folders('./hierarch.txt');
   print Dumper($MailFolder);

   # Parse folder files
   foreach (values %{$MailFolder}) {
       parse_mail_folder($_);
   }

   sub parse_mail_folder {
       # to be completed when I get back home...
   }

   sub load_mail_list {
       my $filename = shift;
       my $f = IO::File->new($filename, "r")
           or die "Can not open mail list: $!";
       my %mlist;

       # load the header
       chomp($mlist{title}  = <$f>);
       chomp($mlist{sender} = <$f>);
       chomp($mlist{nosig}  = <$f>);

       # load the rest of the email addresses
       my %MailAddress;
       while (<$f>) {
           chomp;
           my ($name, $email) = /^(.*)\s+<(.*)>$/;
           # skip lines that do not look like "Name <address>"
           next unless defined $email and $email ne '';
           $MailAddress{$email} = $name;
       }
       $mlist{mlist} = \%MailAddress;

       return \%mlist;
   }

   sub load_mail_folders {
       my $filename = shift;
       my $f = IO::File->new($filename, "r")
           or die "Can not open mail folder list: $!";
       my %mbox;

       while (<$f>) {
           chomp;
           next unless ( $_ ne '' and m/^0,0,/ );
           s/"//g;
           my @fld = split /,/;
           my $folder = (split /:/, $fld[2])[2];             # capture 3rd field
           $mbox{$fld[-1]} = "D:/Pmail/mail/$folder.PPM";    # full path to mboxes
       }

       return \%mbox;
   }
And the output so far...
   $VAR1 = {
             'title' => '\\TITLE Email Distribution',
             'nosig' => '\\NOSIG Y',
             'mlist' => {
                          'jbarker@someotherdomain.org' => 'David & Jan Barker',
                          'johnarnold@somedomain.com' => 'John & Jenny Arnold'
                        },
             'sender' => '\\SENDER Peter Rabbitt <peterr@example.com>'
           };
   $VAR1 = {
             'Main' => 'D:/Pmail/mail/FOL07093.PPM',
             'Microsoft' => 'D:/Pmail/mail/FOL024EB.PPM',
             'Sent' => 'D:/Pmail/mail/FOL04816.PPM'
           };
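The address-parsing step of load_mail_list can be tried on its own, without the list file. The following is only a sketch: it feeds three in-memory sample lines (two taken from the Dumper output above, plus one header line) through a slightly stricter variant of the regex, and the helper name parse_address_line is made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split one "Name <email>" line into its two parts.
# Returns an empty list when the line does not have that shape.
sub parse_address_line {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;                      # tolerate CR/LF line endings
    my ($name, $email) = $line =~ /^(.*?)\s*<([^<>]+)>\s*$/
        or return;
    return ($name, $email);
}

my %address;
for my $line ("David & Jan Barker <jbarker\@someotherdomain.org>\r\n",
              "John & Jenny Arnold <johnarnold\@somedomain.com>\n",
              "\\NOSIG Y\n") {                 # header line: no address, skipped
    my ($name, $email) = parse_address_line($line) or next;
    $address{$email} = $name;
}

print "$_ => $address{$_}\n" for sort keys %address;
```

Stripping the line ending by hand rather than with chomp means the same code also copes with a CR/LF file read on a system that does not translate line endings.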

Replies are listed 'Best First'.
Re: Re: Parsing email files, what are the best modules ?
by peterr (Scribe) on Nov 10, 2003 at 11:25 UTC
    Hi Roger,

    Many thanks for that example you posted.

    I assume that the mail folders are in the format of unix mboxes, ascii-mode, line-by-line.

    Yes, they are ascii-mode, with CR/LF's. When I tried the script, I got this message:

       D:\Perl\myscripts>\perl\bin\perl.exe checke~1.pl
       Can't locate Mail/Box.pm in @INC (@INC contains: D:/Perl/lib D:/Perl/site/lib .) at checke~1.pl line 5.
       BEGIN failed--compilation aborted at checke~1.pl line 5.

    I then checked for Mail::Box, and it didn't appear to be part of the ActiveState Perl I have installed. I then used PPM (version 3.0.1) and ran an "install Mail::Box" command; it took about 10 minutes, but said everything was okay. However, the same error message appeared.

    The PPM search for Mail::Box displayed

       ppm> search Mail::Box
       Searching in Active Repositories
         1. Mail-Box-Parser-C [3.003] C parser for Mail::Box

    and that is all that was installed, just a file called "C.pm" in the folder D:\Perl\site\lib\Mail\Box\. I did download the file http://perl.overmeer.net/mailbox/source/source-current.tar.gz, and there is a Box.pm in that archive, but I don't know where to put it. No doubt I should somehow reference the tar file in PPM for the install? When I do a SET command at DOS, there are no environment variables for Perl?
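On the question of where Box.pm belongs: Perl resolves "use Mail::Box" by looking for the relative path Mail/Box.pm under each directory listed in @INC. That list is compiled into the perl binary (and can be extended through PERL5LIB), which is why no Perl environment variables need to be set. A small core-only sketch that shows where a module file is visible; the helper name find_in_inc is made up for this illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: return every path under @INC where the module's
# file exists. An empty list means "use" of that module would fail.
sub find_in_inc {
    my ($module) = @_;
    (my $relative = "$module.pm") =~ s{::}{/}g;    # Mail::Box -> Mail/Box.pm
    return grep { -f $_ }
           map  { "$_/$relative" }
           grep { !ref $_ } @INC;                  # skip code-ref hooks in @INC
}

for my $module (qw(strict Mail::Box)) {
    my @hits = find_in_inc($module);
    print "$module: ", (@hits ? join(', ', @hits) : 'not found in @INC'), "\n";
}
```

Copying Box.pm into one of those directories by hand would satisfy this particular check, but the distribution contains many more files, so running its Makefile.PL and the make steps is the safer route.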

    I'm trying to read up more on the documentation also.

    Many thanks,

    Peter

        Hi,

        Thanks, I have read most of that now, installed 'nmake', and have been using the

        perl -MCPAN -e "shell"

        to install the Mail::Box modules and some others.

        Peter

      Hi Peter, you could read the PPM documentation, as the Anonymous Monk suggested. You will probably also need Mail::Box and all of its derived modules.

      I have completed the code I started earlier. The additional code is an example of the kind of thing you could do with the Mail::Box::Manager module. Pretty handy, I think.
         #!C:\Perl\bin\Perl.exe -w
         use strict;
         use IO::File;
         use Data::Dumper;
         use Mail::Box;
         use Mail::Box::Manager;

         # Load mail list
         my $MailList = load_mail_list('./list25B6.txt');
         print Dumper($MailList);

         # Load folder list
         my $MailFolder = load_mail_folders('./hierarch.txt');
         print Dumper($MailFolder);

         # Parse folder files
         foreach (values %{$MailFolder}) {
             parse_mail_folder($_);
         }

         # Optionally output $MailList into another file, etc.
         # And other things ...

         exit(0);

         sub parse_mail_folder {
             my $folder_file = shift;

             my $mgr    = Mail::Box::Manager->new();
             my $folder = $mgr->open($folder_file)
                 or die "Can not open folder $folder_file";

             foreach my $message ($folder->messages) {
                 my $dest = $message->get('To');       # retrieve the To-address
                 next unless defined $dest;            # message has no To: header
                 my @email_addr = split /,/, $dest;    # retrieve multiple addresses

                 # assume the email address format is as follows -
                 #
                 #    John & Jenny Arnold <johnarnold@somedomain.com>
                 #
                 # you have to tweak a bit if the format is not as expected
                 # or use the Mail::Address module to do the trick - to
                 # convert the mail address into its canonical form.

                 foreach (@email_addr) {
                     my ($name, $addr) = /(.*)<(.*)>/;
                     next unless defined $addr;        # not in "Name <address>" form

                     # note the binding operator =~ (a plain = would run the
                     # substitution on $_ and assign its result instead)
                     $name =~ s/^\s+//;   # trim spaces at front
                     $name =~ s/\s+$//;   # trim spaces at rear
                     $addr =~ s/^\s+//;   # trim spaces at front
                     $addr =~ s/\s+$//;   # trim spaces at rear

                     # load_mail_list keeps the addresses under the 'mlist' key
                     if (! exists $MailList->{mlist}{$addr}) {
                         # ok, we haven't seen this Email address yet
                         $MailList->{mlist}{$addr} = $name;
                         # and do other things
                     }
                 }
             }

             $folder->close;
         }

         sub load_mail_list {
             my $filename = shift;
             my $f = IO::File->new($filename, "r")
                 or die "Can not open mail list: $!";
             my %mlist;

             # load the header
             chomp($mlist{title}  = <$f>);
             chomp($mlist{sender} = <$f>);
             chomp($mlist{nosig}  = <$f>);
             <$f>;   # skip one more line

             # load the rest of the email addresses
             my %MailAddress;
             while (<$f>) {
                 chomp;
                 my ($name, $email) = /^(.*)\s+<(.*)>$/;
                 next unless defined $email and $email ne '';
                 $MailAddress{$email} = $name;
             }
             $mlist{mlist} = \%MailAddress;

             return \%mlist;
         }

         sub load_mail_folders {
             my $filename = shift;
             my $f = IO::File->new($filename, "r")
                 or die "Can not open mail folder list: $!";
             my %mbox;

             while (<$f>) {
                 chomp;
                 next unless ( $_ ne '' and m/^0,0,/ );
                 s/"//g;
                 my @fld = split /,/;
                 my ($folder) = $fld[2] =~ /.*:.*:(.*)/;
                 $mbox{$fld[-1]} = "D:/Pmail/mail/$folder.PPM";   # full path to mboxes
             }

             return \%mbox;
         }

        Hi Roger,

        You will probably also need Mail::Box and all of its derived modules.

        I certainly got plenty of these messages

           Warning: prerequisite Scalar::Util failed to load: Can't locate Scalar/Util.pm in @INC (@INC contains: D:/Perl/lib D:/Perl/site/lib .) at (eval 46) line 3.
           Warning: prerequisite Test::Harness 1.38 not found at D:/Perl/lib/ExtUtils/MakeMaker.pm line 343.

        when running the Makefile.PL from the Mail::Box tar archive. The problem is, I used

           perl -MCPAN -e "shell"
           cpan> install Scalar::Util

        and the error message still appeared, even though the install went okay ? Even re-installing Mail::Box

           D:\Perl\myscripts>\perl\bin\perl.exe -MCPAN -e "shell"

           cpan shell -- CPAN exploration and modules installation (v1.59_54)
           ReadLine support available (try 'install Bundle::CPAN')

           cpan> install Mail::Box
           CPAN: Storable loaded ok
           Going to read \.cpan\Metadata
             Database was generated on Tue, 11 Nov 2003 00:45:51 GMT
           Mail::Box is up to date.

           cpan> q
           Lockfile removed.

        and then running the Perl script, still gave the following

           D:\Perl\myscripts>\perl\bin\perl.exe checke~1.pl
           Can't locate Scalar/Util.pm in @INC (@INC contains: D:/Perl/lib D:/Perl/site/lib .) at D:/Perl/site/lib/Mail/Reporter.pm line 9.
           BEGIN failed--compilation aborted at D:/Perl/site/lib/Mail/Reporter.pm line 9.
           Compilation failed in require at (eval 1) line 3.
                   ...propagated at D:/Perl/lib/base.pm line 62.
           BEGIN failed--compilation aborted at D:/Perl/site/lib/Mail/Box.pm line 8.
           Compilation failed in require at checke~1.pl line 5.
           BEGIN failed--compilation aborted at checke~1.pl line 5.

        I have checked out all the "prerequisite" warning messages, made a note of those modules, then used the CPAN shell to install them. The install appears to go okay: it goes out to the internet, parses through files on FTP sites, and says _that_ module has installed okay. ??

        Going back to where I think (but don't really know) the Perl script is stopping: it is line 9 of Reporter.pm, which has

           use Scalar::Util 'dualvar';

        and I know I have installed _that_ module. The other related code from the error messages are

           # msg - "...propagated at D:/Perl/lib/base.pm line 62."
           die if $@ && $@ !~ /^Can't locate .*? at \(eval /;

           # msg - "compilation aborted at D:/Perl/site/lib/Mail/Box.pm line 8."
           use base 'Mail::Reporter';

        I'm just about all debugged out, and have run out of clues.
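One thing worth ruling out at this point is that the install went to a different Perl, or a different directory, than the one the script runs with. The sketch below works under that assumption only; it is not a diagnosis of this particular setup. The helper name module_status is made up, and Scalar::Util and Mail::Box are used simply because they are the modules named in the errors.

```perl
use strict;
use warnings;

# Hypothetical helper: try to load a module at run time and report
# where it was loaded from, or why loading failed.
sub module_status {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;    # Scalar::Util -> Scalar/Util.pm
    if (eval { require $file; 1 }) {
        return "$module loaded from $INC{$file}";
    }
    (my $error = $@) =~ s/\s+\z//;
    return "$module FAILED: $error";
}

print "perl binary: $^X\n";                    # which perl is running this script
print "\@INC: @INC\n";                          # where that perl looks for modules
print module_status($_), "\n" for qw(Scalar::Util Mail::Box);
```

Running the same script with the perl that the CPAN shell was started from, and comparing the two outputs, shows quickly whether both are looking in the same directories.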

        I have completed the code I started earlier. The additional code is an example of the kind of thing you could do with the Mail::Box::Manager module. Pretty handy, I think.

        Thanks very much for that additional code, Roger. I guess the big question is, what is different on your Perl setup to mine ??

        Thanks a lot, :)

        Peter

Re: Re: Parsing email files, what are the best modules ?
by Anonymous Monk on Nov 12, 2003 at 11:36 UTC

    I never use Windows, so cannot help you with the installation. However, I know that the tests produce many errors and warnings which can be ignored: the Windows users of MailBox seem unable to help me with real fixes for the tests.

    For your implementation, I advise one of these two approaches: use Mail::Message->build() (look for the details of build in this module by selecting the "methods sorted alphabetically" in the right column).

    The other approach may be much simpler: first reconstruct your data into a MIME compliant message, and then call Mail::Message->read($m).

    More help is available on the mailbox mailing list.

    By the way: best way to parse e-mail addresses from a header line is like this:

       my $msg = Mail::Message->read($data);
       my @addr = $msg->get('To')->addresses;
    
    The addresses are Mail::Address objects, which are relatively smart. Parsing addresses in reality is a very complicated task.
    Mark Overmeer.
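To see why splitting a To: line on commas is fragile, consider a display name that itself contains a comma. The sketch below uses only core Perl and an invented header value. Text::ParseWords is a stand-in chosen here because it ships with Perl; it is not what Mail::Box uses, and it is not a full address parser. With MailTools installed, Mail::Address->parse($line) is the better tool, as suggested above.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# Invented example header: the first display name contains a comma.
my $to = '"Arnold, John" <johnarnold@somedomain.com>, '
       . 'Jan Barker <jbarker@someotherdomain.org>';

# Naive approach: the quoted comma splits one recipient into two pieces.
my @naive = split /\s*,\s*/, $to;

# Quote-aware split. Still only a stopgap: comments, groups and a stray
# apostrophe in a name (O'Brien) are not handled correctly.
my @aware = parse_line('\s*,\s*', 1, $to);

print scalar(@naive), " pieces from split\n";
print scalar(@aware), " pieces from parse_line\n";
print "  $_\n" for @aware;
```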

      Hi Mark,

      I never use Windows, so cannot help you with the installation. However, I know that the tests produce many errors and warnings which can be ignored: the Windows users of MailBox seem unable to help me with real fixes for the tests.

      I have only previously used Perl on a Linux box, and apart from my ignorance of Perl, there have really been no problems. Using it on Windows has been SO different, lots of problems, but I wanted to also use it on Win to improve my Perl skills, plus I save on bandwidth. There are many ad-hoc things I would have previously done in Clipper, but I can see how much more powerful, for tasks of this nature, Perl is.

      In regards to getting it all going on Win, fortunately it is all sorted out now. I removed everything, re-installed ActiveState Perl 5.8.0.806, and then tried the sample code that Roger kindly supplied. There were so many error messages (not from Roger's code) that I added the CGI::Carp module to log all the messages out to a file. That was very handy. Then, as I found out the cause (looking in various .pm files), I first attempted to install the missing modules by using PPM. However, with only two standard repositories and the trouble I had with referencing either local or remote 'repositories', I decided the best option for the missing modules was to download the entire ...tar.gz file in each case, read the install/readme, then do the actual install. All of them worked fine, so the underlying problem in using the code supplied was not the code itself, but module dependencies. The modules I had to install manually (makefile, etc.) were:

      IO

      IO-stringy

      Mail-Box

      MailTools

      TimeDate

      then the Perl script worked. There is only one minor problem, where it is not jumping into a 'foreach' loop, but I _think_ that is because I need to do a bit more research on the 'type' of mailbox being opened. :)

      For your implementation, I advise one of these two approaches: use Mail::Message->build() (look for the details of build in this module by selecting the "methods sorted alphabetically" in the right column).

      Thanks for that, I did have a look. In the current task/problem, I want to read multiple mailboxes and check that my distribution lists are up to date. However, I will certainly come back to "build", because another task is to fix the problems I'm having with Net::SMTP, so possibly I could look at using Mail::Message instead.

      The other approach may be much simpler: first reconstruct your data into a MIME compliant message, and then call Mail::Message->read($m)

      That could be a good solution for this problem, because the mailboxes are not what I would call 'standard', if standard means anything of a *nix flavour. I notice when I go to add a new mailbox under Pegasus mail, there is an option to create it in either:

      Pegasus Mail v2.X

      Unix Mailbox format

      unfortunately, the default is the first one, so I think the only remaining problem with using Mail::Box on a Win box is for me to get into a hex viewer and see what is so different from a *nix mailbox. It may be better to just open each mailbox as an ascii file and then re-create it, for temporary purposes, in Unix format. The other two subs in Roger's code work perfectly ("load_mail_list" and "load_mail_folders"), but I have a feeling the only reason the 'foreach' loop isn't being executed in sub "parse_mail_folder" is that the actual mailbox (Pegasus format) is not "normal". There are all the usual email headers, and many of the folders/mailboxes have multi-part messages in them, but the first record has a lot of extra chars in it. :)
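A quick way to look at the start of a mailbox without a hex viewer is sketched below. It relies on one fact about the Unix mbox format, namely that each message starts with a separator line beginning "From " (with a trailing space); it assumes nothing about the Pegasus format itself. The helper name sniff_mailbox and the default of 16 bytes are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether a file starts with an mbox-style
# "From " separator, and return a hex dump of its first bytes.
sub sniff_mailbox {
    my ($path, $count) = @_;
    $count ||= 16;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;                               # no CR/LF translation on Windows
    my $read = read($fh, my $head, $count);
    die "Cannot read $path: $!" unless defined $read;
    close $fh;
    my $hex = join ' ', map { sprintf('%02X', ord($_)) } split //, $head;
    return (($head =~ /\AFrom / ? 1 : 0), $hex);
}

# Usage: perl sniff.pl D:/Pmail/mail/FOL07093.PPM ...
for my $path (@ARGV) {
    my ($is_mbox, $hex) = sniff_mailbox($path);
    print "$path: ",
          ($is_mbox ? 'starts like a Unix mbox' : 'no leading "From " line'), "\n";
    print "  first bytes: $hex\n";
}
```

If the first bytes are not a "From " line, that would be consistent with the extra characters seen in the first record, and with the message loop finding nothing to iterate over.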

      More help is available on the mailbox mailing list.

      Thanks, I will do a little bit more reading on the Mail::Box information in regards to opening mailboxes, and then will ask for help (I did add a "or die" after the open, but it's okay ??)

      By the way: best way to parse e-mail addresses from a header line is like this:

         my $msg = Mail::Message->read($data);
         my @addr = $msg->get('To')->addresses;

      The addresses are Mail::Address objects, which are relatively smart. Parsing addresses in reality is a very complicated task.

      I will try that method out, the code I had been using

         my $mgr = Mail::Box::Manager->new();
         my $folder = $mgr->open($folder_file)
             or die "Cannot open Folder", "\n";
         my @email_addr;
         foreach my $message ($folder->messages) {
             print $message->get('Subject') || '<no subject>', "\n";
             print "into foreach loop", "\n";
             # etc,etc

      .. never gets to print the "into foreach loop", but I guess it is the mailbox 'type'. :)

      Thanks for all your help,

      Peter