peterr has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

In "Parsing email files, what are the best modules?", I stated my objectives for a small project. Essentially I need to make sure my email distribution lists are up to date. I'm using Pegasus Mail as my email client, and the Perl I'm using is ActiveState 5.8.0.806 on a Windows box.

Thanks to the code from Roger, in reply to the question, two out of the three objectives are working just fine. :)

The third part, addressed in

sub parse_mail_folder

basically won't work because the 'mailboxes' are not of a Unix format. I couldn't work out why the code wouldn't drop into the 'foreach' loop (in _that_ sub), because it was doing steps 1 and 2 perfectly, and setting up the path/name to the mailboxes okay. Then having a look at one of the examples in the "MailBox" distribution

    #!/usr/bin/perl
    # Demonstration on how to use the manager to open folders, and then
    # to print the headers of each message.
    #
    # This code can be used and modified without restriction.
    # Mark Overmeer, <mailbox@overmeer.net>, 9 nov 2001

    use warnings;
    use strict;
    use lib '..', '.';
    use Mail::Box::Manager 2.00;

    #
    # Get the command line arguments.
    #

    die "Usage: $0 folderfile\n" unless @ARGV==1;
    my $filename = shift @ARGV;

    #
    # Open the folder
    #

    my $mgr    = Mail::Box::Manager->new;
    my $folder = $mgr->open
      ( $filename
      , extract => 'LAZY'   # never take the body unless needed
      );                    # which saves memory and time.

    die "Cannot open $filename: $!\n" unless defined $folder;

    #
    # List all messages in this folder.
    #

    my @messages = $folder->messages;
    print "Mail folder $filename contains ", scalar @messages, " messages:\n";

    my $counter  = 1;
    foreach my $message (@messages)
    {   printf "%3d. ", $counter++;
        print $message->get('Subject') || '<no subject>', "\n";
    }

    #
    # Finish
    #

    $folder->close;

and tried it against one of the Pegasus mailboxes

    D:\temp\MailBox\Mail-Box-2.051\examples>perl open.pl c:\pmail\mbox\FOL03E97.PMM
    Mail folder c:\pmail\mbox\FOL03E97.PMM contains 0 messages:

I then created a "Unix" mailbox in Pegasus, ran the same script against that, and apart from a number of 'warning' messages, it worked fine. :)

    D:\temp\MailBox\Mail-Box-2.051\examples>perl open.pl c:\pmail\mbox\unx05e52.mbx
    WARNING: Illegal character in field name From sydney.dialix.com.au!bounced-addr Sat Nov 08 10
    Mail folder c:\pmail\mbox\unx05e52.mbx contains 8 messages:
      1. Summary of your weekly E-Mail charges from DIALix Sydney
      2. Summary of your weekly E-Mail charges from DIALix Sydney
      3. Connect debit
      4. New Cheque book
      5. RE: Deposit clearance
      6. Deposit clearance
      7. RE: September statement
      8. September statement

... so, I have now realised that there is probably no module to parse the Pegasus mailboxes, because they are not in Unix format. I don't intend to convert the mailboxes either, as there are nearly 300 and I can't find a tool to convert them en masse.

Therefore, this last part of the task/project is to simply read through the 300 'mailboxes' (files), and look for any email addresses. The email addresses could be in the header or body of the files, and many are multi-part and have encoded parts. All that said, it's just:

1. Reading each file, record by record (CR/LF as terminators)
2. Search the record for any email address/s
3. Extract the email address/s, and if it isn't in the array from sub "load_mail_list", then add the details to the array, like what Roger suggested

    if (! exists $MailList->{$addr}) {
        # ok, we haven't seen this Email address yet
        $MailList->{$addr} = $name;
        # and do other things
    }

Could someone please show me how to search for one or more email addresses in a line (record)? It would be nice to also grab the 'name' in addition to the email address, but that might be rather difficult, as there are so many different formats for writing a name alongside an email address.

Thanks, :)

Peter

Replies are listed 'Best First'.
Re: Parsing Pegasus email boxes
by graff (Chancellor) on Nov 16, 2003 at 08:34 UTC
    For grins, I googled "pegasus mail", and found some pages at http://www.pmail.com/, including a pointer to a handful of tools for converting from Pegasus mail files to other forms (for some reason, the link is found here: http://kbase.pmail.gen.nz/pegasus.cfm). I didn't find anything useful about the Pegasus file format (but did find some indications that the format specs would not be open to public inspection).

    Anyway, if any of those tools happens to work as a command-line utility (and is still available, up-to-date and within your budget), you could use it via a system call (or even a pipeline open() statement) to convert files as needed within your perl script.
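
    A minimal sketch of the "pipeline open()" idea, assuming some converter exists that writes the converted mailbox to standard output. No particular tool is implied: the helper name read_converted is made up for illustration, and perl itself stands in for the converter so that the sketch runs anywhere. Older Windows builds of perl may not support the list form of a pipe open, in which case the string form (command followed by "|") is the usual alternative.

```perl
use strict;
use warnings;

# Run an external command and return the lines it writes to stdout.
# @cmd is the full command as a list (program first, then arguments).
# read_converted is a hypothetical helper name, not part of any module.
sub read_converted {
    my @cmd = @_;

    # List-form pipe open: no shell is involved, so file names
    # containing spaces need no extra quoting.
    open( my $pipe, '-|', @cmd )
        or die "Cannot run $cmd[0]: $!";
    my @lines = <$pipe>;
    close($pipe)
        or die "$cmd[0] failed (exit status $?)";
    return @lines;
}

# Stand-in for a real converter: perl itself, printing one header line.
my @lines = read_converted( $^X, '-e', 'print "From: someone\@example.com\n"' );
print @lines;
```

    With a real converter, the list passed to read_converted would be the tool's path followed by the Pegasus file name, and the returned lines could be fed straight into the address extraction.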

    If that doesn't pan out, and you're just trying to extract email address strings, perhaps others can recommend a good module. I just tried Mail::Address, and it seems to work well for the following case:

    • You have the RFC822 message header pretty much intact
    • You pull out the (potentially multi-line) records from the header that start with "From: ", "To: ", "Cc: " and "Bcc: "
    • You pass each record (the part that follows ":"), to Mail::Address->parse()

    In other words, something like this (not tested on pegasus data):

        use strict;
        use Mail::Address;

        my $hdrstring = "";
        my $checknext = 0;

        open( M, "pegasus-mail.file" ) or die $!;
        while (<M>) {
            if ( /^(?:To|From|Cc|Bcc): (.*)/ ) {
                $checknext++;
                # append rather than overwrite, so that back-to-back
                # address headers (From: then To:) are all kept
                $hdrstring .= ( $hdrstring ? ", " : "" ) . $1;
            }
            elsif ( $checknext and /^\s+\S.*/ ) {
                $hdrstring .= $_;    # folded continuation line
            }
            else {
                $checknext = 0;
                if ( $hdrstring ) {
                    my @addr = Mail::Address->parse( $hdrstring );
                    for my $entry ( @addr ) {
                        print $entry->name, " <", $entry->address, ">", $/;
                    }
                    $hdrstring = "";
                }
            }
        }
    update: Having seen Roger's non-module solution, I'd like to point out that multi-token addresses (like "My Name <me@home.net>") can often get split up by a line-break in the header. Also, you may sometimes see the form "me@home.net (My Name)". The usage shown above for Mail::Address handles both problems, and normalizes the latter case, so that "My Name" is returned by "$a->name()", just like in the former case.
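
    To illustrate the line-break problem on its own, here is a small sketch that joins folded header lines before any address parsing is attempted. It is not how Mail::Address works internally; it only shows the folding rule (a line break followed by spaces or tabs continues the previous field). The helper name unfold_header and the sample addresses are made up for the example.

```perl
use strict;
use warnings;

# Join folded header lines: a line break followed by spaces or tabs
# continues the previous header field, so replace it with one space.
# unfold_header is a hypothetical helper name.
sub unfold_header {
    my ($header) = @_;
    $header =~ s/\r?\n[ \t]+/ /g;
    return $header;
}

# A To: field split over three physical lines, then a normal field.
my $raw = "To: My Name\n <me\@home.net>,\n\tother\@home.net (Other Name)\nSubject: test\n";
print unfold_header($raw);
```

    After unfolding, each address field sits on one line, so the part after the colon can be handed to Mail::Address->parse() as a single string.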
Re: Parsing Pegasus email boxes
by liz (Monsignor) on Nov 16, 2003 at 11:12 UTC
    Maybe it makes sense to apply your acquired knowledge on Pegasus mailboxes to Mail::Box, so you would be able to make use of all of the nice features of Mail::Box, and the rest of us would be able to use Pegasus mailboxes with Mail::Box as well.

    Contact markov or Mark Overmeer for more info.

    Liz

      Yes, it makes sense to write such an extension: I welcome all contributions. I am not familiar with Pegasus mail, but will support anyone who tries to implement it.

      MailBox uses OODoc, which means that it is simpler to write consistent documentation. If you decide to start coding based on an existing folder implementation, please ask me for a "raw" pm file (which contains both the OODoc POD and the code).

      Volunteers?

Re: Parsing Pegasus email boxes
by jonadab (Parson) on Nov 16, 2003 at 20:33 UTC

    First off, there are good reasons why Pegasus Mail uses its own format. Pegasus Mail's folders support some things that would not be possible with mbox format and are in some ways more robust. These days a mail directory format (e.g., nnml) would probably be better, but at the time the pmail folder format was designed, that would have been impractical for a number of reasons.

    There is not a large overlap between Perl users and pmail users, but there may be a few. I have used Pegasus Mail myself in the past, before I learned Gnus, and might be interested in collaborating with you on creating a parser for pmail folders.

    For just parsing, you can completely ignore the .pmi files; those are indices. (They might improve performance, but it's probably not worth figuring out a second file format just for that.) If you wanted to *write* pmail folders, then you would either have to update the indices, or delete them, the former being much preferable since deleting them would cause pmail to have to regenerate them next time it starts, which would cause a quite user-noticeable wait. But for just parsing, which sounds like a good initial goal, this is not a consideration.
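
    As a rough sketch of "ignore the .pmi files": when collecting the files to scan, keep only the message files. The .pmm extension comes from the file name shown earlier in the thread; treating it as the only folder extension is an assumption, and the helper name folder_files and the sample listing are made up for illustration.

```perl
use strict;
use warnings;

# Keep only the message files, dropping the .pmi index files and
# anything else. Matching on .pmm (case-insensitive) is an assumption
# based on the file name FOL03E97.PMM seen earlier in the thread.
sub folder_files {
    my @names = @_;
    return grep { /\.pmm\z/i } @names;
}

# Hypothetical directory listing, for illustration only.
my @listing = qw( FOL03E97.PMM FOL03E97.PMI FOL01234.PMM FOL01234.PMI );
print "$_\n" for folder_files(@listing);
```

    In a real script the listing would come from glob() or readdir() on the mailbox directory rather than a hard-coded list.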

    The good news about the format is that most of the binary-encoded information is stuff you probably don't need, if you aren't trying to clone Pegasus Mail. Labels and flags and annotations and things. The bodies are stored *mostly* unaltered, though of course there are some provisions to prevent any specific character's presence in a message from causing problems, so messages with characters outside the printable ASCII range are encoded in some fashion (if they weren't already for transport over SMTP, though they ought to have been, in theory).

    I have a number of pmail folders myself and, as I said, have some interest in working on a perl module for this. I am very unlikely to get around to doing it on my own, however.

    As far as a formal specification of the format, I do not believe one has ever existed. I suppose that David Harris is using the source code or comments in the source code as his documentation, and he is unwilling to let out the source without an NDA. (This also is why there is no Linux port. David is amenable in theory to the idea of someone doing such a port, but they would have to sign an NDA and meet other criteria. An unusual attitude for a freeware developer, perhaps, but Pegasus is of unusual quality, as well. I have switched to Gnus, because I was unwilling to be tied to a single platform and not skilled enough in C/C++ to attempt the port, even if I were willing to sign an NDA (something I have yet to think about enough to decide; I would have thought about it for pmail, if it were written in a language I would be comfortable working with to do a port).) So, we would have to reverse-engineer the format.


    $;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}} split//,".rekcah lreP rehtona tsuJ";$\=$ ;->();print$/
Re: Parsing Pegasus email boxes
by Roger (Parson) on Nov 16, 2003 at 07:35 UTC
    Hi Peter, I have written a simple email address capture program, without using any Mail:* modules. Provided that the Emails do not have the "From: " line in the subjects, the following should work fine.

        #!C:/Perl/bin/Perl.exe -w
        use strict;
        use Data::Dumper;

        my %MailList;  # A list of all seen addresses.
                       # This hash is setup in step 1 in the
                       # previous post.

        # Parse the mail box
        while (<DATA>) {
            next unless /^From:\s/;  # ignore all other lines
            chomp;                   # except the from line

            # assume that email addresses are separated
            # by the ',' character.
            my @emails = split /,\s*/, substr($_, 6);

            # build a hash from the email address list
            my %emails = map {
                s/\s+$//g;  # remove trailing spaces
                my ($name, $address);
                if (/>$/) {
                    # look for < ... >
                    ($name, $address) = /^(.*)\s<(.*)>$/;
                }
                else {
                    $name = $address = $_;
                }
                $address => $name
            } @emails;

            # inspect captured email details
            # print Dumper(\%emails);

            # check if the email address(es) are already seen
            foreach (keys %emails) {
                if (! exists $MailList{$_}) {
                    # ok, we haven't seen this Email address yet
                    $MailList{$_} = $emails{$_};
                }
            }
        }

        print Data::Dumper->Dump([\%MailList], ['MailList']);

        __DATA__
        From: Roger <roger@somewhere.com>, Roger 2 <roger2@nowhere.com>
        To: Peter R <peterr@us.com>
        Subject: Hello

        Hello Peter, this from address has two addresses

        From: roger3@somewhere.com
        To: Peter R <peterr@us.com>
        Subject: Hello

        Hello Peter, this from address has no full name

        From: roger3@somewhere.com
        To: Peter R <peterr@us.com>
        Subject: Hello

        Hello Peter, has already seen this from address
    And the output is
        $MailList = {
                      'roger2@nowhere.com' => 'Roger 2',
                      'roger3@somewhere.com' => 'roger3@somewhere.com',
                      'roger@somewhere.com' => 'Roger'
                    };
    The code is able to handle Email address in the forms of:
        Full Name <email.address@somewhere.com>
        Email.Address@somewhere.com
    You can tweak it a bit if the format of the mail addresses is different, or if they have an address separator other than ','. Oh yes, and put it in a subroutine and add more error checking, for good coding practice.

    Update: I agree that the Mail::Address module is much more capable than my simpler version. And I agree that using Mail::Box is really nice and easy. But Peter was having plenty of trouble getting them to work, so I posted an interim solution to help him get things going in the meantime.
Re: Parsing Pegasus email boxes
by peterr (Scribe) on Nov 17, 2003 at 00:11 UTC
    Hi Roger,

    I have written a simple email address capture program, without using any Mail:* modules. Provided that the Emails do not have the "From: " line in the subjects, the following should work fine.

    Yes, it did, thanks, and I was surprised at how out of date the distribution list is. Considering the mail boxes are more than 100Mb, perl was very quick to do the job, less than 30 seconds I'd say. No doubt it will take marginally longer to look at all the records (rather than just the "From" lines); I'd need to do this because sometimes people advise of new addresses in the body portion. I put it in a sub as you suggested. Your interim solution is ideal for my purposes, and helped me learn how to parse through files, looking for specifics (I find the Windows 'find' quite frustrating, as it won't look for multiple strings).
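
    For scanning every record rather than just the "From" lines, something along these lines might serve as a starting point. The pattern is a deliberately loose heuristic, not a full RFC 822 parser, and it will not see addresses inside encoded (e.g. base64) parts. The names $ADDR_RE and addresses_in_line and the sample lines are made up for illustration.

```perl
use strict;
use warnings;

# A loose pattern: local part, "@", then two or more dot-separated
# domain labels. A heuristic only, not a full RFC 822 address parser.
my $ADDR_RE = qr/[\w.+-]+\@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+/;

# Return every address-like string found in one line. Addresses are
# lower-cased so that the same address in different case is seen once.
sub addresses_in_line {
    my ($line) = @_;
    return map { lc($_) } $line =~ /($ADDR_RE)/g;
}

# Sample records: a header line, a body line, and a repeat.
my @sample = (
    'From: Roger <roger@somewhere.com>, Roger 2 <roger2@nowhere.com>',
    'My new address is Peter.R@us.com, please update your list.',
    'Cc: roger@somewhere.com',
);

my %MailList;    # address => 1, filled in as new addresses are seen
for my $line (@sample) {
    for my $addr ( addresses_in_line($line) ) {
        $MailList{$addr} = 1 unless exists $MailList{$addr};
    }
}
print "$_\n" for sort keys %MailList;
```

    This only collects addresses; picking up the accompanying name reliably is better left to something like Mail::Address on the header lines.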

    Thanks,

    Peter

Re: Parsing Pegasus email boxes
by peterr (Scribe) on Nov 17, 2003 at 00:45 UTC
    Hi graff,

    Yes, most of the 'conversion' tools for Pegasus are for address book conversion, and none of the ones I previously looked for said "Pegasus --> Unix". However, following the link you supplied and doing a bit more research, I finally realised that Eudora mailboxes are Unix format, and there _are_ a few tools to go either way (Eudora<-->Pegasus), so that means if one converted a Pegasus mailbox to Eudora, we end up with a Unix mailbox. :)

    I didn't find anything useful about the Pegasus file format (but did find some indications that the format specs would not be open to public inspection).

    I've found the Pegasus discussion list very active in the past, and very helpful, and would be surprised if someone wouldn't know the layout. That said, I had my own look at them, as follows:

    1. The first 96 bytes contain:
       * Folder description, right filled with lowercase 'a'
       * Other folder/hierarchy info
       * More lowercase 'a' to pad to the right
    2. Then the format appears (to me) to be the same as Unix
    3. The message termination seems to be indicated by:
       * hex 0D 0A 0D 0A 1A
       * CR LF CR LF ->
    4. From '3', it is only the hex '1A' value that could indicate the msg terminator, because email bodies could have two CR/LF's anywhere.
    5. I don't know if any encoding would contain hex '1A' though, so looking for the 5 character string (in hex) would be safer, plus looking at the next line (record) in the mailbox, it would be something like "X-Apparently-To: "
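
    The layout guessed at in the list above could be turned into a splitter roughly like this. Both the 96-byte header and the five-byte terminator are observations from a few mailboxes, not a documented specification, so this is a sketch under those assumptions. The helper names split_pmail_folder and slurp_binary and the in-memory sample folder are made up for illustration.

```perl
use strict;
use warnings;

# Split the raw bytes of a folder file into messages, assuming a
# 96-byte folder header followed by messages that each end in
# CR LF CR LF 0x1A. The terminator itself is dropped from each message.
sub split_pmail_folder {
    my ($bytes) = @_;
    my $body = length($bytes) > 96 ? substr( $bytes, 96 ) : '';
    return grep { length } split /\r\n\r\n\x1a/, $body;
}

# Read a whole file in binary mode, so that CR LF pairs are not
# translated to a bare LF on Windows (which would hide the terminator).
sub slurp_binary {
    my ($path) = @_;
    open( my $fh, '<', $path ) or die "Cannot open $path: $!";
    binmode($fh);
    local $/;    # slurp mode: read everything in one go
    my $data = <$fh>;
    close($fh);
    return $data;
}

# Synthetic folder built in memory, for illustration only.
my $fake = ( 'a' x 96 )
         . "Subject: one\r\n\r\nbody one\r\n\r\n\x1a"
         . "Subject: two\r\n\r\nbody two\r\n\r\n\x1a";
my @messages = split_pmail_folder($fake);
print scalar(@messages), " messages\n";
```

    On a real mailbox the call would be split_pmail_folder( slurp_binary($file) ), after which each message can be treated like ordinary mail text.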

    Some other links I found on the conversion tools side are:

    http://www.andtechnologies.com/free.html
    http://www.interguru.com/MailInformation.htm
    http://www.dragon-it.co.uk/pegasus.htm

    You mentioned Mail::Address would be good, considering some dependencies. It's just that the Pegasus format is a _bit_ different, really just the first 96 bytes to ignore, and sort out the msg terminator. Hmm, my apologies, I just tried your sample code on a Pegasus mailbox and it dumped (printed) all the email names/addresses, thanks.

    I see what you mean by multi-token address, addresses that are of many formats, and ones that are split over two lines; no doubt this can be a problem. :)

    Peter

      So, if you are fairly confident about the character string that marks the end of each mail message in this file format, then the perl code that would step through a file one message at a time would be pretty simple:
          {
              open( IN, "pegasus.file" ) or die $!;
              local $/ = "\x1a";   # change the input record separator
              while (<IN>)         # $_ now contains one whole message,
              {                    # up to and including \x1a
                  # do what you want with the message
                  # including, if you like:
                  @lines = split /\n/;   # work on lines
                  # or you could strip off the pegasus header stuff
                  # and treat the remainder as if it were vanilla email
                  # (just like you'd get with common unix mail tools)
              }
          }   # record separator now reverts back to default
        Hi,

        Thanks, that code worked okay. I think I now need to do some serious reading up on working with arrays and strings. :)

        Peter

Re: Parsing Pegasus email boxes
by peterr (Scribe) on Nov 17, 2003 at 00:51 UTC
    Hi liz,

    Maybe it makes sense to apply your acquired knowledge on Pegasus mailboxes to Mail::Box, so you would be able to make use of all of the nice features of Mail::Box, and the rest of us would be able to use Pegasus mailboxes with Mail::Box as well.

    Yes, that's a good idea, I'm sure it would be a good extension for Mail::Box, possibly passing the mailbox "type" to indicate the mailbox is of Pegasus format.

    Peter

Re: Parsing Pegasus email boxes
by peterr (Scribe) on Nov 17, 2003 at 01:09 UTC
    Hi jonadab,

    You certainly have a good knowledge of the Pegasus product, and I see you use it also.

    There is not a large overlap between Perl users and pmail users, but there may be a few. I have used Pegasus Mail myself in the past, before I learned Gnus, and might be interested in collaborating with you on creating a parser for pmail folders.

    Possibly working with Mark Overmeer, in extending the Mail::Box module to include using Pegasus mailboxes.

    For just parsing, you can completely ignore the .pmi files; those are indices. (They might improve performance, but it's probably not worth figuring out a second file format just for that.)

    Yes, they are of no use really, outside of Pegasus (the product). Also, when you delete an email msg, as with many deletion methods, the data is only flagged as deleted and has not been physically removed. You need to 'rebuild' the mailbox, which physically removes the dead data and rebuilds the index.

    I have a number of pmail folders myself and, as I said, have some interest in working on a perl module for this. I am very unlikely to get around to doing it on my own, however.

    Possibly you would be interested in working on the contribution, if Mark wants to extend the module?

    As far as a formal specification of the format, I do not believe one has ever existed.

    In theory at least, I don't think it is _too_ hard to work out the layout, I checked some of it out (see my other reply)

    I suppose that David Harris is using the source code or comments in the source code as his documentation, and he is unwilling to let out the source without an NDA. (This also is why there is no Linux port. David is amenable in theory to the idea of someone doing such a port, but they would have to sign an NDA and meet other criteria. An unusual attitude for a freeware developer, perhaps, but Pegasus is of unusual quality, as well.

    Pegasus is a very good product, especially considering it has been around for so long as freeware. David only gets an 'income' from people who buy the manuals, so I don't know how he keeps 'bread on the table', so to speak. :)

    Thanks,

    Peter