tdp05 has asked for the wisdom of the Perl Monks concerning the following question:

I have about 6500 email messages in a Pine mailbox that I need to retrieve data from. From my readings so far, I'm working under the assumption that Pine uses the "mbox" format for mail (using pine v4.53 on a linux box). It stores all email from one mailbox in a flat text file. Please let me know if I'm wrong about the mbox format.

    Constraints and goals:
  1. Read all the messages and get what I want (see below for specifics)
  2. I will be looking at every single message in the mailbox
  3. I can safely ignore header information, except for Subject
  4. There will be no attachments
  5. If all goes well, I will be using this script once to collect the data, and that's it.
  6. I'm an Object-oriented noob (never had to use it yet)
I've been reading, and the sheer number of mbox and mail parsers available has me a little turned around. So I'm looking for advice before I begin writing the script.

I've looked at: this node. He asks the same question, but didn't give file specifics. I hope I pinpointed it above.

this node. Where the guy uses Mail::MboxParser to parse.

There was this: Mail::MboxParser::Mail::Body

Mail::Box: From the page, this looks like I could just do a foreach on every email. Which I think would work, but how slow is it going to be?

I also saw mention of just looking for paragraphs that start with
"^From <email address> <date>"
Should I do this, because it seems like it could be the most basic way of doing it?
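For reference, a minimal sketch of that "split on the From line" idea, assuming the file really is classic mbox, where each message begins with a line starting with "From " followed by a sender and a date. The `split_mbox` helper and the two-message sample are made up for illustration. A body line that happens to start with "From " would be split wrongly unless the mail software escaped it (commonly as ">From ").

```perl
use strict;
use warnings;

# Hypothetical helper: split mbox text into messages on "From " separator lines.
# Assumes every message starts with a line of the form "From <sender> <date>".
sub split_mbox {
    my ($fh) = @_;
    my (@messages, $current);
    while (my $line = <$fh>) {
        if ($line =~ /^From \S+ /) {
            # A separator line closes the previous message and opens a new one.
            push @messages, $current if defined $current;
            $current = '';
        }
        # Lines before the first separator are ignored.
        $current .= $line if defined $current;
    }
    push @messages, $current if defined $current;
    return @messages;
}

# Made-up two-message mailbox, held in a string instead of a real file.
my $sample = <<'END';
From backup@example.com Wed May  5 09:43:20 2004
Subject: Backup SUCCESS

Backup ID: 123

From backup@example.com Thu May  6 09:40:02 2004
Subject: Backup SUCCESS

Backup ID: 124
END

open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
my @messages = split_mbox($fh);
close $fh;

printf "%d messages\n", scalar @messages;
```

To run it against a real mailbox, open the mailbox file in place of the in-memory string.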

the mail will look like this:

    ...normal header info...
    Subject: Backup SUCCESS
    ..rest of header...

    Backup ID: 123
    Path: <path>
    Backup Type: one of 3 different types
    Size: <in Kbytes>
    ...some other junk I don't need...
    Start time: Wed May 5 09:36:40 2004
    End time: Wed May 5 09:43:19 2004
    Errors: 0
    some other stuff
    Elapsed time: 0 hr 6 mn 39 s

I'm grabbing the ID, the path, the type of backup, and the size, and I'm going to throw it all into a MySQL table. The ID is the primary key, but the size and elapsed time aren't stored anywhere after the email, so we want to grab them for comparisons and whatnot.

Given what I'm faced with, which way should I go?

Thanks.

Re: Parsing a Pine mailbox
by kvale (Monsignor) on May 06, 2004 at 08:00 UTC
    It seems to me that parsing into individual emails won't really buy you anything. A simple parse is best. Here I illustrate the idea for id, path and size:
    my %rec;
    my $pid;
    while (<>) {
        # Each "Backup ID" line starts a new record; later fields attach to it.
        $pid = $1 if /^Backup ID: (\d+)$/;
        $rec{$pid}{path} = $1 if /^Path: (\S+)$/;
        $rec{$pid}{size} = $1 if /^Size: (\d+)$/;
    }
    Because the attributes always come after the primary id, little bookkeeping is required to populate a hash, or a database. It isn't clear to me how the subject enters the problem, so I left that out.
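A slightly fuller sketch of the same line-by-line idea, extended to the backup type and elapsed time that the question also mentions. It assumes each field sits on its own line exactly as in the sample message, that the size is a plain integer, and that elapsed time always reads "N hr N mn N s". The `parse_backups` and `elapsed_seconds` names, the hash keys, and the sample path, type and size values are made up for illustration.

```perl
use strict;
use warnings;

# Convert "0 hr 6 mn 39 s" into seconds; returns undef if the text does not match.
sub elapsed_seconds {
    my ($text) = @_;
    return undef unless $text =~ /^(\d+) hr (\d+) mn (\d+) s$/;
    return $1 * 3600 + $2 * 60 + $3;
}

# Read lines from a filehandle and build a hash of records keyed by backup ID.
sub parse_backups {
    my ($fh) = @_;
    my (%rec, $pid);
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^Backup ID: (\d+)$/) { $pid = $1; next; }
        next unless defined $pid;    # skip everything before the first ID
        if    ($line =~ /^Path: (\S+)$/)        { $rec{$pid}{path}    = $1 }
        elsif ($line =~ /^Backup Type: (.+)$/)  { $rec{$pid}{type}    = $1 }
        elsif ($line =~ /^Size: (\d+)$/)        { $rec{$pid}{size_kb} = $1 }
        elsif ($line =~ /^Elapsed time: (.+)$/) { $rec{$pid}{elapsed} = elapsed_seconds($1) }
    }
    return \%rec;
}

# Made-up message body following the layout given in the question.
my $sample = <<'END';
Subject: Backup SUCCESS

Backup ID: 123
Path: /export/home
Backup Type: full
Size: 20480
Start time: Wed May  5 09:36:40 2004
End time: Wed May  5 09:43:19 2004
Errors: 0
Elapsed time: 0 hr 6 mn 39 s
END

open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
my $rec = parse_backups($fh);
close $fh;

for my $id (sort { $a <=> $b } keys %$rec) {
    my $r = $rec->{$id};
    print join("\t", $id, @{$r}{qw(path type size_kb elapsed)}), "\n";
}
```

Storing elapsed time as seconds keeps later comparisons simple. Skipping lines until the first ID is seen avoids creating a record under an undefined key.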

    -Mark

      Yeah, I guess you're right. I think anything in the subject will also be displayed in the body of the message, and I'll be able to grab it from there.

      I didn't think to look at the most basic way to do it. I figured I'd have to look at something much more complex.

      I'll probably give this a shot and see how it goes.

Re: Parsing a Pine mailbox
by matija (Priest) on May 06, 2004 at 08:26 UTC
    Which I think would work, but how slow is it going to be?

    Not slow enough to matter. If you're only going to be running this script once, it doesn't matter if it takes an hour to complete. Write it in the way that saves you the most time.

    No need to worry about details of efficiency unless you're working on something that is going to be used often. Remember that your time is worth more than the computer's.

Re: Parsing a Pine mailbox
by jaldhar (Vicar) on May 06, 2004 at 14:14 UTC

    You mentioned being confused by the large number of mail parsing modules available. For maximum compatibility I would advise using the Mail::Cclient module from CPAN, as it is just a thin Perl wrapper around the exact same C library Pine uses to parse mail.

    Although Pine is most likely using mbox folders, it could be using one of the several other formats c-client supports, which are also stored as flat files.

    Also, I have used Mail::Box and it is quite fast. It is, however, IIRC pure Perl, so there is a chance that Mail::Cclient, being based on C, is faster. But I haven't done any benchmarks.

    --
    જલધર