cookiez has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks

I am so new to Perl, so please bear with me!

I have some mailboxes that store all messages inside a single file! Each message begins with...

########## From name@domain.com time_sent ##########

There are other lines inside each message that start with 'From'!

Example....

mail header...

########## From: name@domain ##########

The other 'From(s)' that are inside a message body and start on a new line are pushed over by a single space ('\s'),
so that each message's starting and ending point can always be found!

example... (From contained in a message body)

######### (space)From #########

So, knowing that, what would be the best way to split each message! I am only starting to learn Perl, so I am
confused about what would be best to use to handle this! A friend said to use index() and then store each position in
an array! I tried that in a for() loop, but with over 50,000 messages in some mailboxes, making two index() calls and
then a single substr() to grab each message did not seem like the best way!

Thank You

Cookie

Replies are listed 'Best First'.
Re: Mailbox spliting and storing each message!
by mpeppler (Vicar) on Dec 12, 2004 at 19:28 UTC
      Hi, I was told that Mail::Box does not handle this type of mailbox, and that I would first have to split each email and put
      each one in its own file before I could use it! Is this not correct? If it isn't, I will take your advice...

      Thanks
      Cookie
        If I read your original post correctly you have a file in Unix "mbox" format, where each message starts with something like:

            From MAILER-DAEMON@gw.peppler.org Thu Sep 16 17:29:10 2004
            Return-Path: <MAILER-DAEMON@gw.peppler.org>
            Received: from ntexchsrv.ilstmd.ins (ntexchsrv.ismis.com [12.2.26.87])
                by gw.peppler.org (8.12.10/8.12.10) with ESMTP id i8GLT9n6004156
                for <junk@peppler.org>; Thu, 16 Sep 2004 17:29:09 -0400
            From: postmaster@ismie.com
            To: junk@peppler.org
            Date: Thu, 16 Sep 2004 16:29:45 -0500
            ... rest of headers ...

        AFAIK this is what Mail::Box::Mbox and Mail::MboxParser are built to handle.

        You can also build your own parser, of course, but you have to be careful not to get tripped up by any "From " strings that might be embedded in messages...
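
A minimal sketch of such a hand-rolled splitter, assuming the format described in the original post: a separator line starts with "From " (capital F, trailing space), "From:" is an ordinary header, and body lines that begin with "From" have been pushed over by one space. The sub name `split_mbox` and the sample addresses are made up for illustration, and the sample is read from an in-memory string rather than a real mailbox.

```perl
use strict;
use warnings;

# Split mbox-style input into messages. A new message starts only at a line
# beginning with "From " (note the trailing space); "From:" headers and
# space-escaped " From" body lines never match.
sub split_mbox {
    my ($fh) = @_;
    my (@messages, $current);
    while (my $line = <$fh>) {
        if ($line =~ /^From /) {
            push @messages, $current if defined $current;
            $current = '';
        }
        # Anything before the first separator line is ignored.
        $current .= $line if defined $current;
    }
    push @messages, $current if defined $current;
    return @messages;
}

# Hypothetical sample data, kept in memory so the sketch runs on its own.
my $sample = join '',
    "From a\@example.com Thu Sep 16 17:29:10 2004\n",
    "From: a\@example.com\n",
    "\n",
    "First body\n",
    " From here on this is still the first message\n",
    "From b\@example.com Thu Sep 16 17:30:00 2004\n",
    "From: b\@example.com\n",
    "\n",
    "Second body\n";

open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
my @messages = split_mbox($fh);
close $fh;

printf "%d messages\n", scalar @messages;
```

This keeps every message in memory at once, which is fine for a sketch but may not suit a mailbox with 50,000 messages; calling a handler on each message instead of pushing it onto `@messages` would avoid that.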

        Michael

Re: Mailbox spliting and storing each message!
by TedPride (Priest) on Dec 12, 2004 at 21:03 UTC
    I'm assuming you don't want to load the entire mailbox into memory at once, since spam can make even a few days of email pretty large. The following is a line by line interpreter of the format you gave, with test data:
        use strict;
        use warnings;

        my $c = 0;
        my $text = <DATA>;
        while (<DATA>) {
            # 'From ' (with the trailing space) marks a new message;
            # a 'From:' header line must not trigger a split.
            if (index($_, 'From ') == 0) {
                process($text);
                $text = $_;
            }
            else {
                $text .= $_;
            }
        }
        process($text);
        print "\n$c messages total\n";

        sub process {
            # Do whatever you do to each message...
            $c++;
            print $_[0], "\n";
        }

        __DATA__
        From ???@??? Wed Feb 02 11:47:05 2000
        Message-ID: <38968A8A.DBB01A7@earthlink.net>
        From: Someone <beepbeep@earthlink.net>
        X-Mailer: Mozilla 4.6 [en] (Win98; U)

        Message 1
        From ???@??? Wed Feb 02 11:47:05 2000
        Message-ID: <38968A8A.DBB01A7@earthlink.net>
        From: Someone <beepbeep@earthlink.net>
        X-Mailer: Mozilla 4.6 [en] (Win98; U)

        Message 2
    Rather kludgy, but it works. Just replace DATA with whatever filehandle has the email file.
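
For comparison, here is a sketch of a different way to stream the mailbox one message at a time: setting Perl's input record separator `$/` to "\nFrom " so that each read returns a whole message. This is a swapped-in technique, not the code from the reply above. It assumes the file begins with a "From " separator line and that body lines starting with "From" are space-escaped as the original post describes. The sub name `each_message` and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Call $callback once per message, reading one message-sized record at a time.
sub each_message {
    my ($fh, $callback) = @_;
    local $/ = "\nFrom ";    # a record ends where the next separator line begins
    my $first = 1;
    while (my $record = <$fh>) {
        # chomp strips a trailing "\nFrom " and returns how many characters it removed.
        my $had_separator = chomp $record;
        $record .= "\n" if $had_separator;         # restore the newline chomp took
        $record = "From $record" unless $first;    # restore the prefix the previous read consumed
        $first = 0;
        $callback->($record);
    }
    return;
}

# Hypothetical sample data, kept in memory so the sketch runs on its own.
my $sample = join '',
    "From a\@example.com Wed Feb 02 11:47:05 2000\n",
    "From: Someone <a\@example.com>\n",
    "\n",
    "Message 1\n",
    "From b\@example.com Wed Feb 02 11:48:00 2000\n",
    "From: Someone <b\@example.com>\n",
    "\n",
    "Message 2\n";

open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
my @seen;
each_message($fh, sub { push @seen, $_[0] });
close $fh;

print scalar(@seen), " messages total\n";
```

To use it on a real mailbox, open the file for reading and pass that filehandle instead of the in-memory one. Note that `$/` stays localized while the callback runs, so a callback that reads lines from another filehandle would need to reset it.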
Re: Mailbox spliting and storing each message!
by nedals (Deacon) on Dec 13, 2004 at 01:37 UTC
    If you need to save all the separated data, here's a technique using an 'array of arrays'.
    If you're OK with case-sensitive data, then you could use the index method as shown in TedPride's post.
        use strict;
        my @mail_item = ();
        my $i = -1;
        ## open FIL ......
        ## while <FIL> {....
        while (<DATA>) {
            chomp $_;
            if (/^from/i) { $i++; }
            push @{ $mail_item[$i] }, $_;
        }
        ## close FIL;
        use Data::Dumper;
        print Dumper(\@mail_item);
        __DATA__
        From: name1@domain.com time_sent1
        From: name1@domain.com
        some content1
        From: name2@domain.com time_sent2
        From: name2@domain.com
        some content2
        From: name3@domain.com time_sent3
        From: name3@domain.com
        some content3
        From: name4@domain.com time_sent4
        From: name4@domain.com
        some content4
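
If holding every line of a 50,000-message mailbox in memory is a concern, one possible variation on the "store each position in an array" idea from the original question is to record only the byte offset of each separator line and fetch a message on demand with seek() and read(). The sketch below assumes separator lines start with "From " (with the trailing space), as in the original post; the sub names `message_offsets` and `read_message` and the sample data are hypothetical.

```perl
use strict;
use warnings;

# One pass over the file: remember where each "From " separator line starts.
# Returns a reference to the offset list and the end-of-file offset.
sub message_offsets {
    my ($fh) = @_;
    my @offsets;
    my $pos = tell $fh;
    while (my $line = <$fh>) {
        push @offsets, $pos if $line =~ /^From /;
        $pos = tell $fh;
    }
    return (\@offsets, $pos);
}

# Fetch message number $n (0-based) using the offsets gathered above.
sub read_message {
    my ($fh, $offsets, $end, $n) = @_;
    my $start = $offsets->[$n];
    my $stop  = $n < $#$offsets ? $offsets->[ $n + 1 ] : $end;
    seek($fh, $start, 0) or die "seek failed: $!";
    my $got = read($fh, my $text, $stop - $start);
    die "read failed: $!" unless defined $got;
    return $text;
}

# Hypothetical sample data, kept in memory so the sketch runs on its own.
my $sample = join '',
    "From name1\@domain.com time_sent1\n",
    "From: name1\@domain.com\n",
    "some content1\n",
    "From name2\@domain.com time_sent2\n",
    "From: name2\@domain.com\n",
    "some content2\n";

open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
my ($offsets, $end) = message_offsets($fh);
my $second = read_message($fh, $offsets, $end, 1);
close $fh;

printf "%d messages indexed\n", scalar @$offsets;
print $second;
```

The offset list costs one number per message, so it stays small even for very large mailboxes, at the price of a seek for every message that is later read.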
Re: Mailbox spliting and storing each message!
by DrHyde (Prior) on Dec 13, 2004 at 09:35 UTC
    I count nine exclamation marks in your post. Is having a friend recommend using index() and an array really that exciting?