johnfoobar has asked for the wisdom of the Perl Monks concerning the following question:

here's what i've got right now:
{
    local $/ = undef;
    $_ = <>;
}
my @mails = split /\nFrom .*\n/m;
however, this means i have to read the entire file into memory, and my mbox files are big (>30MB). i would like to process the file line-by-line to get the same @mails array, so i don't have to read the whole damn thing into a scalar. how should i go about doing this?

i have a feeling i need to put the split in a loop, but i don't know where to begin. i'm a regexp newbie which doesn't help either.

(BTW i am aware of the Mail::Folder::Mbox and Mail::Box::Mbox modules on CPAN but they're overkill for what i want.)

Re: stream parsing an mbox file
by repson (Chaplain) on Apr 12, 2001 at 15:11 UTC
    To repeat the behaviour of your code I'd do:
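    my @mails;
    while (<>) {
        if (/^From /) {
            # a "From " separator line starts a new array element
            $mails[$#mails + 1] = $_;
        } else {
            # any other line is appended to the current message
            $mails[$#mails] .= $_;
        }
    }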
    It seems to me from previous experimentation that any lines in the body text starting with 'From ' will already be quoted with a > character, so that they will not cause the test I used to pass and will not therefore create a new array element.
      that's exactly what i was looking for.

      took me a while to spot the ".=" in the else, i thought it was an "=" for a while. :)

      PS, how did you see this post? i can't see it at The Monastery Gates, or in SOPW.. i had to do a search to see my own post. i must be doing something wrong.

        Check the Newest Nodes
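A minimal sketch of repson's loop wrapped in a sub, assuming the mbox arrives on an already-open filehandle. The name split_mbox and the sample data are hypothetical, not from the thread. The one change is a guard: lines seen before the first "From " line are skipped, since appending to the last element of an empty array is a fatal error in Perl.

```perl
use strict;
use warnings;

# Hypothetical helper: read an mbox from a filehandle and return one
# string per message, each starting with its "From " separator line.
sub split_mbox {
    my ($fh) = @_;
    my @mails;
    while (my $line = <$fh>) {
        if ($line =~ /^From /) {
            push @mails, $line;      # separator line starts a new message
        }
        elsif (@mails) {
            $mails[-1] .= $line;     # append to the current message
        }
        # Lines before the first "From " line are skipped.
    }
    return @mails;
}

# Usage with an in-memory filehandle standing in for a real mbox file.
my $sample = "From a\@example.com Thu Apr 12 15:00:00 2001\n"
           . "Subject: one\n\n>From here on, this line is quoted.\n"
           . "From b\@example.com Thu Apr 12 15:05:00 2001\n"
           . "Subject: two\n\nbody\n";
open(my $fh, '<', \$sample) or die "Can't open in-memory file: $!";
my @mails = split_mbox($fh);
close($fh);
print scalar(@mails), " messages\n";
```

Unlike the original split, each element keeps its "From " line, and the ">From" body line stays inside the first message.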
Re: stream parsing an mbox file
by physi (Friar) on Apr 12, 2001 at 15:02 UTC
    You should have a look at the .. operator.
    while (<FILE>) {
        if (/From/ .. /From/) {
            print $_;
        }
    }
    The problem will be that you're going into the if-part even when someone mentions the word "From" in his email :-)
    ----------------------------------- --the good, the bad and the physi-- -----------------------------------
      so the if, in english, says: "print $_ if it's between these two things", right?

      the only problem i have with this is it's not adding the lines to a scalar, one for each message. the .. operator does look interesting though.

      i don't have a problem with quoting "From", it gets escaped with a ">" in most mbox files i've seen, so you have ">From" on such lines.
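A small sketch of how the range operator behaves on mbox-like input, using made-up sample lines and anchored patterns (both are assumptions, not from the thread). With two dots, the right-hand test runs on the same line that switched the range on, so when both sides match the same line the range opens and closes at once. With three dots, the right-hand test waits until the next line.

```perl
use strict;
use warnings;

my @lines = (
    "From alice Thu Apr 12 15:00:00 2001\n",
    "Subject: one\n",
    "body of one\n",
    "From bob Thu Apr 12 15:05:00 2001\n",
    "Subject: two\n",
);

my (@two_dot, @three_dot);

# Two dots: each "From " line is a complete one-line range.
for (@lines) {
    push @two_dot, $_ if /^From / .. /^From /;
}

# Three dots: the range runs from one "From " line through the next,
# then stays off until another "From " line appears.
for (@lines) {
    push @three_dot, $_ if /^From / ... /^From /;
}

print scalar(@two_dot), " lines with '..', ",
      scalar(@three_dot), " lines with '...'\n";
```

Neither form yields one scalar per message on its own: the three-dot range also swallows the next message's "From " line and then skips that message's body.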

Re: stream parsing an mbox file
by jink (Scribe) on Apr 12, 2001 at 15:02 UTC
    This is what i came up with:
    #!/usr/local/bin/perl5 -w
    use strict;

    my $mbox = $ARGV[0] || die("Usage $0 <mailbox>\n");
    open(MBOX, "<$mbox") || die("Can't open mailbox $mbox! ($!)\n");
    my @mails = grep(/^From /, <MBOX>);
    close(MBOX);
    print @mails;
Re: stream parsing an mbox file
by davorg (Chancellor) on Apr 12, 2001 at 15:59 UTC

    You can probably do something by setting $/ to an appropriate value - perhaps "\nFrom ".

    Actually, I think I tried something like this once before. Problem is that it leaves "\nFrom " on the end of each mail and removes it from the start (except for the first one) so you have to munge them a bit. Something like this perhaps:

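    my @mails;
    {
        local $/ = "\nFrom ";
        @mails = <>;
        chomp(@mails);    # strips the trailing "\nFrom " separator
        # put the separator back on the front of every mail but the first
        $mails[$_] = $/ . $mails[$_] foreach 1 .. $#mails;
    }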
    --
    <http://www.dave.org.uk>

    "Perl makes the fun jobs fun
    and the boring jobs bearable" - me
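A hedged sketch of the same $/ idea applied one record at a time, so that only a single message is held in memory. The name each_mail, the callback interface, and the sample data are assumptions for illustration. It restores the "\n" that chomp takes from the end of a message and the "From " that is lost from the start of every message after the first.

```perl
use strict;
use warnings;

# Hypothetical helper: call $handler once per message and return the
# number of messages seen. Note that $/ stays localized while the
# handler runs.
sub each_mail {
    my ($fh, $handler) = @_;
    local $/ = "\nFrom ";
    my $count = 0;
    while (my $mail = <$fh>) {
        # chomp strips "\nFrom "; the newline belongs to this mail.
        $mail .= "\n" if chomp($mail);
        # Every record after the first lost its leading "From ".
        $mail = "From $mail" if $count++;
        $handler->($mail);
    }
    return $count;
}

# Usage with an in-memory filehandle standing in for a real mbox file.
my $sample = "From a\@example.com Thu Apr 12 15:00:00 2001\n"
           . "Subject: one\n\nfirst body\n"
           . "From b\@example.com Thu Apr 12 15:05:00 2001\n"
           . "Subject: two\n\nsecond body\n";
open(my $fh, '<', \$sample) or die "Can't open in-memory file: $!";
my @seen;
my $n = each_mail($fh, sub { push @seen, $_[0] });
close($fh);
print "$n messages\n";
```

The usage collects the messages into an array only to show the result; a handler that processes and discards each message keeps memory use to one message.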

Re: stream parsing an mbox file
by sierrathedog04 (Hermit) on Apr 12, 2001 at 23:40 UTC
    There is a contention issue here. This file is large. What if the mail program writes to the mailbox before the program is finished processing?

    At first I thought, no problem, just use flock. But then I read what Camel III says about flock:

    Despite the suggestive sound of "exclusive", processes aren't required to obey locks on files. That is, flock only implements advisory locking, which means that locking a file does not stop another process from reading or even writing the file.
    Contention is a major problem, because the time to process a 30MB file is not trivial, and in the meantime more mail is on its way at any moment. I question whether a simple program can solve the problem, so maybe Mail::Folder::Mbox or Mail::Box::Mbox are needed after all.

    It's the classic Hubris (I can write this myself) vs. Laziness (I don't want to reinvent the wheel) problem. I believe that Laziness should trump Hubris, and in a real world setting it would thus be better to use the CPAN modules.
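For reference, a minimal sketch of taking a shared flock while reading, using Perl's built-in flock and the Fcntl constants. The helper name with_shared_lock and the temporary file are hypothetical. As the quoted passage says, the lock is advisory: it only helps if the program delivering the mail also uses flock on the same file, which is an assumption that has to be checked for the mail system in question.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);
use File::Temp qw(tempfile);

# Hypothetical helper: open $path, hold a shared advisory lock while
# $code runs on the filehandle, then unlock and close.
sub with_shared_lock {
    my ($path, $code) = @_;
    open(my $fh, '<', $path) or die "Can't open $path: $!";
    flock($fh, LOCK_SH) or die "Can't lock $path: $!";
    my @result = $code->($fh);
    flock($fh, LOCK_UN);
    close($fh);
    return @result;
}

# Usage on a temporary file standing in for a real mailbox.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
print {$tmp_fh} "From a\@example.com Thu Apr 12 15:00:00 2001\nbody\n";
close($tmp_fh);

my ($from_lines) = with_shared_lock($tmp_path, sub {
    my ($fh) = @_;
    my $n = 0;
    while (my $line = <$fh>) {
        $n++ if $line =~ /^From /;
    }
    return $n;
});
print "$from_lines From line(s)\n";
```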