monger has asked for the wisdom of the Perl Monks concerning the following question:

I've accumulated an mbox file from a mailing list that has grown to about half a gigabyte. Needless to say, I'm having trouble getting Thunderbird, or any other mbox mail client, to read the file and, if necessary, create an index for it. So my plan is to split it into more manageable parts, as a first try by year. I did some digging around the Monastery's dusty corners and stumbled upon this thread, which gave me a start. My code is based on this snippet. Here's my modified code:

#!/usr/bin/perl -w
my $file = "./oldcoffee.tmp";
my $out = ">./coffee.tmp";
my @mails;
open OUT, $out || die "Can't open $out!\n";
open FILE, $file || die "Can't open file oldcoffee!\n";
while (<FILE>) {
    if (/^From /) {
        $mails[$#mails + 1] = $_;
    } else {
        $mails[$#mails] .= $_;
    }
    print $OUT, @mails;
}
close FILE, $file || die "Can't close file oldcoffee!\n";
close OUT, $out || die "Can't close $out!\n";

What I want to do is modify the pattern matching in the if clause to add the year. Here's an example ^From line:

From - Tue Jan 14 17:43:12 2003

I've looked at constructing a regex to account for the day, date and time, then iterate through for each year, 2003-2006. The regex got to be a bit cumbersome, so I tried this:

if (/^From - \+ 2003/)
With this, I receive the following error:

Modification of non-creatable array value attempted, subscript -1 at ./mbox_orig.pl line 13, <FILE> line 1.

Any suggestions on how to tackle this? Thanks, monger

monger |------------------------| munging perl on the side

Re: Spliting an mbox file
by gellyfish (Monsignor) on Aug 15, 2006 at 18:34 UTC

    I'd agree with Fletch to some extent but I would recommend Email::Folder to read the mailbox and Email::LocalDelivery to 'deliver' it to the new mailbox:

    use strict;
    use warnings;

    use Email::Folder;
    use Email::LocalDelivery;

    my $count = 0;
    my $box_size = 100;

    my $boxbase = '/home/jonathan/tmp/testmail/spam';

    my $box = Email::Folder->new('/home/jonathan/mail/spam');

    while ( my $mail = $box->next_message ) {
        my $newbox = $boxbase . int($count++ / $box_size);
        Email::LocalDelivery->deliver($mail->as_string(),($newbox));
    }

    /J\

Re: Spliting an mbox file
by Fletch (Bishop) on Aug 15, 2006 at 18:15 UTC

    Simple answer: don't reinvent the wheel, use someone else's wheel that works.

    Update: As for your error message: your array is initially empty, so $#array is -1, and index -1 (the last element) of an empty array isn't valid. See perldiag. That means your file doesn't start with a 'From ' line, or at least not one that matches the regexp you're trying to use.
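
The failure is easy to reproduce in isolation. A minimal sketch, assuming nothing beyond core Perl; the array name `@mails` is borrowed from the question and the sample lines are made up:

```perl
use strict;
use warnings;

my @mails;    # empty, as it is before any "From " line has matched
print "last index: ", $#mails, "\n";

# Appending to the "last element" of an empty array means writing to
# index -1, which Perl cannot create, so the statement dies.
my $ok  = eval { $mails[$#mails] .= "a body line\n"; 1 };
my $err = $ok ? '' : $@;
print "error: $err" if $err;

# Once an element exists, the same statement works.
push @mails, "From - Tue Jan 14 17:43:12 2003\n";
$mails[$#mails] .= "a body line\n";
print scalar(@mails), " message(s) stored\n";
```

Wrapping the statement in `eval` is only there so the demonstration can continue after the error; the original script dies at that point.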

Re: Spliting an mbox file
by rodion (Chaplain) on Aug 15, 2006 at 18:52 UTC
    I'm not so sure Fletch is right about the file not starting with a "From " line; I think it's just not starting with a "From ... 2003" line. The way I read what you're doing, you're using something like
    while (<FILE>) {
        if (/^From .+ 2003/) {
            $mails[$#mails + 1] = $_;
        } else {
            $mails[$#mails] .= $_;
        }
        print $OUT, @mails;
    }
    The else branch tries to append lines to an array element that the "From " test has not yet created, because the loop hasn't reached the 2003 emails you want. Try something like
    while (<FILE>) {
        if (/^From /) {
            if (/2003/ || $mailsStarted) {
                $mails[$#mails + 1] = $_;
                $mailsStarted = 1;
            }
        } elsif ( $mailsStarted ) {
            $mails[$#mails] .= $_;
        }
        print $OUT, @mails;
    }

    Note that you will get some non-2003 emails that come after the first 2003 email, but that should be ok.
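
If the stray later messages do matter, one option is to re-decide at every "From " line instead of latching once. A minimal sketch, assuming the year is the last field of the separator line as in the example from the question; `$wanted_year`, `$keep`, and the inline sample data are made up for illustration:

```perl
use strict;
use warnings;

my $wanted_year = 2003;
my @mails;       # one element per kept message
my $keep = 0;    # true while the current message is from $wanted_year

# Tiny stand-in for the real mbox file.
my $mbox = <<'END';
From - Mon Dec 30 09:00:00 2002
Subject: old

too early
From - Tue Jan 14 17:43:12 2003
Subject: wanted

keep me
From - Fri Jan 02 08:15:00 2004
Subject: new

too late
END

open my $in, '<', \$mbox or die "Can't read sample data: $!\n";
while ( my $line = <$in> ) {
    if ( $line =~ /^From / ) {
        # Every separator line resets the decision for the message it starts.
        $keep = ( $line =~ /(\d{4})$/ && $1 == $wanted_year ) ? 1 : 0;
        push @mails, $line if $keep;
    }
    elsif ($keep) {
        # Body and header lines go to the most recently kept message.
        $mails[-1] .= $line;
    }
}
close $in;

print @mails;
```

Because `$keep` is cleared whenever a separator from another year appears, lines are only ever appended after a message has been pushed, so the empty-array error cannot occur.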

Re: Spliting an mbox file
by jwkrahn (Abbot) on Aug 15, 2006 at 22:43 UTC
    my $out = ">./coffee.tmp";
    Does your file name have a '>' character in it? You should probably be using the three argument form of open instead.
    open OUT, $out || die "Can't open $out!\n";
    open FILE, $file || die "Can't open file oldcoffee!\n";
    close FILE, $file || die "Can't close file oldcoffee!\n";
    close OUT, $out || die "Can't close $out!\n";
    Because the high precedence || operator binds more tightly than the comma, each of these parses as open OUT, ($out || die ...), so die only runs if the variable itself is false and the dies are superfluous. You should use the low precedence or operator instead. close accepts one argument, the filehandle. You should also include the $! variable in the error messages so you know why the program failed.
    $mails[$#mails + 1] = $_;
    Is better written as:
    push @mails, $_;
    If you want to split the mbox file into separate files based on the year in the /^From/ line then you probably want something like this:
    #!/usr/bin/perl
    use warnings;
    use strict;

    my $file = 'oldcoffee.tmp';
    my $out = 'coffee.tmp';

    open FILE, '<', $file or die "Can't open file $file! $!\n";

    while ( <FILE> ) {
        if ( /^From / && /(\d{4})$/ ) {
            # Update: changed 'readonly' to 'append'
            open OUT, '>>', "$1$out" or die "Can't open $1$out! $!\n";
        }
        fileno OUT and print OUT;
    }

    close FILE or die "Can't close file $file! $!\n";

    __END__
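
A variation on the same idea, offered only as a sketch: keep one lexical filehandle per year in a hash, so each output file is opened once rather than on every message. The hash `%out_for`, the scratch directory, and the inline sample data are assumptions made for the demonstration; the output names follow the year-plus-`coffee.tmp` pattern used above.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Stand-in input; in practice this would be the real mbox file.
my $mbox = <<'END';
From - Tue Jan 14 17:43:12 2003
first 2003 message
From - Sat Mar 06 10:00:00 2004
only 2004 message
From - Wed Dec 31 23:59:59 2003
second 2003 message
END

my $dir = tempdir( CLEANUP => 1 );    # scratch directory for the demo
my %out_for;                          # year => open filehandle
my $current;                          # handle for the message being copied

open my $in, '<', \$mbox or die "Can't read sample data: $!\n";
while ( my $line = <$in> ) {
    if ( $line =~ /^From / && $line =~ /(\d{4})$/ ) {
        my $year = $1;
        # Open each year's file once and reuse the handle afterwards.
        unless ( $out_for{$year} ) {
            open $out_for{$year}, '>>', "$dir/${year}coffee.tmp"
                or die "Can't open $dir/${year}coffee.tmp! $!\n";
        }
        $current = $out_for{$year};
    }
    # Lines before the first separator have no destination and are skipped.
    print {$current} $line if $current;
}
close $in;

for my $year ( sort keys %out_for ) {
    close( $out_for{$year} ) or die "Can't close file for $year! $!\n";
}
```

With only a handful of years the number of simultaneously open files stays small; for a split on a finer key (say, per day) the reopen-in-append approach above may be the safer choice.
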
      open OUT, $out || die "Can't open $out!\n";
      Using the high precedence || operator means that the dies are superfluous. You should use the low precedence or operator instead.

      Or, use parentheses if one wants to use '||' ...

      open( OUT, $out ) || die "Can't open $out: $!\n";
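
The difference between the three spellings can be checked without touching any real mail. A minimal sketch, using the three argument form of open and a path that is assumed not to exist (`/nonexistent/dir/file` is made up for the demonstration):

```perl
use strict;
use warnings;

my $file = '/nonexistent/dir/file';    # assumed not to exist

# High-precedence || binds to $file, not to open: this is
# open(my $fh, '<', ($file || die ...)). $file is a true string, so die
# is never evaluated and the failed open goes unnoticed.
my $r1 = eval { open my $fh, '<', $file || die "Can't open $file: $!\n"; 1 };
my $e1 = $r1 ? '' : $@;

# Low-precedence or applies to the result of the whole open call.
my $r2 = eval { open my $fh, '<', $file or die "Can't open $file: $!\n"; 1 };
my $e2 = $r2 ? '' : $@;

# Parentheses around the arguments make || apply to open's result too.
my $r3 = eval { open( my $fh, '<', $file ) || die "Can't open $file: $!\n"; 1 };
my $e3 = $r3 ? '' : $@;

print "|| without parens: ", ( $r1 ? "no error reported\n" : "died: $e1" );
print "or:                ", ( $r2 ? "no error reported\n" : "died: $e2" );
print "|| with parens:    ", ( $r3 ? "no error reported\n" : "died: $e3" );
```

Only the first form stays silent about the failed open; the other two die with a message that includes `$!`.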