BrentD has asked for the wisdom of the Perl Monks concerning the following question:

I have the following code that is supposed to take mail from a Thunderbird MBOX store and upload it to an IMAP server. Yes, I know Thunderbird can theoretically do this on its own, but it is not working reliably for many of my users, so I need to do it manually. The following code works, up to a point. On very large mail stores, it eventually runs out of memory, even on a machine with 32 GB of RAM. Any help solving the out-of-memory issue would be greatly appreciated.
#!/usr/bin/perl
use strict;
use warnings;
use Mail::Mbox::MessageParser;
use Net::IMAP::Client;
use FileHandle;          # provides FileHandle->new, used in process_mbox
use File::Glob;
use IO::Socket::SSL;
use Getopt::Long;
use File::Basename;
use Date::Parse;

my $imap_folder   = "Imported";
my $source_folder = ".\\";
my $imap_user     = 'example';
my $imap_pwd      = 'examplePW';
my $imap_server   = 'mail.example.com';

GetOptions ('imap_folder|i=s'   => \$imap_folder,
            'source_folder|s=s' => \$source_folder,
            'user|u=s'          => \$imap_user,
            'password|p=s'      => \$imap_pwd);

if (! $imap_pwd || ! $imap_user) {
    print "\nMust specify --user & --password (or -u & -p)\n\n";
}

Mail::Mbox::MessageParser::SETUP_CACHE( { 'file_name' => 'imapcache.dat' } );

#$File::Glob::Windows::sorttype = 1;
#$File::Glob::Windows::nocase = 1;
#$File::Glob::Windows::encoding = File::Glob::Windows::getCodePage();
#print "Encoding: $File::Glob::Windows::encoding";

# strptime returns a 0-based month, so index 0 must be "Jan".
my @Months = ("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec");

my $imap = Net::IMAP::Client->new(
    user   => $imap_user,
    pass   => $imap_pwd,
    server => $imap_server,
    port   => 143,
) || die "Cannot connect ($!)\n";

$imap->login or die('Login failed: ' . $imap->last_error);

#print "Folders:\n";
#foreach my $folder (@folders){
#    print "- $folder\n";
#}

imap_make_folder($imap_folder);

my @folders = $imap->folders;
print "Folders:\n";
foreach my $folder (@folders){
    print "- $folder\n";
}
#print "- ", $_, "\n" for @flds;
print "\n";

$imap->select($imap_folder) || die "Unable to select \"$imap_folder\":". $imap->last_error;
my $current_folder = $imap_folder;

scan_folder($source_folder);

# Walk one directory and hand every plausible mbox file to process_mbox.
sub scan_folder {
    my $folder = shift @_;
    if (! ($folder =~ m/\\$/) ) {
        $folder .= "\\";
    }
    my @boxes = glob($folder."*");
    foreach my $box (@boxes) {
        if ( ($box =~ /\.msf$/) || ($box =~ /\.$/) || ($box =~ /\.\w{3}$/) ) {
            next;
        } elsif (-f $box) {
            process_mbox ($box);
        }
    }
}

# Create the matching IMAP folder, recurse into "<mbox>.sbd" if present,
# then append every message of the mbox file to that folder.
sub process_mbox {
    my $mbox_file   = shift @_;
    my $imap_folder = basename($mbox_file);

    imap_make_folder($current_folder."/".$imap_folder);
    # Lexical, so that nested calls made via scan_folder cannot overwrite it.
    my $prev_folder = $current_folder;
    $current_folder .= "/".$imap_folder;
    $imap->select($current_folder) || die "Unable to select \"$current_folder\":". $imap->last_error;

    if (-d $mbox_file.".sbd") {
        scan_folder($mbox_file.".sbd");
    }

    my $mbox_fh = new FileHandle($mbox_file);
    my $mbox_reader = new Mail::Mbox::MessageParser( {
        'file_name'    => $mbox_file,
        'file_handle'  => $mbox_fh,
        'enable_cache' => 1,
        'enable_grep'  => 1,
    } );
    die $mbox_reader unless ref $mbox_reader;

    while (!$mbox_reader->end_of_file() ) {
        my $message = $mbox_reader->read_next_email();
        #print "Message: $$message\n";
        # my $mtext = $$message;
        my ($date) = $$message =~ m/From - (.*)\n/i;
        $date =~ s/(\r|\n)//g;
        my ($ss,$mm,$hh,$day,$month,$year,$zone) = strptime($date);
        $year += 1900;
        $date = "$day-".$Months[$month]."-$year $hh:$mm:$ss -0600";
        print "Folder: $current_folder Date: $date\n";
        my @flags = ("\\Seen",);
        $imap->append($current_folder,$message,\@flags,$date) || die "Unable to append message: ". $imap->last_error;
        undef $message;
    }

    $current_folder = $prev_folder;
    #$imap->select($current_folder) || die "Unable to select \"$current_folder\":". $imap->last_error;
    #print "Current Folder: "
    #print "$mbox_file\n";
    undef $mbox_reader;
    undef $mbox_fh
}

# Create the IMAP folder unless the server already lists it.
sub imap_make_folder {
    my $folder  = shift @_;
    my @folders = $imap->folders;
    if (! (grep $_ eq $folder,@folders) ) {
        $imap->create_folder($folder) || die "Unable to create folder \"$folder\" : ". $imap->last_error;
    }
}

Replies are listed 'Best First'.
Re: Out of Memory using Mail::Mbox::MessageParser
by afoken (Chancellor) on Jun 18, 2015 at 17:16 UTC

    No direct answer to your question, but maybe some hints:

    I assume that your IMAP server runs on Linux or some other Unix.

    Many IMAP servers (can) use the same mbox format as Thunderbird. Simply copying the Thunderbird mbox files to the Unix server may be sufficient. At least, you could try that.

    Some IMAP servers (can) use the maildir, maildir++, or imapdir formats. All three formats are very similar. Upgrading the maildir format to maildir++ or imapdir is trivial and usually happens automatically. There is a Perl script named perfect_maildir that converts an mbox file to a maildir directory, and from my own experience, the name is justified.

    If you have no direct access to the IMAP storage, or you have to live with a Windows Server, consider using a Linux (virtual) machine with a temporary IMAP server as an intermediate step. Depending on the temporary IMAP server, either just copy the mbox files or use perfect_maildir to create maildir directories. After that, use a tool like imapcopy to copy from the temporary IMAP server to the final IMAP server.

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: Out of Memory using Mail::Mbox::MessageParser
by u65 (Chaplain) on Jun 19, 2015 at 11:04 UTC
Re: Out of Memory using Mail::Mbox::MessageParser
by locked_user sundialsvc4 (Abbot) on Jun 19, 2015 at 11:51 UTC

    Also looking at this code only superficially, I am suspicious of the use of recursion here: process_mbox calls scan_folder, which in turn calls process_mbox again.

    To start, I would add quite a few print STDERR statements throughout the code so that you can observe the recursion that is actually occurring. It could be that at some point it is recursing endlessly or too deeply. In any case, it could be holding on to far more resources than you realize.
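    A minimal sketch of that kind of tracing, assuming nothing about the real mail store: the helper names (trace, walk) and the nested hash standing in for the folder tree are made up for illustration. The same trace calls could be dropped into scan_folder and process_mbox.

```perl
use strict;
use warnings;

our $depth     = 0;   # current nesting level (package variable so "local" works)
my  $max_depth = 0;   # deepest level reached so far

# Hypothetical helper: write an indented progress line to STDERR.
sub trace {
    my ($msg) = @_;
    my $indent = '  ' x $depth;
    print STDERR $indent, "[depth $depth] $msg\n";
}

# Stand-in for the scan_folder/process_mbox recursion; it walks a nested
# hash instead of the file system so the example runs anywhere.
sub walk {
    my ($name, $children) = @_;
    local $depth = $depth + 1;            # restored automatically on return
    $max_depth = $depth if $depth > $max_depth;
    trace("enter $name");
    walk($_, $children->{$_}) for sort keys %$children;
    trace("leave $name");
}

my %tree = ( Inbox => { Work => { Archive => {} }, Family => {} } );
walk('root', \%tree);
print STDERR "max depth reached: $max_depth\n";
```

    If the reported depth keeps growing, or the same folder name shows up again and again, the recursion is the first thing to look at.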

    Therefore, I would suggest redesigning this routine so that it is non-recursive, working instead from a “to-do list.” The list starts by containing only the root folder(s), and as subfolders are encountered their fully-qualified pathnames are added to the list, but those folders are not processed at this time. The main loop therefore processes only one mailbox at a time. If the processing of a mailbox is done by a sub with local variables, the relevant objects will be disposed of as each call returns, and Perl’s memory manager will keep the place tidy.

    If a hash structure is used to maintain the to-do list (folder is key, true/false indicates whether it has been processed yet), any possibility of loops within the logic caused by any loops within the structure would be eliminated:   the algorithm would be able to know if it has ever seen this key before (exists() ...).   Don’t attempt to keep resources (other than the connection) open ... let them be re-created each time.
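    Roughly, the to-do list idea could look like the sketch below. It is not a drop-in replacement for the posted script: find_mboxes and process_one are hypothetical names, only the .msf filter from the original is kept, and the IMAP upload is reduced to a placeholder. The hash here only prevents the same path from being queued twice.

```perl
use strict;
use warnings;
use File::Spec;

# Return every candidate mbox file below $root, using a work list
# instead of recursion.
sub find_mboxes {
    my ($root) = @_;
    my @todo = ($root);          # directories still waiting to be scanned
    my %seen = ($root => 1);     # directories that were already queued
    my @mboxes;

    while (defined(my $dir = shift @todo)) {
        opendir(my $dh, $dir) or do { warn "Cannot open $dir: $!\n"; next };
        my @entries = sort grep { !/^\.\.?$/ } readdir $dh;
        closedir $dh;

        for my $entry (@entries) {
            my $path = File::Spec->catfile($dir, $entry);
            if (-d $path) {
                # Queue the subfolder (e.g. "Inbox.sbd"); do not descend now.
                push @todo, $path unless $seen{$path}++;
            }
            elsif (-f $path && $entry !~ /\.msf$/) {
                push @mboxes, $path;
            }
        }
    }
    return @mboxes;
}

# Placeholder for the real work: open the parser here, append the messages,
# and let the lexicals go out of scope when the sub returns.
sub process_one {
    my ($mbox_file) = @_;
    print STDERR "would process $mbox_file\n";
}

# Only runs when a start directory is given on the command line.
if (@ARGV) {
    process_one($_) for find_mboxes($ARGV[0]);
}
```

    The target IMAP folder name would then have to be derived from each file's path rather than from a global that is updated during recursion.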

    I suspect that this non-recursive approach will be much simpler, easier to debug, and will use fewer resources.

      Good ideas.