Strange problem trying to clean garbage from start of mailbox file

capoeiraolly has asked for the wisdom of the Perl Monks concerning the following question:

Hey all, I'm writing a perl script to fix corrupted sendmail mailboxes. Basically, the very first thing in a mailbox has to be the word From, otherwise users can't log in. From time to time bits of garbage get written to the start of the mailbox files, so I've written a perl script to remove those lines of garbage.

The script works fine on my machine (Debian), but when I stick it on my mail server (BSD) it throws a hissy fit: instead of reliably removing the garbage, it sometimes works and sometimes wipes the entire mailbox file. It's kind of driving me nuts, so any help would be really appreciated.

Here's the script :

# --------------------------------------------------------------------
+----------------------------------------------------------
# This perl script is designed to run through all of the mailboxes in 
+/mnt/mail on
# mail.visp.co.nz removing all corrupted data from the start of any co
+rrupted
# mailboxes.
#
# Coded By : Oliver Sneyd
# When : February 2006
# Contact : oliver.sneyd@mail.iconz.net
# -----------------------------------------------------------------------------------------------------------------------------

# Include the file statistics object so that the script can check the filesizes of each mailbox
use File::stat;

# Path to the maildir, /mnt/mail for mail.visp.co.nz
$path = "./mail/";
$backupPath = "./backup/";

# Open up a directory handle
print "\n\tGenerating mailbox list ...\n";
opendir(MAILDIR, $path);

# Read the names of each entry in the maildir into an array
@filenames = readdir(MAILDIR);

# Create an array to hold the mailboxes
my @mailboxes = ();

# Loop through the results returned by the directory handle
for($i = 0; $i < @filenames; $i++)
{
    # If the result returned by directory handle is NOT a directory 
    if(not(-d ($path . $filenames[$i])))
    {
        # Work out the file-size of the current mailbox
        $size = stat($path . $filenames[$i]);
        
        # If the filesize is greater than 0, add it to the mailbox list
        if($size->size > 0)
        {
            push(@mailboxes, $filenames[$i]);
        }
    }
}

# Loop through the mailboxes
print "\tChecking for corrupt mailboxes ...\n\n";
while(@mailboxes > 0)
{
    $mailbox = pop(@mailboxes);
    checkMailbox($mailbox);
}
print "\n\tDone.\n\n";

# Close the directory handle
closedir(MAILDIR);


# ------------------------------------------------------------ FUNCTIONS ---------------------------------------------------


sub checkMailbox
{
    # Set a corrupt variable to be true
    $corrupt = 1;
    
    # Loop until corrupt is false
    $initial = 0;
    while($corrupt == 1)
    {
        # Open up the mailbox
        open(MAILBOX, ($path . $_[0]));
        
        # Read in the first line of the mailbox 
        $line = <MAILBOX>;
        
        # Get the index of the string "From"
        $idx = index($line, "From");
        
        # If the index of "From" is 0, the mailbox isn't corrupted any more
        if($idx == 0)
        {
            # So set corrupted to false
            $corrupt = 0;
        }
        else
        {
            # Make a backup of the corrupted mailbox, just in case
            if($initial == 0)
            {
                print "\tFixing $_[0] ...\n";
                system("cp $path" . $_[0] . " " . $backupPath . ".");
                $initial = 1;
            }
            
            # And remove the first line of the mailbox
            system("sed -e '1d' $path" . "$_[0] | more > $path" .  $_[0]);
        }
        
        # Close the mailbox
        close(MAILBOX);
    }
    
}

2006-02-03 Retitled by planetscape, as per Monastery guidelines
Original title: 'Strange Problem'


Replies are listed 'Best First'.
Re: Strange problem trying to clean garbage from start of mailbox file by martin (Friar) on Feb 02, 2006 at 23:09 UTC
The main culprit seems to be this line:

    system("sed -e '1d' $path" . "$_[0] | more > $path" . $_[0]);

Oops! You want to edit a file "in place", but by redirecting output to the same location your input is supposed to come from, you effectively truncate that file before it can be processed.

When updating file contents you should make sure input and output don't interfere. One approach could be like this: since you already use perl to read the first line, why don't you just read on until you find a "From" line, and then start copying that and what follows to another file? Finally, you can move the result back to the original location.

Of course, perl has builtins that can do most of the work for you. For example:

    perl -n -i.bak -e 'print if /^From/..-1' mail_file

This snippet removes all lines before the first occurrence of a line starting with the four letters F, r, o, m from `mail_file`, leaving a backup of the original in `mail_file.bak`.

You should also make sure no mails are delivered while you are working on real-life mailbox hierarchies.
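A minimal sketch of the copy-then-move approach described above, written out as a plain Perl subroutine. The name `strip_leading_garbage`, the `.tmp` and `.bak` suffixes, and the stricter `/^From /` pattern (with a trailing space, as in the mbox separator line) are assumptions for illustration; the one-liner above matches `/^From/` without the space.

```perl
use strict;
use warnings;
use File::Copy qw(copy);

# Hypothetical helper: drop every line before the first line that starts
# with "From " and return the number of lines removed.
sub strip_leading_garbage {
    my ($file) = @_;
    my $tmp = "$file.tmp";    # assumed scratch name in the same directory

    # Keep a backup first, much like the -i.bak switch does.
    copy($file, "$file.bak") or die "Cannot back up $file: $!";

    open(my $in,  '<', $file) or die "Cannot read $file: $!";
    open(my $out, '>', $tmp)  or die "Cannot write $tmp: $!";

    my ($seen_from, $dropped) = (0, 0);
    while (my $line = <$in>) {
        $seen_from = 1 if !$seen_from && $line =~ /^From /;
        if ($seen_from) {
            print {$out} $line or die "Write to $tmp failed: $!";
        }
        else {
            $dropped++;
        }
    }
    close($in);
    close($out) or die "Cannot close $tmp: $!";

    # No "From " line at all: leave the original untouched.
    if (!$seen_from) {
        unlink $tmp;
        return 0;
    }

    # Replace the original only after the cleaned copy is complete.
    rename($tmp, $file) or die "Cannot replace $file: $!";
    return $dropped;
}
```

Because the input and the output are different files until the final `rename`, the reader never sees a truncated file. One caveat: the replacement file is created by whoever runs the script, so ownership and permissions on a real mail spool may need to be restored afterwards.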
Re^2: Strange problem trying to clean garbage from start of mailbox file by capoeiraolly (Initiate) on Feb 02, 2006 at 23:40 UTC
I'll give the perl command a go, but the sed command does actually work... give it a go. If you have a text file with, say, three lines in it:

    line 1
    line 2
    line 3

the result of that system call (I've tried it on both Debian and BSD) is:

    line 2
    line 3

Of course I will make sure that no mail is delivered to the mailbox while I'm messing around with it :)
Re^3: Strange problem trying to clean garbage from start of mailbox file by martin (Friar) on Feb 03, 2006 at 03:11 UTC
Your shell command line might sometimes work, but the problem is precisely that it is not guaranteed to do so. The reason is that the `>file` part clobbers the very same file that is supposed to be read by the `sed -e '1d' file` part. If there was only one process involved, the outcome would be quite predictable. However, since you constructed a pipeline of two processes, there is a chance that the first one wins the race and catches a portion of the file before the file is destroyed by the second one. As you already observed, you cannot rely on that.

To solve that problem you can use a temporary file (like `perl -i` does behind the scenes) or read and write to the file through a single file handle in a single process, which may prove somewhat more difficult to get right. If you are interested anyway, you may want to look up file access modes in perlopentut, specifically `+<`. You also might find the truncate function useful. The Perl Cookbook has excellent explanations of the different techniques.
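For comparison, here is a rough sketch of the single-handle variant mentioned above, using the `+<` mode together with `seek` and `truncate`. The subroutine name `strip_in_place` is hypothetical, and the sketch assumes the rest of the mailbox fits in memory, which may not hold for large spools.

```perl
use strict;
use warnings;

# Hypothetical helper: remove leading garbage through one read/write handle.
# Returns the number of bytes removed from the front of the file.
sub strip_in_place {
    my ($file) = @_;

    # '+<' opens for reading and writing without truncating the file.
    open(my $fh, '+<', $file) or die "Cannot open $file: $!";
    binmode($fh);

    # Find the byte offset of the first line starting with "From ".
    my ($offset, $found) = (0, 0);
    while (my $line = <$fh>) {
        if ($line =~ /^From /) { $found = 1; last; }
        $offset += length $line;
    }

    # Already clean, or no "From " line at all: leave the file alone.
    if (!$found || $offset == 0) {
        close($fh);
        return 0;
    }

    # Slurp the good part, rewind, overwrite from the start, then cut
    # off the stale tail that is left over at the end.
    seek($fh, $offset, 0) or die "seek failed: $!";
    my $rest = do { local $/; <$fh> };
    seek($fh, 0, 0) or die "seek failed: $!";
    print {$fh} $rest or die "write failed: $!";
    truncate($fh, length $rest) or die "truncate failed: $!";
    close($fh) or die "close failed: $!";
    return $offset;
}
```

Unlike the temporary-file approach, this keeps the same file (and therefore its ownership and permissions), but it leaves no backup, and an interruption between the write and the `truncate` would leave the mailbox in a half-rewritten state.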
Re^2: Strange problem trying to clean garbage from start of mailbox file by capoeiraolly (Initiate) on Feb 03, 2006 at 02:41 UTC
Works beautifully. Thank you for the help :)
Re: Strange problem trying to clean garbage from start of mailbox file by graff (Chancellor) on Feb 03, 2006 at 03:53 UTC
Now that martin has solved your basic problem, I'd just like to point out how short your code could be:

    my $path = "./mail";
    my $bkup = "./backup";
    opendir MAILDIR, $path;
    for my $mbox ( grep { -f "$path/$_" and -s _ } readdir MAILDIR ) {
        rename "$path/$mbox", "$bkup/$mbox";
        system( "perl -ne 'print if /^From/..-1' $bkup/$mbox > $path/$mbox" );
    }

(That assumes that the backup directory is not on a distinct disk volume.)

update: as capoeiraolly points out below, that version backs up all mailbox files, not just the ones that need fixing. To avoid that, just add a few lines at the top of the for loop:

    for my $mbox ( grep { -f "$path/$_" and -s _ } readdir MAILDIR ) {
        my $first = do { open M, "$path/$mbox"; <M> };
        close M;
        next if ( $first =~ /^From / );
        # ... do rename and system calls on bad files only
        rename "$path/$mbox", "$bkup/$mbox";
        system( "perl -ne 'print if /^From/..-1' $bkup/$mbox > $path/$mbox" );
    }
Re^2: Strange problem trying to clean garbage from start of mailbox file by capoeiraolly (Initiate) on Feb 03, 2006 at 04:19 UTC
Cheers for that... Won't this code create a backup of every mailbox instead of just the corrupted ones?
Re: Strange problem trying to clean garbage from start of mailbox file by ptum (Priest) on Feb 02, 2006 at 22:17 UTC
This doesn't explain the problem you're having, but have you considered the possibility that the 'garbage' at the start of your mailbox files might span multiple lines? Your code doesn't account for that possibility. More to the point, you don't check for failure on open, yet later you use system to overwrite the file. While it doesn't seem likely that you would be able to overwrite a file that you couldn't open, it makes me nervous to see you opening a file without checking the result of that open.
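As an illustration of the open-checking point, a small sketch that refuses to guess when a mailbox cannot be read. The helper names `first_line` and `mailbox_status`, and the four status strings, are made up for this example; the `/^From /` test assumes the usual mbox separator line.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first line of a file, or die with the
# OS error message if the file cannot be opened.
sub first_line {
    my ($file) = @_;
    open(my $fh, '<', $file) or die "Cannot open $file: $!\n";
    my $line = <$fh>;
    close($fh);
    return $line;    # undef for an empty file
}

# Hypothetical helper: classify a mailbox so the caller only rewrites
# files that were actually read and found to start with garbage.
sub mailbox_status {
    my ($file) = @_;
    my $line = eval { first_line($file) };
    return 'unreadable' if $@;                 # open failed, do not touch
    return 'empty'      if !defined $line;     # nothing to fix
    return 'ok'         if $line =~ /^From /;  # starts with a From line
    return 'corrupt';                          # leading garbage
}
```

A caller would then back up and repair only the files reported as `corrupt`, and log the `unreadable` ones instead of passing them on to a command that overwrites them.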
Re^2: Strange problem trying to clean garbage from start of mailbox file by capoeiraolly (Initiate) on Feb 02, 2006 at 22:37 UTC
Ok, the problem is that the script seems to randomly empty the mailbox files (on the BSD box) instead of just removing single/multiple lines of garbage. Sometimes it will work on a mailbox, sometimes it won't... I haven't seen any pattern to it yet. I haven't bothered with checking the file opening yet because I'm still just testing. It's a contained environment that I'm testing this in, not the actual mailboxes. The code should account for multiple lines of garbage: the while loop reads the file a line at a time until the file is either empty or until the index of "From" is 0. The test I'm running on BSD (errors) is exactly the same as the one that works on my Debian machine (no errors). I'm only running it on about 10 mailboxes, with a mixture of corrupted and non-corrupted files.
Re: Strange problem trying to clean garbage from start of mailbox file by rhesa (Vicar) on Feb 02, 2006 at 22:48 UTC
Why the pipe through more in your system call to sed? I'd also prefer to close MAILBOX before calling system commands on the file, but that may be irrelevant.
Re^2: Strange problem trying to clean garbage from start of mailbox file by capoeiraolly (Initiate) on Feb 02, 2006 at 22:54 UTC
If you don't pipe the sed through more it simply wipes the file. Good idea to close the file handle first. Doesn't do anything for the problem though.
Re^3: Strange problem trying to clean garbage from start of mailbox file by rhesa (Vicar) on Feb 02, 2006 at 23:09 UTC
I'm not a sed expert by any stretch of the imagination, but wouldn't `sed -i -e '1d' $file` be a more idiomatic way to write it? I believe that piping output into the file you're stream-editing is not the most reliable thing to do. In fact, I'm pretty sure that's why your buffering by "| more" prevents the file from being clobbered.
Re^4: Strange problem trying to clean garbage from start of mailbox file by capoeiraolly (Initiate) on Feb 02, 2006 at 23:42 UTC
Re^4: Strange problem trying to clean garbage from start of mailbox file by capoeiraolly (Initiate) on Feb 02, 2006 at 23:45 UTC
Re^5: Strange Problem by graff (Chancellor) on Feb 03, 2006 at 03:29 UTC