Re^2: Subsetting text files containing e-mails

Much appreciated! However, I think you're assuming that each e-mail will begin with '^From: ', which is not the case, since the SMTP envelope may contain an arbitrary number of lines before that one (e.g., 'x-sender:'). If I could rely on every e-mail starting with the same sequence (or having any guaranteed structure), this would certainly be easier, but from reading the germane RFCs I can't count on that.

Instead, how I have defined the problem is this:

Find some line that is (almost certainly) going to be in the SMTP envelope (in this case, the only fields I think are almost guaranteed to be there are "From" and "Date").
Find the preceding blank line (since I am almost certain that there are blank lines between each e-mail--of course, there are also blank lines within e-mails as well).

So, it occurred to me that all I need to do is read through the file line-by-line and keep track of three line numbers:

the line number I'm on "right now"
the line number of the last blank line observed
the line number of the blank line before that one

I can find and store these in an array and then do a second pass through the file to subset.

Could I get your opinion on the following? It works, but I am certain it can be improved.

#!/usr/bin/env perl

use strict;
use warnings;

use Getopt::Std;

my %opts;
my $FileToHandle;

getopts('o:', \%opts);

my $k = 0;

sub ParseEmail {

    my $FileToProcess = $_[0];
        
    my @mailBoundaries=();
    
    my $myLine = 0;
    my $recentBlank = 1;
    my $previousBlank = 1;

    # First pass to find out where to split the file...
    
    open (FILETOREAD, $FileToProcess) or die "Can't open $FileToProcess: $!\n";

    while (<FILETOREAD>) {
    
        if (/^$/) { 
            
            $recentBlank = $myLine+1;
            
            print "Recent Blank: $recentBlank\n";
        }

        if (/^From: / && $previousBlank != $recentBlank) { 
            
            push(@mailBoundaries, $previousBlank);
            $previousBlank = $recentBlank;
            print "PreviousBlank: $previousBlank\n";
        }

        if (eof && $previousBlank == 1) {
        
            push(@mailBoundaries, $previousBlank);
            push(@mailBoundaries, $myLine+1);
        
        } 
        
        elsif (eof && $previousBlank != 1) {
        
            push(@mailBoundaries, $myLine+1);
            
        }
        
        $myLine+=1;
        print "My Line: $myLine\n";
    }

    close (FILETOREAD);

    # Second pass to subset the file
    
    my $i = 0;
    
    while ($i <= ($#mailBoundaries - 1)) {
        $k+=1;
        my $j = 0;
        
        open (FILETOREAD, $FileToProcess) or die "Can't open $FileToProcess: $!\n";
        
            while (<FILETOREAD>) {

                $j+=1;
            
                if ($j >= $mailBoundaries[$i] && $j <= $mailBoundaries[$i+1]) {

                    my $FileNameToWrite = $opts{'o'} . "_" . sprintf("%06d", $k);

                    print "Mail Boundaries:";
                    print map { "$_ \n" } @mailBoundaries;
                    print "\n";
                    print "I am going to print lines $mailBoundaries[$i] to $mailBoundaries[$i+1] from $FileToProcess to $FileNameToWrite.\n";

                    # Append this line to the current output file; on failure, report the output file name
                    open (FILETOWRITE, ">>$FileNameToWrite") or die "Can't open $FileNameToWrite: $!\n";

                    print FILETOWRITE $_;

                    close (FILETOWRITE);

                }
                
            }
        
        close (FILETOREAD);
                    
    $i+=1;
    
    }
        
}

foreach $FileToHandle (map { glob } @ARGV) {
    ParseEmail($FileToHandle);
}

One way to improve it might be to simply store all the preceding lines in a buffer array and, when I encounter a "From", write that array to a file instead of recording the blank line numbers. I think this borrows a page from your book inasmuch as I'd be reading more than just a line at a time, but I haven't yet worked out how to know when I have encompassed an "interesting chunk" of the source data and can write it to a file. Working on that now. I'm really not sure how much of a performance hit I should expect for very large arrays (many of these e-mails may have very large attachments, which are even larger when rendered as base64). I would appreciate your thoughts on that as well.
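
A minimal single-pass sketch of that buffering idea, under the post's own assumption that a message starts at the last blank line before a `From: ` line. The buffer only ever holds the lines read since the most recent blank line, and it is written out whenever the next blank line (or end of file) arrives, which is one way to decide that an "interesting chunk" is complete. The names `split_mails`, `$in_path`, and `$out_prefix` are hypothetical, and the `/^$/` test assumes LF line endings.

```perl
use strict;
use warnings;

# Split one file into per-message files in a single pass.
# Returns the number of output files written.
sub split_mails {
    my ($in_path, $out_prefix) = @_;    # hypothetical parameter names

    open my $in, '<', $in_path or die "Can't open $in_path: $!\n";

    my $count    = 0;    # output files opened so far
    my $out;             # handle of the current output file
    my @block;           # lines read since the last blank line
    my $has_from = 0;    # does @block contain a From: line?

    # Write the buffered block, starting a new output file first if the
    # block holds a From: line (or if no output file exists yet).
    my $flush = sub {
        return unless @block;
        if (!$out || $has_from) {
            close $out if $out;
            my $name = sprintf "%s_%06d", $out_prefix, ++$count;
            open $out, '>', $name or die "Can't create $name: $!\n";
        }
        print {$out} @block;
        @block    = ();
        $has_from = 0;
    };

    while (my $line = <$in>) {
        $flush->() if $line =~ /^$/;    # a blank line closes the previous block
        push @block, $line;             # the blank line itself opens the next block
        $has_from = 1 if $line =~ /^From: /;
    }
    $flush->();                         # write whatever is left at end of file

    close $out if $out;
    close $in;
    return $count;
}
```

Called as `split_mails('mbox.txt', 'mail')`, this would write `mail_000001`, `mail_000002`, and so on. Memory use is bounded by the largest run of lines without a blank line, so a base64 attachment (which usually contains no blank lines) is held in the buffer as a whole, but the rest of the file is not.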

Thanks again--especially for the link to perlvar, very educational.

Re^3: Subsetting text files containing e-mails by GrandFather (Saint) on Jan 27, 2012 at 07:34 UTC
"I think you're assuming that each e-mail will begin with '^From: '"

Actually, no. `/^From:/im` performs a case-insensitive multi-line match. The ^ anchors to the start of any line (and is unaffected by setting $/), so the match will find "From" at the start of the string or at the start of any following newline-delimited "line".

Try taking the sample code I provided and reordering the header lines, adding new header lines, whatever takes your fancy, so long as you don't add bogus blank lines before the "From" line.

Another useful link may be perlretut. There's a lot of reading there, but it will be worth the time working through it!

True laziness is hard work
Re^4: Subsetting text files containing e-mails by PeterCap (Initiate) on Jan 27, 2012 at 08:26 UTC
Aha! I get it. So essentially, when a paragraph is found that contains '^From:', it places a marker at the beginning of that paragraph? I could not figure out how it was handling all the blank lines within the e-mails until I realized that it wasn't and didn't need to.

Just to be clear, in order to actually subset the file I would still need to close and reopen it, right? I'm thinking something like:

open (MYDATA, $filein);
while (<MYDATA>) {
    if (/^---- Email 1/ ... /^---- Email 2/) {
        # append, so that each matching line is kept rather than overwritten
        open (MYOUTPUT, ">>$fileout");
        print MYOUTPUT $_;
        close (MYOUTPUT);
    }
}
close (MYDATA);

I suppose I might create a loop so that a new value for the search terms (i.e., `/^---- Email 2/ ... /^---- Email 3/` for the second iteration, etc.) is selected as well as a new output file to catch the results...
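
For reference, a small sketch of how the `...` range operator behaves in that kind of loop: it turns true on the line matching the left pattern and stays true up to and including the line matching the right one. The marker lines and the `@picked` array are made up for illustration.

```perl
use strict;
use warnings;

my @lines = (
    "---- Email 1\n", "first body\n",
    "---- Email 2\n", "second body\n",
    "---- Email 3\n",
);

my @picked;
for (@lines) {
    # True from the "Email 1" marker through the "Email 2" marker, inclusive.
    push @picked, $_ if /^---- Email 1/ ... /^---- Email 2/;
}

# Prints the two marker lines and "first body" between them.
print @picked;
```

Since both end markers are included, consecutive ranges such as Email 1 to Email 2 and Email 2 to Email 3 would each contain the shared marker line.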
Re^5: Subsetting text files containing e-mails by GrandFather (Saint) on Jan 27, 2012 at 09:17 UTC
You don't need more than one pass through the source file. Just create the output files as you need them. In sketch, you'd have something like:

use strict;
use warnings;

my $emailNum;
my $outFile;

$/ = ''; # Set readline to "Paragraph mode"

while (<DATA>) {
    if (!$emailNum || /^From:/im) {
        close $outFile if $outFile;
        my $fname = sprintf "mails_%06d.txt", ++$emailNum;
        open $outFile, '>', $fname or die "Can't create $fname: $!\n";
    }

    print $outFile $_;
}
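
To see which chunks a loop like this would treat as the start of a new e-mail, here is a small sketch that reads made-up sample data from an in-memory filehandle in paragraph mode and tests each chunk with and without the /m flag. The sample text and the variable names are assumptions for illustration only.

```perl
use strict;
use warnings;

# Two made-up messages; in both, From: is not the first header line.
my $sample = "x-sender: a\@example.com\nFrom: a\@example.com\n\nHello,\n\nbody text\n\n"
           . "Date: Fri, 27 Jan 2012\nfrom: b\@example.com\n\nSecond body\n";

open my $in, '<', \$sample or die "Can't open in-memory file: $!\n";

my (@with_m, @without_m);
{
    local $/ = '';    # paragraph mode: each read returns one blank-line-delimited chunk
    while (my $para = <$in>) {
        # With /m, ^ matches after any newline inside the chunk; /i ignores case.
        push @with_m,    ($para =~ /^From:/im ? 1 : 0);
        # Without /m, ^ matches only at the very start of the chunk.
        push @without_m, ($para =~ /^From:/i  ? 1 : 0);
    }
}
close $in;

print "with /m:    @with_m\n";       # prints "with /m:    1 0 0 1 0"
print "without /m: @without_m\n";    # prints "without /m: 0 0 0 0 0"
```

Only the two header chunks match with /m, regardless of where the From line sits among the headers, and the blank lines inside the first message body do not start a new file.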