Much appreciated! However, I think you're assuming that each e-mail will begin with '^From: ', which is not the case: the message header may contain any number of lines before that one (e.g., 'x-sender:'). If I could rely on every e-mail starting with the same sequence, or having any guaranteed structure, this would certainly be easier, but from reading the germane RFCs I can't count on that.

Instead, how I have defined the problem is this:

1. Find some line that is (almost certainly) going to be in the message header. The only fields I think are almost guaranteed to be there are "From" and "Date".
2. Find the preceding blank line. I am almost certain that there is a blank line between consecutive e-mails, although there are of course blank lines within e-mails as well.

So, it occurred to me that all I need to do is read through the file line-by-line and keep track of three line numbers:

the line number I'm on "right now"
the line number of the last blank line observed
the line number of the blank line before that one (the boundary already recorded for the previous e-mail)

I can find and store these in an array and then do a second pass through the file to subset.

Could I get your opinion on the following? It works, but I am certain it can be improved.

#!/usr/bin/env perl

use strict;
use warnings;

use Getopt::Std;

my %opts;
my $FileToHandle;

getopts('o:', \%opts);

my $k = 0;

sub ParseEmail {

    my $FileToProcess = $_[0];
        
    my @mailBoundaries=();
    
    my $myLine = 0;
    my $recentBlank = 1;
    my $previousBlank = 1;

    # First pass to find out where to split the file...
    
    open (FILETOREAD, '<', $FileToProcess) or die "Can't open $FileToProcess: $!\n";

    while (<FILETOREAD>) {
    
        if (/^$/) { 
            
            $recentBlank = $myLine+1;
            
            print "Recent Blank: $recentBlank\n";
        }

        if (/^From: / && $previousBlank != $recentBlank) { 
            
            push(@mailBoundaries, $previousBlank);
            $previousBlank = $recentBlank;
            print "PreviousBlank: $previousBlank\n";
        }

        if (eof && $previousBlank == 1) {
        
            push(@mailBoundaries, $previousBlank);
            push(@mailBoundaries, $myLine+1);
        
        } 
        
        elsif (eof && $previousBlank != 1) {
        
            push(@mailBoundaries, $myLine+1);
            
        }
        
        $myLine+=1;
        print "My Line: $myLine\n";
    }

    close (FILETOREAD);

    # Second pass to subset the file: message $i spans lines
    # $mailBoundaries[$i] through $mailBoundaries[$i+1], inclusive.

    print "Mail Boundaries:";
    print map { "$_ \n" } @mailBoundaries;
    print "\n";

    my $i = 0;

    while ($i <= ($#mailBoundaries - 1)) {
        $k+=1;
        my $j = 0;

        my $FileNameToWrite = $opts{'o'} . "_" . sprintf("%06d", $k);

        print "I am going to print lines $mailBoundaries[$i] to $mailBoundaries[$i+1] from $FileToProcess to $FileNameToWrite.\n";

        open (FILETOREAD, '<', $FileToProcess) or die "Can't open $FileToProcess: $!\n";

        # Open the output file once per message rather than once per line.
        open (FILETOWRITE, '>>', $FileNameToWrite) or die "Can't open $FileNameToWrite: $!\n";

        while (<FILETOREAD>) {

            $j+=1;

            # Single elements need the $ sigil; @mailBoundaries[$i] is a slice.
            if ($j >= $mailBoundaries[$i] && $j <= $mailBoundaries[$i+1]) {

                print FILETOWRITE $_;

            }

        }

        close (FILETOWRITE);
        close (FILETOREAD);

        $i+=1;

    }
        
}

foreach $FileToHandle (map { glob } @ARGV) {
    ParseEmail($FileToHandle);
}

One way to improve it might be to store all the preceding lines in a buffer array and, when I encounter a "From", write that array to a file instead of recording the blank line numbers. I think this borrows a page from your book inasmuch as I'd be reading more than just a line at a time, but I haven't yet worked out how to know when I have encompassed an "interesting chunk" of the source data and can write it to a file. Working on that now. I'm also not sure how much of a performance hit I should expect for very large arrays, considering that many of these e-mails may have very large attachments, which are even larger when rendered as base64. I would appreciate your thoughts on that as well.
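A rough sketch of that buffering idea, in a single pass, might look like the following. It is only an illustration under the same heuristic as the script above (a new e-mail starts after the last blank line that precedes a "From: " line), not a tested mailbox parser. The names split_messages and $emit are hypothetical, and treating whitespace-only lines as blank is an assumption made so that files with CRLF line endings still split.

```perl
use strict;
use warnings;

# Read lines from the filehandle $in and call $emit->(\@lines) once per message.
# Heuristic (an assumption carried over from the post): a new message starts
# after the last blank line that precedes a "From: " line.
sub split_messages {
    my ($in, $emit) = @_;
    my @buffer;          # lines read since the last flush
    my $blank_at;        # index in @buffer of the most recent blank line
    my $seen_from = 0;   # true once the buffered message has shown a From: line

    while (my $line = <$in>) {
        if ($line =~ /^From: /) {
            if ($seen_from && defined $blank_at) {
                # Everything up to and including that blank line is the
                # previous message; whatever follows it stays in the buffer.
                my @message = splice(@buffer, 0, $blank_at + 1);
                $emit->(\@message);
            }
            $seen_from = 1;
            undef $blank_at;
        }
        push @buffer, $line;
        # Whitespace-only lines count as blank (also covers CRLF endings).
        $blank_at = $#buffer if $line =~ /^\s*$/;
    }
    $emit->(\@buffer) if @buffer;   # flush the final message
    return;
}

# Demo on an in-memory file; a real script would open the mailbox file instead.
my $text = join "\n",
    'x-sender: a@example.com', 'From: a@example.com', '', 'Hello', '',
    'x-sender: b@example.com', 'From: b@example.com', '', 'Bye', '';
open my $in, '<', \$text or die "Can't open in-memory file: $!\n";
my $count = 0;
split_messages($in, sub {
    my ($lines) = @_;
    $count++;
    printf "message %d: %d lines\n", $count, scalar @$lines;
});
close $in;
```

With this shape the buffer never holds more than one e-mail plus the first few header lines of the next, so memory use is bounded by the largest single message rather than by the whole file. To write files instead of printing counts, the callback could open a numbered output file and print @$lines to it.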

Thanks again--especially for the link to perlvar, very educational.

In reply to Re^2: Subsetting text files containing e-mails by PeterCap
in thread Subsetting text files containing e-mails by PeterCap
