Much appreciated! However, I think you're assuming that each e-mail will begin with '^From: ', which is not the case since the SMTP envelope may contain any arbitrary number of lines before then (e.g., 'x-sender:') (I think if I could rely on every e-mail to start with the same sequence--or to have any guaranteed structure--then this might certainly be easier! But from reading the germane RFCs I can't count on that structure necessarily).
Instead, how I have defined the problem is this:
So, it occurred to me that all I need to do is read through the file line-by-line and keep track of three line numbers:
I can find and store these in an array and then do a second pass through the file to subset.
Could I get your opinion on the following? It works, but I am certain it can be improved.
#!/usr/bin/env perl use strict; use Getopt::Std; my %opts; my $FileToHandle; getopts('o:', \%opts); my $k = 0; sub ParseEmail { my $FileToProcess = $_[0]; my @mailBoundaries=(); my $myLine = 0; my $recentBlank = 1; my $previousBlank = 1; # First pass to find out where to split the file... open (FILETOREAD, $FileToProcess) or die "Can't open $FileToProces +s: $!\n"; while (<FILETOREAD>) { if (/^$/) { $recentBlank = $myLine+1; print "Recent Blank: $recentBlank\n"; } if (/^From: / && $previousBlank != $recentBlank) { push(@mailBoundaries, $previousBlank); $previousBlank = $recentBlank; print "PreviousBlank: $previousBlank\n"; } if (eof && $previousBlank == 1) { push(@mailBoundaries, $previousBlank); push(@mailBoundaries, $myLine+1); } elsif (eof && $previousBlank != 1) { push(@mailBoundaries, $myLine+1); } $myLine+=1; print "My Line: $myLine\n"; } close (FILETOREAD); # Second pass to subset the file my $i = 0; while ($i <= ($#mailBoundaries - 1)) { $k+=1; my $j = 0; open (FILETOREAD, $FileToProcess) or die "Can't open $FileToPr +ocess: $!\n"; while (<FILETOREAD>) { $j+=1; if ($j >= @mailBoundaries[$i] && $j <= @mailBoundaries +[$i+1]) { my $FileNameToWrite = $opts{'o +'} . "_" . sprintf("%06d", $k); print "Mail Boundaries:"; print map { "$_ \n" } @mailBoundaries; print "\n"; print "I am going to print lines @mailBoundaries[$ +i] to @mailBoundaries[$i+1] from $FileToProcess to $FileNameToWrite.\ +n"; open (FILETOWRITE, ">>$FileNameToWrite") or die "C +an't open $FileToProcess: $!\n"; print FILETOWRITE $_; } } close (FILETOREAD); $i+=1; } } foreach $FileToHandle (map { glob } @ARGV) { ParseEmail($FileToHandle); }
One way to improve it might be to simply store all the preceding lines in a buffer array, and when I encounter a "From," instead of recording the blank line numbers, write that array to a file. I think this borrows a page from your book inasmuch as I'd be reading more than just a line at a time, but I haven't yet worked out how to know when I have encompassed an "interesting chunk" of the source data and can write it to a file. Working on that now. I'm really not sure how much of performance hit I should expect for very large arrays (considering many of these e-mails may have very large attachments which are even larger when rendered as base64). I would appreciate your thoughts on that as well.
Thanks again--especially for the link to perlvar, very educational.
In reply to Re^2: Subsetting text files containing e-mails
by PeterCap
in thread Subsetting text files containing e-mails
by PeterCap
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |