in reply to Re: Subsetting text files containing e-mails
in thread Subsetting text files containing e-mails
Much appreciated! However, I think you're assuming that each e-mail will begin with '^From: ', which is not the case: the SMTP envelope may contain an arbitrary number of lines before that (e.g., 'x-sender:'). If I could rely on every e-mail starting with the same sequence, or having any guaranteed structure, this would certainly be easier, but from reading the germane RFCs I can't count on that structure.
Instead, how I have defined the problem is this:
So, it occurred to me that all I need to do is read through the file line-by-line and keep track of three line numbers:
I can find and store these in an array and then do a second pass through the file to subset.
Could I get your opinion on the following? It works, but I am certain it can be improved.
    #!/usr/bin/env perl
    use strict;
    use Getopt::Std;

    my %opts;
    my $FileToHandle;
    getopts('o:', \%opts);
    my $k = 0;

    sub ParseEmail {
        my $FileToProcess = $_[0];
        my @mailBoundaries=();
        my $myLine = 0;
        my $recentBlank = 1;
        my $previousBlank = 1;

        # First pass to find out where to split the file...
        open (FILETOREAD, $FileToProcess) or die "Can't open $FileToProcess: $!\n";
        while (<FILETOREAD>) {
            if (/^$/) {
                $recentBlank = $myLine+1;
                print "Recent Blank: $recentBlank\n";
            }
            if (/^From: / && $previousBlank != $recentBlank) {
                push(@mailBoundaries, $previousBlank);
                $previousBlank = $recentBlank;
                print "PreviousBlank: $previousBlank\n";
            }
            if (eof && $previousBlank == 1) {
                push(@mailBoundaries, $previousBlank);
                push(@mailBoundaries, $myLine+1);
            } elsif (eof && $previousBlank != 1) {
                push(@mailBoundaries, $myLine+1);
            }
            $myLine+=1;
            print "My Line: $myLine\n";
        }
        close (FILETOREAD);

        # Second pass to subset the file
        my $i = 0;
        while ($i <= ($#mailBoundaries - 1)) {
            $k+=1;
            my $j = 0;
            open (FILETOREAD, $FileToProcess) or die "Can't open $FileToProcess: $!\n";
            while (<FILETOREAD>) {
                $j+=1;
                if ($j >= @mailBoundaries[$i] && $j <= @mailBoundaries[$i+1]) {
                    my $FileNameToWrite = $opts{'o'} . "_" . sprintf("%06d", $k);
                    print "Mail Boundaries:";
                    print map { "$_ \n" } @mailBoundaries;
                    print "\n";
                    print "I am going to print lines @mailBoundaries[$i] to @mailBoundaries[$i+1] from $FileToProcess to $FileNameToWrite.\n";
                    open (FILETOWRITE, ">>$FileNameToWrite") or die "Can't open $FileToProcess: $!\n";
                    print FILETOWRITE $_;
                }
            }
            close (FILETOREAD);
            $i+=1;
        }
    }

    foreach $FileToHandle (map { glob } @ARGV) {
        ParseEmail($FileToHandle);
    }
One way to improve it might be to store the preceding lines in a buffer array and, when I encounter a "From: " line, write that array to a file instead of recording the blank-line numbers. I think this borrows a page from your book inasmuch as I'd be reading more than one line at a time, but I haven't yet worked out how to know when I have encompassed an "interesting chunk" of the source data and can write it to a file. Working on that now. I'm also not sure how much of a performance hit to expect from very large arrays, since many of these e-mails may have very large attachments, which are larger still when rendered as base64. I would appreciate your thoughts on that as well.
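A minimal single-pass sketch of that buffering idea, not a finished solution. It assumes the same boundary rule as the two-pass script: a message ends at the most recent blank line before a header block containing a "From: " line. The names `split_messages` and `$emit`, and the `PREFIX FILE...` calling convention, are hypothetical and chosen only for illustration. One difference from the two-pass version: the separating blank line is written only to the end of the earlier message rather than to both files.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: read $fh line by line and call $emit->($lines)
# once per message, where $lines is an array reference.
sub split_messages {
    my ($fh, $emit) = @_;
    my $current = [];           # lines up to and including the last blank line
    my @tail;                   # lines seen since the last blank line
    my $current_has_from = 0;   # has $current already seen a "From: " line?
    my $tail_has_from    = 0;   # has @tail already seen a "From: " line?

    while (my $line = <$fh>) {
        if ($line =~ /^$/) {
            # Blank line: everything buffered so far belongs to one message.
            push @$current, @tail, $line;
            $current_has_from ||= $tail_has_from;
            @tail          = ();
            $tail_has_from = 0;
            next;
        }
        if ($line =~ /^From: /) {
            if ($current_has_from && !$tail_has_from) {
                # A new header block has started: flush the finished message.
                $emit->($current);
                $current          = [];
                $current_has_from = 0;
            }
            $tail_has_from = 1;
        }
        push @tail, $line;
    }

    # End of input: whatever is left is the final message.
    push @$current, @tail;
    $emit->($current) if @$current;
    return;
}

# Assumed usage: perl split.pl PREFIX FILE...
my ($prefix, @files) = @ARGV;
my $count = 0;
for my $file (@files) {
    open my $in, '<', $file or die "Can't open $file: $!\n";
    split_messages($in, sub {
        my ($lines) = @_;
        my $name = sprintf '%s_%06d', $prefix, ++$count;
        open my $out, '>', $name or die "Can't open $name: $!\n";
        print {$out} @$lines;
        close $out or die "Can't close $name: $!\n";
    });
    close $in;
}
```

Because the buffer is flushed at each boundary, it holds at most one message at a time, so memory use should track the largest single e-mail rather than the whole file. The same caveat as the two-pass version applies: a body line beginning with "From: " would still trigger a split at the preceding blank line.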
Thanks again--especially for the link to perlvar, very educational.