That, my friend, is one ugly looking problem.
Without being able to provide a comprehensive fix, let me throw
out some tidbits that have helped me in the past when tackling
this kind of problem.
- How big is your dataset?
If it's small, maybe you can examine your data by hand and come up
with a pretty good feel for the subtleties of the dataset - like
this overlap. If it's bigger, then you might wind up writing
small throwaway or incremental programs just to analyze the nature of the data.
Some of the things you might find / some things to look for:
- Is there more than one mail message sent per connection? If so,
timestamps may be off, since the example you gave shows them tied to
connect/disconnects. A long-lived connection could span hours, and have
multiple messages. Maybe not a problem, but something to be aware of.
- As pointed out in the Altivore project (an open-source
version of the FBI's Carnivore mail snooper), SMTP connections aren't always
as cut and dried as you've seen so far. Lost connections, RSET commands, and the like
can botch up your parsing if you're not looking for them. In the worst case,
you could wind up creating and trying to synchronize multiple state machines
that understand SMTP - yech. (Although an interesting problem!)
- Along those lines, how big is the overlap problem?
Write a small script to count how many "clean" log entries there are -
sequential entries that cover an entire connection, or maybe just an
entire message (MAIL FROM, RCPT TO, DATA). Keep counts of clean vs. interleaved
log sequences - maybe there's an acceptable loss if you discard the 10% which are
interleaved? Depends on your specific requirements. Some code might look like:
# Use a stack to keep track of concurrent connections
my @connections;
my ($count_of_connections, $count_of_overlaps) = (0, 0);
while (<MERLOG>) {
    # Assume that seeing "Connection" before seeing "QUIT"
    # implies an interleaved connection record.
    if (/^Connection/) {
        ++$count_of_connections;
        push(@connections, $_);
        ++$count_of_overlaps if @connections > 1;
    }
    pop(@connections) if /^QUIT/;
}
print "$count_of_connections connections logged, $count_of_overlaps overlapped\n";
print "Leaving ", $count_of_connections - $count_of_overlaps, " clean connections logged.\n";
- Assuming you've got a good sized dataset, cache data extracted
from intensive queries. If it takes 20 minutes to assemble the list
of all users that received mail, cache that result away somewhere, and you'll
save 20 minutes (possibly 20 minutes times the N tries it takes to get something right)
the next time you need that data - maybe to correlate it with senders extracted
by another script. You'll have great luck using Data::Dumper for that,
bundled with most modern versions of Perl.
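A rough sketch of that caching idea, in case it helps. The cache file name, the load_or_build helper, and the expensive_recipient_scan stand-in are all made up for illustration; swap in your own slow extraction step. It dumps the structure with Data::Dumper and reloads it with do, which evaluates the file and hands back the value of its last statement.

```perl
use strict;
use warnings;
use Data::Dumper;
use File::Spec;

# Hypothetical cache file name; use whatever suits your setup.
my $cache_file = 'recipients.cache';

# Stand-in for the slow extraction step (hypothetical data).
sub expensive_recipient_scan {
    return { 'alice@example.com' => 3, 'bob@example.com' => 1 };
}

# Return cached data if the cache file exists, otherwise build and save it.
sub load_or_build {
    my ($file, $builder) = @_;
    my $path = File::Spec->rel2abs($file);

    if (-e $path) {
        # The dumped file ends with "$VAR1 = {...};", so do() returns the data.
        my $data = do $path;
        return $data if ref $data;
        warn "Could not reload $path, rebuilding\n";
    }

    my $data = $builder->();
    open my $fh, '>', $path or die "Cannot write $path: $!";
    print {$fh} Dumper($data);
    close $fh or die "Cannot close $path: $!";
    return $data;
}

my $recipients = load_or_build($cache_file, \&expensive_recipient_scan);
print "$_ received $recipients->{$_} message(s)\n" for sort keys %$recipients;
```

Delete the cache file whenever the underlying log changes, or you'll be analyzing stale data. Only reload cache files you wrote yourself, since do executes whatever Perl is in the file.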
At the end of the day, it may not be possible to provide more
information than a set of everyone that's received mail, and
a set of addresses that the mail was from. Keep in mind that this
is a problem that you might be able to get "close enough" with some effort,
but the "perfect answer" might not be possible, or practical.
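If you do end up settling for the "two sets" answer, a minimal sketch might look like the following. It assumes your log lines contain the raw SMTP commands (MAIL FROM:<...> and RCPT TO:<...>), which may not match your actual log format, so adjust the patterns to taste. The collect_addresses name and the sample data are invented for the example, and lowercasing whole addresses is a simplification.

```perl
use strict;
use warnings;

# Collect unique envelope senders and recipients from an SMTP log.
# Assumption: each relevant line contains the raw SMTP command text.
sub collect_addresses {
    my ($fh) = @_;
    my (%senders, %recipients);
    while (my $line = <$fh>) {
        if ($line =~ /MAIL\s+FROM:\s*<([^>]*)>/i) {
            $senders{lc $1}++;        # hash keys act as the set
        }
        elsif ($line =~ /RCPT\s+TO:\s*<([^>]*)>/i) {
            $recipients{lc $1}++;
        }
    }
    return (\%senders, \%recipients);
}

# Small in-memory sample standing in for the real log.
my $sample = join "\n",
    'MAIL FROM:<alice@example.com>',
    'RCPT TO:<bob@example.com>',
    'RCPT TO:<carol@example.com>',
    'MAIL FROM:<Alice@Example.com>',
    'RCPT TO:<bob@example.com>',
    '';
open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
my ($senders, $recipients) = collect_addresses($fh);
close $fh;

print "Senders:    ", join(', ', sort keys %$senders), "\n";
print "Recipients: ", join(', ', sort keys %$recipients), "\n";
```

The hash values double as message counts per address, which is handy for sanity-checking against the totals from your other scripts.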
Peace,
-McD