Re: Parsing MercuryMail logs

That, my friend, is one ugly-looking problem.

Without being able to provide a comprehensive fix, let me throw out some tidbits that have helped me in the past when tackling this kind of problem.

How big is your dataset?
If it's small, maybe you can examine your data by hand and come up with a pretty good feel for the subtleties of the dataset - like this overlap. If it's bigger, then you might wind up writing small throwaway or incremental programs just to analyze the nature of the data.
Some of the things you might find / some things to look for:
- Is there more than one mail message sent per connection? If so, timestamps may be off, since the example you gave shows them tied to connect/disconnects. A long-lived connection could span hours, and have multiple messages. Maybe not a problem, but something to be aware of.
- As pointed out in the Altivore project (an open-source version of the FBI's Carnivore mail snooper), SMTP connections aren't always as cut and dried as you've seen so far. Lost connections, RSET commands, and the like can botch up your parsing if you're not looking for them. In the worst case, you could wind up creating and trying to synchronize multiple state machines that understand SMTP - yech. (Although an interesting problem!)
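
To make the RSET / lost-connection point a bit more concrete, here's a rough sketch of tracking the SMTP envelope for a single connection. Big assumption: it takes bare SMTP command lines as input, which is almost certainly not what the MercuryMail log really looks like, so the regexes would need adapting. The sub name completed_messages and the example.com addresses are made up for illustration.

```perl
use strict;
use warnings;

# Walk the SMTP commands seen on ONE connection and return only the
# messages that were actually completed (MAIL FROM, RCPT TO, DATA).
# ASSUMPTION: input is bare SMTP command lines, not the real log layout.
sub completed_messages {
    my @lines = @_;
    my @done;
    my %msg;    # envelope being built: from => ..., to => [...]

    for my $line (@lines) {
        if ($line =~ /^MAIL FROM:\s*<([^>]*)>/i) {
            %msg = (from => $1, to => []);        # start a new envelope
        }
        elsif ($line =~ /^RCPT TO:\s*<([^>]*)>/i) {
            push @{ $msg{to} }, $1 if exists $msg{from};
        }
        elsif ($line =~ /^DATA\b/i) {
            # Only keep it if we saw a sender and at least one recipient
            push @done, { from => $msg{from}, to => [ @{ $msg{to} } ] }
                if exists $msg{from} && @{ $msg{to} };
            %msg = ();
        }
        elsif ($line =~ /^(?:RSET|QUIT)\b/i) {
            %msg = ();                            # envelope abandoned
        }
    }
    # Whatever is still in %msg was cut off mid-transaction and is dropped.
    return @done;
}

my @msgs = completed_messages(
    'MAIL FROM:<a@example.com>',
    'RCPT TO:<b@example.com>',
    'RSET',                          # first envelope thrown away
    'MAIL FROM:<c@example.com>',
    'RCPT TO:<d@example.com>',
    'DATA',                          # second envelope completes
    'MAIL FROM:<e@example.com>',     # connection lost after this
);
print scalar(@msgs), " completed message(s)\n";
```

With interleaved logs you'd need one of these envelopes per open connection, keyed on whatever identifies a connection in your log, which is where the "multiple state machines" pain comes in.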

Along those lines, how big is the overlap problem?

Write a small script to count how many "clean" log entries there are - sequential entries that cover an entire connection, or maybe just an entire message (MAIL FROM, RCPT TO, DATA). Keep counts of clean vs. interleaved log sequences - maybe it's an acceptable loss if you discard, say, the 10% that are interleaved? That depends on your specific requirements. Some code might look like:

# Use a stack to keep track of concurrent connections.
# MERLOG is assumed to be a filehandle already opened on the log.
my @connections;
my ($count_of_connections, $count_of_overlaps) = (0, 0);

while (<MERLOG>) {
  # Assume that seeing "Connection" before seeing "QUIT"
  # implies an interleaved connection record.
  if (/^Connection/) {
    ++$count_of_connections;
    push (@connections, $_);
    ++$count_of_overlaps if (scalar(@connections) > 1);
  }
  pop (@connections) if /^QUIT/;
}

print "$count_of_connections connections logged, $count_of_overlaps overlapped\n";
print "Leaving ", $count_of_connections - $count_of_overlaps, " clean connections logged.\n";

Assuming you've got a good-sized dataset, cache the data extracted by intensive queries. If it takes 20 minutes to assemble the list of all users that received mail, cache that result away somewhere, and you'll save 20 minutes (possibly 20 minutes times the N tries it takes to get something right) the next time you need that data - maybe to correlate it with senders extracted by another script. You'll have great luck using Data::Dumper for that; it's bundled with most modern versions of Perl.
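
Here's roughly what I mean by caching with Data::Dumper, as a minimal sketch. The sub name cached, the file name recipients.cache, and the little hash standing in for the 20-minute query are all invented for the example. It relies on the usual trick that Dumper output is valid Perl, so "do FILE" hands the structure back.

```perl
use strict;
use warnings;
use Data::Dumper;
use File::Spec;

# Return the data cached in $file if there is any; otherwise run the
# expensive $compute code ref, dump its result to $file, and return it.
sub cached {
    my ($file, $compute) = @_;
    # Use an absolute path so "do" reads this file and doesn't search @INC
    my $path = File::Spec->rel2abs($file);

    if (-e $path) {
        my $data = do $path;    # runs the dumped "$VAR1 = ...;" code
        return $data if defined $data;
    }

    my $data = $compute->();
    open my $out, '>', $path or die "Can't write $path: $!";
    print {$out} Dumper($data);
    close $out or die "Can't close $path: $!";
    return $data;
}

my $cache = 'recipients.cache';    # hypothetical cache file name
my $recipients = cached($cache, sub {
    # Stand-in for the slow pass over the logs
    return { 'alice@example.com' => 3, 'bob@example.com' => 1 };
});

print "$_ received $recipients->{$_} message(s)\n"
    for sort keys %$recipients;
```

The second time you run it, the code ref never gets called. Delete the cache file whenever the logs change. Since "do" executes whatever is in that file, only use this on cache files you wrote yourself.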

At the end of the day, it may not be possible to provide more information than a set of everyone that's received mail, and a set of addresses that the mail was from. Keep in mind that this is a problem that you might be able to get "close enough" with some effort, but the "perfect answer" might not be possible, or practical.
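
That fallback (one set of recipients, one set of senders) is just a pair of hashes. A quick sketch, again assuming the SMTP commands show up in the log lines in a form these regexes can match, which you'd need to check against the real MercuryMail format. The sub name collect_addresses and the addresses are made up.

```perl
use strict;
use warnings;

# Build two "sets" (hashes keyed by address, values are counts):
# everyone mail was from, and everyone mail was addressed to.
sub collect_addresses {
    my @lines = @_;
    my (%senders, %recipients);

    for my $line (@lines) {
        # Lowercasing the whole address is a simplification: strictly
        # speaking, only the domain part is case-insensitive.
        if    ($line =~ /MAIL FROM:\s*<([^>]+)>/i) { $senders{ lc $1 }++ }
        elsif ($line =~ /RCPT TO:\s*<([^>]+)>/i)   { $recipients{ lc $1 }++ }
    }
    return (\%senders, \%recipients);
}

my ($from, $to) = collect_addresses(
    'MAIL FROM:<alice@example.com>',
    'RCPT TO:<bob@example.com>',
    'RCPT TO:<carol@example.com>',
    'MAIL FROM:<Alice@Example.com>',
    'RCPT TO:<bob@example.com>',
);

printf "%d distinct sender(s), %d distinct recipient(s)\n",
    scalar(keys %$from), scalar(keys %$to);
```

Note this doesn't care about interleaving at all, which is exactly why it's the safe answer: it never has to decide which RCPT belongs to which MAIL.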

Peace,
-McD
