Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to do some auditing on an old MercuryMail server which we are attempting to phase out. What I'd like to do is generate a file containing all the accounts which have received mail over a given period, all the addresses which mailed them, and the dates, so that we can mail that information out to the users who still have active forwards on the server.

The problem is that the logger on Mercury is pretty terrible. There are no unique transaction numbers like in Sendmail and the SMTP traffic tends to overlap. The end result is a logfile which looks like this:

    Connection from 111.222.333.444, Mon Feb 12 08:53:27 2001
    EHLO mhub6.tc.umn.edu
    MAIL FROM:<user1@myotherhost.mydomain.net>
    RCPT TO:<user2@myhost.mydomain.net>
    DATA - 1216 lines, 83052 bytes.
    Connection from 64.4.15.226, Mon Feb 12 08:53:34 2001
    EHLO hotmail.com
    MAIL FROM:<paica@hotmail.com>
    RCPT TO:<user4@myhost.mydomain.net>
    DATA - 29 lines, 1215 bytes.
    QUIT - 1 sec. elapsed, connection closed Mon Feb 12 08:53:36 2001
    QUIT - 19 sec. elapsed, connection closed Mon Feb 12 08:53:46 2001
    Connection from 208.1.71.165, Mon Feb 12 08:54:01 2001
    Connection from 208.1.71.165, Mon Feb 12 08:54:01 2001
    Connection from 208.1.71.165, Mon Feb 12 08:54:01 2001
    EHLO ancmail02.ancestry.com
    MAIL FROM:<Reminder@MyFamilyInc.com>
    MAIL FROM:<Reminder@MyFamilyInc.com>
    MAIL FROM:<Reminder@MyFamilyInc.com>
    RCPT TO:<user3@myhost.mydomain.net>
    RCPT TO:<user3@myhost.mydomain.net>
    DATA - 249 lines, 14229 bytes.
    DATA - 217 lines, 12388 bytes.
    DATA - 217 lines, 12403 bytes.
    QUIT - 2 sec. elapsed, connection closed Mon Feb 12 08:54:03 2001
    QUIT - 2 sec. elapsed, connection closed Mon Feb 12 08:54:03 2001
    QUIT - 2 sec. elapsed, connection closed Mon Feb 12 08:54:03 2001

Does anyone have suggestions on how to tackle something like this? I tried splitting the log into records with an unless /MAIL From/ test, but that doesn't solve the problem of overlap. This is what I have been playing with:

    sub ParseIt {
        my $text = '';
        open (MERLOG,$log) || die "Unable to open $log:$!\n";
        while (<MERLOG>) {
            unless (/MAIL From.*?\n/smi) {
                # Continuation of record in progress
                $text .= $_;
            } else {
                # Found a new record
                examine_record($text);
                # $text = $_;
            }
        }
        # caveat for the last record because it will not be followed by a new record
        examine_record($text);
    }

I'm a newbie, be gentle.

Edit: 2001-03-05 chipmunk

Replies are listed 'Best First'.
Re: Parsing MercuryMail logs
by McD (Chaplain) on Mar 06, 2001 at 01:30 UTC
    That, my friend, is one ugly looking problem.

    Without being able to provide a comprehensive fix, let me throw out some tidbits that have helped me in the past when tackling this kind of problem.

    • How big is your dataset?

      If it's small, maybe you can examine your data by hand and come up with a pretty good feel for the subtleties of the dataset - like this overlap. If it's bigger, then you might wind up writing small throwaway or incremental programs just to analyze the nature of the data.

      Some of the things you might find / some things to look for:

      • Is there more than one mail message sent per connection? If so, timestamps may be off, since the example you gave shows them tied to connect/disconnects. A long-lived connection could span hours, and have multiple messages. Maybe not a problem, but something to be aware of.
      • As pointed out in the Altivore project (an open-source version of the FBI's Carnivore mail snooper), SMTP connections aren't always as cut and dried as you've seen so far. Lost connections, RSET commands, and the like can botch up your parsing if you're not looking for them. In the worst case, you could wind up creating and trying to synchronize multiple state machines that understand SMTP - yech. (Although an interesting problem!)

    • Along those lines, how big is the overlap problem?

      Write a small script to count how many "clean" log entries there are - sequential entries that cover an entire connection, or maybe just an entire message (MAIL from, RCPT to, DATA). Keep counts of clean vs. interleaved log sequences - maybe there's an acceptable loss if you discard the 10% which are interleaved? Depends on your specific requirements. Some code might look like:

      # Use a stack to keep track of concurrent connections
      my @connections;
      my ($count_of_connections, $count_of_overlaps) = (0, 0);
      while (<MERLOG>) {
          # Assume that seeing "Connection" before seeing "QUIT"
          # implies an interleaved connection record.
          if (/^Connection/) {
              ++$count_of_connections;
              push (@connections, $_);
              ++$count_of_overlaps if (scalar(@connections) > 1);
          }
          pop (@connections) if /^QUIT/;
      }
      print "$count_of_connections connections logged, $count_of_overlaps overlapped\n";
      print "Leaving ", $count_of_connections - $count_of_overlaps, " clean connections logged.\n";

    • Assuming you've got a good sized dataset, cache data extracted from intensive queries. If it takes 20 minutes to assemble the list of all users that received mail, cache that result away somewhere, and you'll save 20 minutes (possibly 20 minutes times the N tries it takes to get something right) the next time you need that data - maybe to correlate it with senders extracted by another script. You'll have great luck using Data::Dumper for that, bundled with most modern versions of Perl.
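
A minimal sketch of the caching idea above, under stated assumptions: the %received hash stands in for whatever an expensive pass over the log would produce, and the file name received.cache and the variable name $cached are hypothetical choices, not anything from the original post. It writes the structure out with Data::Dumper and reads it back with do.

```perl
use strict;
use warnings;
use Data::Dumper;

# Stand-in for the result of a slow pass over the log
# (recipient => list of senders). Illustrative data only.
my %received = (
    'user2@myhost.mydomain.net' => ['user1@myotherhost.mydomain.net'],
    'user4@myhost.mydomain.net' => ['paica@hotmail.com'],
);

my $cache_file = 'received.cache';    # hypothetical file name

# Save: Data::Dumper emits Perl source of the form "$cached = { ... };"
{
    local $Data::Dumper::Indent   = 1;
    local $Data::Dumper::Sortkeys = 1;
    open(my $out, '>', $cache_file)
        or die "Unable to write $cache_file: $!\n";
    print {$out} Data::Dumper->Dump([\%received], ['cached']);
    close($out) or die "Unable to close $cache_file: $!\n";
}

# Load: "do" runs the file and returns the value of its last statement,
# which is the hash reference assigned to $cached. The explicit "./"
# keeps do from searching @INC for the file.
my $restored = do "./$cache_file"
    or die "Unable to load $cache_file: " . ($@ || $!) . "\n";

print "$_: @{ $restored->{$_} }\n" for sort keys %$restored;
```

Because the cache file is executed as Perl when it is loaded, this only makes sense for files you wrote yourself; a later script that wants to correlate senders with recipients can load the same file instead of re-parsing the log.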

    At the end of the day, it may not be possible to provide more information than a set of everyone that's received mail, and a set of addresses that the mail was from. Keep in mind that this is a problem that you might be able to get "close enough" with some effort, but the "perfect answer" might not be possible, or practical.
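
As a rough sketch of that fallback, the following collects the two sets independently, so the interleaving does not matter. It assumes every MAIL FROM and RCPT TO line has the shape shown in the question's excerpt; the @log_lines array is sample data copied from that excerpt, and lower-casing the addresses is an illustrative choice rather than a requirement.

```perl
use strict;
use warnings;

# Sample lines copied from the log excerpt in the question.
my @log_lines = (
    'Connection from 64.4.15.226, Mon Feb 12 08:53:34 2001',
    'EHLO hotmail.com',
    'MAIL FROM:<paica@hotmail.com>',
    'RCPT TO:<user4@myhost.mydomain.net>',
    'DATA - 29 lines, 1215 bytes.',
    'MAIL FROM:<Reminder@MyFamilyInc.com>',
    'MAIL FROM:<Reminder@MyFamilyInc.com>',
    'RCPT TO:<user3@myhost.mydomain.net>',
    'RCPT TO:<user3@myhost.mydomain.net>',
);

# Count each address; the hash keys form the "set".
my (%senders, %recipients);
for my $line (@log_lines) {
    if ($line =~ /^MAIL\s+FROM:\s*<([^>]*)>/i) {
        $senders{ lc $1 }++;
    }
    elsif ($line =~ /^RCPT\s+TO:\s*<([^>]*)>/i) {
        $recipients{ lc $1 }++;
    }
}

print "Senders:\n";
print "  $_ ($senders{$_})\n" for sort keys %senders;
print "Recipients:\n";
print "  $_ ($recipients{$_})\n" for sort keys %recipients;
```

This deliberately makes no attempt to say which sender mailed which recipient; it only answers "who received mail" and "who sent mail" over the period covered by the log.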

    Peace,
    -McD

Re: Parsing MercuryMail logs
by Tuna (Friar) on Mar 05, 2001 at 23:06 UTC
    I don't have time right now to address all of the specs of your project, but here's a start. This will, at least, demonstrate one way to match the text that you're trying to match in your posted code.
    #!/usr/local/bin/perl -w
    use strict;

    #my $log = /path/to/your/log/file(s);
    #you will need to edit and uncomment the above line
    my @log_list;
    my $line;
    #open (LOG, "$log") || die "Can't open $log:$!\n";
    #uncomment the above line, if you actually use this code :-)
    while ($line = <DATA>) {
        #change DATA to LOG or whatever filehandle you use, too!
        chomp $line;
        next unless ($line =~ /MAIL\s+FROM/);
        push (@log_list, $line);
    }
    #you will probably want to replace "next unless" with "if elsif"
    # in order to get the different matches that you require.
    foreach my $string(@log_list) {
        print "My string is $string\n";
        #this print statement is simply to demonstrate the matching text.
        # obviously, you will define what YOU want to do with those matches.
    }
    __DATA__
    Connection from 208.1.71.165, Mon Feb 12 08:54:01 2001
    Connection from 208.1.71.165, Mon Feb 12 08:54:01 2001
    Connection from 208.1.71.165, Mon Feb 12 08:54:01 2001
    EHLO ancmail02.ancestry.com
    MAIL FROM:<Reminder@MyFamilyInc.com>
    MAIL FROM:<Reminder@MyFamilyInc.com>
    MAIL FROM:<Reminder@MyFamilyInc.com>
    RCPT TO:<user3@myhost.mydomain.net>
    RCPT TO:<user3@myhost.mydomain.net>
    DATA - 249 lines, 14229 bytes.
    DATA - 217 lines, 12388 bytes.
    DATA - 217 lines, 12403 bytes.
    __END__

Re: Parsing MercuryMail logs
by mjn (Acolyte) on Mar 06, 2001 at 01:15 UTC
    This was my question; I had just forgotten to log in before posting it.

    Tuna- I am interested in your suggestion on how to handle this situation but isn't what you propose just a shorter way of doing what I already have?

    What I had thought to do was make a duplicate of the original and wade through it, looking for pieces in order and removing them from the copy as I find them... so, like, say, write multiple, nested, recursive searches...

    Does that seem reasonable? I think it might work well but the execution is a bit imposing... -mjn
      You have a project that is more involved than it seems. I suggest that you sit down and try to pseudo-code it first.

      "Tuna- I am interested in your suggestion on how to handle this situation but isn't what you propose just a shorter way of doing what I already have? "

      No, because your regex doesn't work.