Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to recover emails from a corrupt Entourage database, and I know what I have to do to get them from the db file, but can't quite wrap my head around how to do it.

The format of the database file is mostly binary data (I don't care about), with messages (I do care about) interspersed. Each message starts off with a sequence of two or more null chars followed by the string 'MSrc'. This is followed by 16 bytes of more binary data, which is then followed by the mail headers and finally the mail message itself, none of which contains any null chars. After all that is two or more nulls followed by either another message, or more binary data I don't care about.

Now, if I had all the data as one massive string, I could brute force through it with a regex like the following, and just write each recovered message to an individual text file.

m/ \0{2} # two leading nulls MSrc # signal of a new message .{16} # 16 bytes of binary data ([^\0]+?) # what i want, mail headers and message /msx
Trouble is, the database file itself is > 12GB, so I need to read it in and process it in much smaller pieces. Obviously, this would require the use of read and a while loop, but how do I ensure that I don't skip any messages or cut them off midway in between reads?

I thought about, for example, reading in N bytes and processing, then seeking back N/2 bytes and reading N bytes again, etcetera. But in that case, how do I ensure I don't save the same messages twice? I can't quite wrap my head about the logic that I'd need.

Replies are listed 'Best First'.
Re: Parsing 12GB Entourage database in pieces...
by ikegami (Patriarch) on Aug 28, 2008 at 19:13 UTC

    Use a buffer.

    [how do I ensure that I don't messages] off midway in between reads?

    When you've found a message, add to the buffer until you find the end of message.

    how do I ensure I don't save the same messages twice?

    Remove them from the buffer as you process them.

    how do I ensure that I don't skip any messages

    I don't see how that could happen.

    Untested solution:

    my $BLOCK_SIZE = 64 * 1024; # Until EOF, for (;;) { # Read a chunk. my $rv = read($fh, my $buf='', $BLOCK_SIZE); die if !defined($rv); # I/O error. last if !$rv; # EOF # For each message in the buffer, for (;;) { my $pos = index($buf, "\0\0Msrc"); last if $pos < 0; # Discard the uninteresting bits. substr($buf, 0, $pos, ''); # Read the header. while (length($buf) < 22) { my $rv = read($fh, $buf, $BLOCK_SIZE, length($buf)); die if !defined($rv); # I/O error. die if !$rv; # Incomplete message } # Discard the header. substr($buf, 0, 22, ''); # Locate the end of the message. my $pos = 0; for (;;) { $pos = index($buf, "\0", $pos); last if $pos >= 0; $pos = length($buf); # Read a chunk. my $rv = read($fh, $buf, $BLOCK_SIZE, length($buf)); die if !defined($rv); # I/O error. die if !$rv; # Incomplete message } # Extract the message. my $msg = substr($buf, 0, $pos, ''); # Process the message. do_something($msg); } }

    Others ([1] [2] [3] [4]) have assumed the amount of data between messages is rather short. My solution makes no such assumption.

    Update: Added answers to direct questions.
    Update: Added note concerning assumptions.

Re: Parsing 12GB Entourage database in pieces...
by moritz (Cardinal) on Aug 28, 2008 at 19:05 UTC
    I think the trick is to only ever append at the end of the string, and to destroy the string when you found something:
    my $str = ''; my $tmp; while (read($handle, $tmp, 32*1024)) { $str .= $tmp; while ($str =~ m/($re)/g) { # have we read up to the end? if (pos($str) == length($str) - 1) { # yes - discard our attempt last; } my $data = $1; # do something with $data here # and now destruct our pour string: $str = substr($str, pos($str)); # and start from the beginning again pos($str) = 0; } }

    This code is untested, but it should give you an idea of how to do it.

      my $str = ''; my $tmp; while (read($handle, $tmp, 32*1024)) { $str .= $tmp;

      can be written as

      my $str = ''; while (read($handle, $str, 32*1024, length($str))) {

      Also,

      pos($str) = 0;

      is already done by

      $str = substr($str, pos($str));

      (Well, pos($str) = undef; is.)

Re: Parsing 12GB Entourage database in pieces...
by tilly (Archbishop) on Aug 28, 2008 at 21:31 UTC
    This seems like exactly the kind of problem that File::Stream is designed for.
Re: Parsing 12GB Entourage database in pieces...
by kyle (Abbot) on Aug 28, 2008 at 19:13 UTC

    Since your data of interest never has nulls in it but always has nulls between them, I'd say set $/ to that and go to it.

    $/ = "\0\0"; while (<>) { chomp; if ( m{ \A MSrc .{16} (.+) \z }xms ) { # message in $1 } }

    Does that work?

      Since your data of interest never has nulls in it

      remember that the 16 bytes of binary data could contain "\0\0", and every time this actually happens one looses that message with your method.

        Hmmm, good point. Maybe...

        $/ = "\0\0"; while (<>) { $_ .= <> while ( ! eof && length() < length("MSrc$/")+16 ); # if ... }
Re: Parsing 12GB Entourage database in pieces...
by betterworld (Curate) on Aug 28, 2008 at 19:07 UTC

    Maybe you could set $/ to "\0\0MSrc" and read the file record by record. You just have to hope that this string does not occur in "16 bytes of binary data" anywhere.

      If there's 1GB of data between two messages, that will attempt to read that 1GB into memory.

        I could still be wrong, but I have a hard time believing that all that is really necessary. I wrote but didn't test this:

        my $msg_marker = "\0\0MSrc"; my $tiny_read = 16 + length( $msg_marker ); while ( ! eof ) { $/ = \$tiny_read; $_ = <>; # marker in here? while ( -1 == index $_, $msg_marker and ! eof ) { # chop the beginning if still no marker $_ = substr $_, $tiny_read if length > $tiny_read; $_ .= <>; } $_ .= <>; # make sure to get those 16 bytes $/ = "\0\0"; $_ .= <>; # read to the end of the message message_in_here( $_ ); }

        The down side is that I'm reading 12G in 22 byte increments (except during messages). That might be too slow. On the other hand, it's short and fairly comprehensible (especially if you give names to things I didn't).

Re: Parsing 12GB Entourage database in pieces...
by betterworld (Curate) on Aug 28, 2008 at 19:52 UTC

    There might be a completely different solution. Using Sys::Mmap, you can map the entire file to a single string, and then brute-force through that string with something like while ($string =~ m/.../g) {...} (like you hinted in your post).

    However, there are some caveats:

    • Not all operating systems support mmap;
    • you'd need a 64 bit system to fit those 12GB into your program's address space;
    • I don't know how regular expressions perform with such a huge string. I've just tried it on a mmapped 200MB file (most of which consisted of \0) and it did quite well, but that's much smaller than your file.

      I don't know how regular expressions perform with such a huge string.

      Some uses of "*" are equivalent to "{0,32767}", so you might have problems.

      >perl -Mre=debug -we"qr/^(.)(\1*)\z/" ... 9: CURLYX[1] {0,32767}(14) ...

      Be sure to prevent backtracking using (?>...) or (in 5.10.0+) the possessive quantifier.

      Update: "/\0\0MSrc.{16}((?>[^\0]*))(?=\0)/s" looks safe.

        Interesting...

        perl5.8.8 -wle '$s = "x" x 40_000; $s =~ /^(.)(\1*)/ and print length $2' # (segfaults) perl5.10.0 -wle '$s = "x" x 40_000; $s =~ /^(.)(\1*)/ and print length $2' Complex regular subexpression recursion limit (32766) exceeded at -e l +ine 1. 32767

        Well, at least it seems to warn when this limitation affects the result...