I could still be wrong, but I have a hard time believing that all that is really necessary. I wrote but didn't test this:

    my $msg_marker = "\0\0MSrc";
    my $tiny_read  = 16 + length( $msg_marker );

    while ( ! eof ) {
        $/ = \$tiny_read;
        $_ = <>;

        # marker in here?
        while ( -1 == index $_, $msg_marker and ! eof ) {
            # chop the beginning if still no marker
            $_ = substr $_, $tiny_read if length > $tiny_read;
            $_ .= <>;
        }

        $_ .= <>;    # make sure to get those 16 bytes
        $/ = "\0\0";
        $_ .= <>;    # read to the end of the message
        message_in_here( $_ );
    }

The downside is that I'm reading 12GB in 22-byte increments (except during messages), which works out to more than half a billion reads. That might be too slow. On the other hand, it's short and fairly comprehensible (especially if you give names to things I didn't).

Re^4: Parsing 12GB Entourage database in pieces...
by ikegami (Patriarch) on Aug 29, 2008 at 00:15 UTC

    Ignoring the error checking you didn't do, my solution is not that much longer than yours (15 lines to 12), and I'm sure I could lose three lines by combining statements as you did. It could probably be shortened further by combining the search for the marker with the read of the 16 bytes, but I kept it straightforward.

    Two differences:

    • I didn't assume the trailing "\0\0" couldn't be the start of another message.
    • I work on much bigger blocks, which *could* be much faster.
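For readers who want to see what scanning in bigger blocks can look like, here is a minimal sketch. It is not ikegami's actual code, which is not shown in this post. The helper name `marker_offsets` and its parameters are hypothetical, and it only reports where each marker starts; extracting the messages is left out. The one subtlety is that a marker can straddle two blocks, so the last `length($marker) - 1` bytes of the buffer are carried over to the next read.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the byte offset of every
# occurrence of $marker in $fh, reading $block_size bytes at a time.
sub marker_offsets {
    my ( $fh, $marker, $block_size ) = @_;

    my $overlap = length($marker) - 1;    # longest partial marker a buffer can end with
    my $buf     = '';
    my $base    = 0;                      # file offset of the first byte in $buf
    my @offsets;

    while ( read( $fh, my $chunk, $block_size ) ) {
        $buf .= $chunk;

        # Report every marker that is completely inside the buffer.
        my $pos = 0;
        while ( ( my $hit = index( $buf, $marker, $pos ) ) >= 0 ) {
            push @offsets, $base + $hit;
            $pos = $hit + 1;
        }

        # Keep only the tail that could still be the start of a marker.
        if ( length($buf) > $overlap ) {
            my $keep_from = length($buf) - $overlap;
            $base += $keep_from;
            $buf = substr $buf, $keep_from;
        }
    }
    return @offsets;
}

# Usage with an in-memory file; a real run would open the database file
# and pass a block size such as 1024 * 1024.
my $sample = "xx\0\0MSrc" . ( "y" x 10 ) . "\0\0MSrczz";
open my $in, '<', \$sample or die "cannot open in-memory file: $!";
binmode $in;
print join( ' ', marker_offsets( $in, "\0\0MSrc", 4 ) ), "\n";
close $in;
```

Because the carried-over tail is one byte shorter than the marker, a marker that was already reported can never fit inside it, so nothing is counted twice. Whether this is actually faster than the 22-byte reads would need to be measured on the real file.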

    By the way, couldn't

        $_ = substr $_, $tiny_read if length > $tiny_read;

    be written as

        $_ = substr $_, -$tiny_read;

    Then you could combine it with the following statement:

        $_ = substr( $_, -$tiny_read ) . <>;
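
A small sketch to check that suggestion, assuming `$tiny_read` is 22 as in the parent post. The two sub names are made up for illustration. In the parent's loop the buffer appears to be either 22 or 44 bytes long when the chop runs, and for those lengths both forms keep the same bytes.

```perl
use strict;
use warnings;

my $tiny_read = 22;    # 16 + length("\0\0MSrc"), as in the parent post

# Original form: drop the first $tiny_read bytes only when the buffer is longer.
sub keep_tail_conditional {
    my ($buf) = @_;
    $buf = substr $buf, $tiny_read if length($buf) > $tiny_read;
    return $buf;
}

# Suggested form: always keep the last $tiny_read bytes.
sub keep_tail_negative {
    my ($buf) = @_;
    return substr $buf, -$tiny_read;
}

# Compare how many bytes each form keeps for a few buffer lengths.
for my $len ( 22, 27, 44 ) {
    my $buf = join '', map { chr( 65 + $_ % 26 ) } 0 .. $len - 1;
    printf "%2d bytes: conditional keeps %2d, negative keeps %2d\n",
        $len,
        length( keep_tail_conditional($buf) ),
        length( keep_tail_negative($buf) );
}
```

The two forms agree for 22 and 44 bytes. They differ for in-between lengths: a 27-byte buffer is cut to 5 bytes by the conditional form but to 22 bytes by the negative-offset form. That case should only arise after a short final read, when the loop is about to exit anyway.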