in reply to Parsing 12GB Entourage database in pieces...

Maybe you could set $/ to "\0\0MSrc" and read the file record by record. You just have to hope that this string does not occur in "16 bytes of binary data" anywhere.
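A minimal sketch of that idea, untested against a real Entourage database. The helper name `each_record` and the callback interface are made up for illustration; only the marker string comes from the thread. Each record handed to the callback is a message plus whatever bytes precede the next marker, and anything before the first marker is skipped.

```perl
use strict;
use warnings;

# Hypothetical helper: calls $callback once per marker-delimited record.
# Assumes every message starts with "\0\0MSrc" (the marker from the thread).
sub each_record {
    my ( $path, $callback ) = @_;
    my $marker = "\0\0MSrc";

    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/ = $marker;    # records now end at the marker instead of "\n"

    my $seen = 0;
    while ( defined( my $record = <$fh> ) ) {
        chomp $record;            # chomp strips the current $/, i.e. the marker
        next if $seen++ == 0;     # bytes before the first marker are not a message
        $callback->( $marker . $record );
    }
    close $fh;
    return;
}
```

A call might look like `each_record( 'Database', sub { message_in_here( $_[0] ) } );`, where the file name is a placeholder. Note that `<$fh>` pulls everything up to the next marker into memory at once, however far away that marker is.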

Re^2: Parsing 12GB Entourage database in pieces...
by ikegami (Patriarch) on Aug 28, 2008 at 19:18 UTC
    If there's 1GB of data between two messages, that will attempt to read that 1GB into memory.

      I could still be wrong, but I have a hard time believing that all that is really necessary. I wrote but didn't test this:

      my $msg_marker = "\0\0MSrc";
      my $tiny_read = 16 + length( $msg_marker );

      while ( ! eof ) {
          $/ = \$tiny_read;
          $_ = <>;

          # marker in here?
          while ( -1 == index $_, $msg_marker and ! eof ) {
              # chop the beginning if still no marker
              $_ = substr $_, $tiny_read if length > $tiny_read;
              $_ .= <>;
          }

          $_ .= <>; # make sure to get those 16 bytes
          $/ = "\0\0";
          $_ .= <>; # read to the end of the message

          message_in_here( $_ );
      }

      The downside is that I'm reading 12GB in 22-byte increments (except during messages). That might be too slow. On the other hand, it's short and fairly comprehensible (especially if you give names to things I didn't).

        Ignoring the error checking you didn't do, my solution is not that much longer than yours (15 lines to 12), and I'm sure I could lose three lines by combining statements as you did. I think it could be shortened further by combining the search for the marker with the read of the 16 bytes, but I kept it straightforward.

        Two differences:

        • I didn't assume the trailing "\0\0" couldn't be the start of another message.
        • I work on much bigger blocks, which *could* be much faster.
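        The block-based solution referred to above is not shown in this thread, so the following is only a sketch of the general technique, not that code. It reads large blocks and carries the last `length($marker) - 1` bytes over to the next read, so a marker split across a block boundary is still found. The helper name `marker_offsets` and the 1 MiB default block size are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the byte offset of every $marker in $path.
# Memory use stays near $block_size regardless of the distance between markers.
sub marker_offsets {
    my ( $path, $marker, $block_size ) = @_;
    die "Empty marker" unless length $marker;
    $block_size ||= 1024 * 1024;             # arbitrary default, 1 MiB
    my $overlap = length($marker) - 1;       # longest partial marker at a block end

    open my $fh, '<:raw', $path or die "Cannot open $path: $!";

    my @offsets;
    my $buf      = '';
    my $buf_base = 0;                        # file offset of the first byte in $buf

    while (1) {
        # append the next block after whatever tail was carried over
        my $got = read( $fh, $buf, $block_size, length $buf );
        die "Read error on $path: $!" unless defined $got;
        last if $got == 0;

        my $pos = 0;
        while ( ( my $hit = index $buf, $marker, $pos ) >= 0 ) {
            push @offsets, $buf_base + $hit;
            $pos = $hit + 1;
        }

        # keep only the tail that could still be the start of a marker;
        # it is shorter than the marker, so no hit is reported twice
        if ( length($buf) > $overlap ) {
            $buf_base += length($buf) - $overlap;
            $buf = $overlap ? substr( $buf, -$overlap ) : '';
        }
    }
    close $fh;
    return @offsets;
}
```

        With the offsets in hand, each message could be fetched with `seek` and `read`. Whether this is actually faster than 22-byte reads on a 12GB file would need measuring.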

        By the way, couldn't
        $_ = substr $_, $tiny_read if length > $tiny_read;
        be written as the following?
        $_ = substr $_, -$tiny_read;
        Then you could combine it with the statement that follows it:
        $_ = substr( $_, -$tiny_read ) . <>;
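        A quick way to check that claim is to wrap both forms in subs and compare them. The sub names are made up for illustration, and `$tiny_read` is fixed at 22 as in the code under discussion. As far as I can tell, the two forms agree when the buffer holds 22 or 44 bytes, which is what full reads inside that loop produce. For lengths in between they differ: the negative offset keeps the last 22 bytes, while the conditional form keeps only what is left after dropping the first 22.

```perl
use strict;
use warnings;

my $tiny_read = 22;    # 16 + length("\0\0MSrc")

# Original form: drop the first $tiny_read bytes, but only from longer strings.
sub chop_conditional {
    my ($s) = @_;
    $s = substr $s, $tiny_read if length($s) > $tiny_read;
    return $s;
}

# Proposed form: always keep (at most) the last $tiny_read bytes.
sub chop_negative {
    my ($s) = @_;
    return substr $s, -$tiny_read;
}

for my $len ( 22, 30, 44 ) {
    my $s    = join '', map { chr( 65 + $_ % 26 ) } 0 .. $len - 1;
    my $same = chop_conditional($s) eq chop_negative($s) ? 'same' : 'differ';
    print "length $len: $same\n";
}
```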