Re^3: Parsing 12GB Entourage database in pieces...

I could still be wrong, but I have a hard time believing that all that is really necessary. I wrote but didn't test this:

my $msg_marker = "\0\0MSrc";
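my $tiny_read = 16 + length( $msg_marker );   # 22 bytes per read
while ( ! eof() ) {      # eof() with parens checks the <> input; a bare eof before any read is true
    $/ = \$tiny_read;    # read fixed-size records of $tiny_read bytes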
    $_ = <>;     # marker in here?
    while ( -1 == index $_, $msg_marker and ! eof ) {
        # chop the beginning if still no marker
        $_ = substr $_, $tiny_read if length > $tiny_read;
        $_ .= <>;
    }
    $_ .= <>;        # make sure to get those 16 bytes
    $/ = "\0\0";
    $_ .= <>;        # read to the end of the message

    message_in_here( $_ );
}

The downside is that I'm reading 12 GB in 22-byte increments (except while inside a message). That might be too slow. On the other hand, it's short and fairly comprehensible (especially if you give names to the things I didn't).
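
For anyone unfamiliar with the trick used above: setting $/ to a reference to an integer makes <> return fixed-size records instead of lines. Here is a minimal, self-contained sketch of just that mechanism. The in-memory filehandle and the record size of 4 are assumptions chosen to keep the demo small; they are not part of the code above.

```perl
use strict;
use warnings;

# Stand-in for the real database file: ten bytes held in memory.
my $data = "abcdefghij";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @chunks;
{
    # A reference to an integer switches <> to fixed-size record reads.
    local $/ = \4;
    while ( defined( my $chunk = <$fh> ) ) {
        push @chunks, $chunk;    # the last record may be shorter
    }
}
close $fh;

print join( '|', @chunks ), "\n";    # prints: abcd|efgh|ij
```

Each read costs a pass through the readline machinery, which is why a 22-byte record size over 12 GB adds up to a very large number of reads.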


Replies are listed 'Best First'.

Re^4: Parsing 12GB Entourage database in pieces...
by ikegami (Patriarch) on Aug 29, 2008 at 00:15 UTC

Ignoring the error checking you didn't do, my solution is not that much longer than yours (15 lines to 12), and I'm sure I could lose three lines by combining statements as you did. I think it could be shortened by combining the search for the marker with the read of the 16 bytes, but I kept it straightforward.

Two differences:

I didn't assume the trailing "\0\0" couldn't be the start of another message.
I work on much bigger blocks, which *could* be much faster.
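
To make the "bigger blocks" point concrete, here is a rough sketch of scanning for a marker in large reads while carrying a small tail over between blocks, so that a marker straddling a block boundary is still found. This is not the solution referred to above (that code isn't shown in this node); find_marker_offsets and its parameters are hypothetical names made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the absolute byte offset of every
# occurrence of $marker in $fh, reading $block_size bytes at a time.
sub find_marker_offsets {
    my ( $fh, $marker, $block_size ) = @_;
    my $keep = length($marker) - 1;    # tail bytes carried to the next block
    my $buf  = '';
    my $base = 0;                      # file offset of the first byte in $buf
    my @offsets;

    while ( read( $fh, my $block, $block_size ) ) {
        $buf .= $block;

        # Collect every hit in the current buffer.
        my $pos = 0;
        while ( ( my $hit = index( $buf, $marker, $pos ) ) >= 0 ) {
            push @offsets, $base + $hit;
            $pos = $hit + 1;
        }

        # Keep only the tail that could still begin a marker.
        if ( length($buf) > $keep ) {
            $base += length($buf) - $keep;
            $buf = substr $buf, length($buf) - $keep;
        }
    }
    return @offsets;
}

# Demo on in-memory data with a deliberately tiny block size,
# so both markers straddle a block boundary.
my $sample = "xx\0\0MSrcyyyy\0\0MSrczz";
open my $in, '<', \$sample or die "Cannot open in-memory file: $!";
print join( ' ', find_marker_offsets( $in, "\0\0MSrc", 4 ) ), "\n";   # prints: 2 12
close $in;
```

With a real file you would presumably pass a block size in the megabyte range and call binmode on the handle first. Because the carried tail is one byte shorter than the marker, no marker can be reported twice.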

By the way, couldn't
$_ = substr $_, $tiny_read if length > $tiny_read;
be written as
$_ = substr $_, -$tiny_read;
Then you could combine it with the following statement
$_ = substr( $_, -$tiny_read ) . <>;
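
A quick way to check that suggestion: as far as I can tell, the buffer in that inner loop is always either $tiny_read or 2 * $tiny_read bytes long when the chop runs, and for both lengths the two forms keep the same bytes. The sketch below compares them; chop_conditional and chop_negative are hypothetical names used only for this demo.

```perl
use strict;
use warnings;

# Original form: drop the first $n bytes, but only if the buffer is longer than $n.
sub chop_conditional {
    my ( $buf, $n ) = @_;
    $buf = substr $buf, $n if length($buf) > $n;
    return $buf;
}

# Suggested form: always keep (at most) the last $n bytes.
sub chop_negative {
    my ( $buf, $n ) = @_;
    return substr $buf, -$n;
}

my $tiny_read = 22;
for my $len ( $tiny_read, 2 * $tiny_read ) {
    # Build a test buffer of the given length from repeating letters.
    my $buf  = join '', map { chr( 65 + $_ % 26 ) } 0 .. $len - 1;
    my $same = chop_conditional( $buf, $tiny_read ) eq chop_negative( $buf, $tiny_read );
    print "length $len: ", ( $same ? "same" : "different" ), "\n";
}
```

For buffer lengths strictly between the two, the forms differ: the conditional version keeps fewer than $tiny_read bytes, while the negative offset keeps the full trailing $tiny_read bytes.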
