Parsing 12GB Entourage database in pieces...

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Parsing 12GB Entourage database in pieces... by ikegami (Patriarch) on Aug 28, 2008 at 19:13 UTC
Use a buffer. [how do I ensure that I don't messages] off midway in between reads? When you've found a message, add to the buffer until you find the end of message. how do I ensure I don't save the same messages twice? Remove them from the buffer as you process them. how do I ensure that I don't skip any messages I don't see how that could happen. Untested solution: my $BLOCK_SIZE = 64 * 1024; # Until EOF, for (;;) { # Read a chunk. my $rv = read($fh, my $buf='', $BLOCK_SIZE); die if !defined($rv); # I/O error. last if !$rv; # EOF # For each message in the buffer, for (;;) { my $pos = index($buf, "\0\0Msrc"); last if $pos < 0; # Discard the uninteresting bits. substr($buf, 0, $pos, ''); # Read the header. while (length($buf) < 22) { my $rv = read($fh, $buf, $BLOCK_SIZE, length($buf)); die if !defined($rv); # I/O error. die if !$rv; # Incomplete message } # Discard the header. substr($buf, 0, 22, ''); # Locate the end of the message. my $pos = 0; for (;;) { $pos = index($buf, "\0", $pos); last if $pos >= 0; $pos = length($buf); # Read a chunk. my $rv = read($fh, $buf, $BLOCK_SIZE, length($buf)); die if !defined($rv); # I/O error. die if !$rv; # Incomplete message } # Extract the message. my $msg = substr($buf, 0, $pos, ''); # Process the message. do_something($msg); } } [download] Others ([1] [2] [3] [4]) have assumed the amount of data between messages is rather short. My solution makes no such assumption. Update: Added answers to direct questions. Update: Added note concerning assumptions.	[reply] [d/l]
Re: Parsing 12GB Entourage database in pieces... by moritz (Cardinal) on Aug 28, 2008 at 19:05 UTC
I think the trick is to only ever append at the end of the string, and to destroy the string when you found something: `my $str = ''; my $tmp; while (read($handle, $tmp, 32*1024)) { $str .= $tmp; while ($str =~ m/($re)/g) { # have we read up to the end? if (pos($str) == length($str) - 1) { # yes - discard our attempt last; } my $data = $1; # do something with $data here # and now destruct our pour string: $str = substr($str, pos($str)); # and start from the beginning again pos($str) = 0; } }` [download] This code is untested, but it should give you an idea of how to do it.	[reply] [d/l]
Re^2: Parsing 12GB Entourage database in pieces... by ikegami (Patriarch) on Aug 28, 2008 at 19:26 UTC
`my $str = ''; my $tmp; while (read($handle, $tmp, 321024)) { $str .= $tmp;` [download] can be written as `my $str = ''; while (read($handle, $str, 321024, length($str))) {` [download] Also, `pos($str) = 0;` [download] is already done by `$str = substr($str, pos($str));` [download] (Well, `pos($str) = undef;` is.)	[reply] [d/l] [select]
Re: Parsing 12GB Entourage database in pieces... by tilly (Archbishop) on Aug 28, 2008 at 21:31 UTC
This seems like exactly the kind of problem that File::Stream is designed for.	[reply]
Re: Parsing 12GB Entourage database in pieces... by kyle (Abbot) on Aug 28, 2008 at 19:13 UTC
Since your data of interest never has nulls in it but always has nulls between them, I'd say set $/ to that and go to it. `$/ = "\0\0"; while (<>) { chomp; if ( m{ \A MSrc .{16} (.+) \z }xms ) { # message in $1 } }` [download] Does that work?	[reply] [d/l]
Re^2: Parsing 12GB Entourage database in pieces... by moritz (Cardinal) on Aug 28, 2008 at 19:22 UTC
Since your data of interest never has nulls in it remember that the 16 bytes of binary data could contain "\0\0", and every time this actually happens one looses that message with your method.	[reply]
Re^3: Parsing 12GB Entourage database in pieces... by kyle (Abbot) on Aug 28, 2008 at 19:29 UTC
Hmmm, good point. Maybe... `$/ = "\0\0"; while (<>) { $_ .= <> while ( ! eof && length() < length("MSrc$/")+16 ); # if ... }` [download]	[reply] [d/l]
Re: Parsing 12GB Entourage database in pieces... by betterworld (Curate) on Aug 28, 2008 at 19:07 UTC
Maybe you could set `$/` to `"\0\0MSrc"` and read the file record by record. You just have to hope that this string does not occur in "16 bytes of binary data" anywhere.	[reply] [d/l] [select]
Re^2: Parsing 12GB Entourage database in pieces... by ikegami (Patriarch) on Aug 28, 2008 at 19:18 UTC
If there's 1GB of data between two messages, that will attempt to read that 1GB into memory.	[reply]
Re^3: Parsing 12GB Entourage database in pieces... by kyle (Abbot) on Aug 28, 2008 at 20:11 UTC
I could still be wrong, but I have a hard time believing that all that is really necessary. I wrote but didn't test this: `my $msg_marker = "\0\0MSrc"; my $tiny_read = 16 + length( $msg_marker ); while ( ! eof ) { $/ = \$tiny_read; $_ = <>; # marker in here? while ( -1 == index $_, $msg_marker and ! eof ) { # chop the beginning if still no marker $_ = substr $_, $tiny_read if length > $tiny_read; $_ .= <>; } $_ .= <>; # make sure to get those 16 bytes $/ = "\0\0"; $_ .= <>; # read to the end of the message message_in_here( $_ ); }` [download] The down side is that I'm reading 12G in 22 byte increments (except during messages). That might be too slow. On the other hand, it's short and fairly comprehensible (especially if you give names to things I didn't).	[reply] [d/l]
Re^4: Parsing 12GB Entourage database in pieces... by ikegami (Patriarch) on Aug 29, 2008 at 00:15 UTC
Re: Parsing 12GB Entourage database in pieces... by betterworld (Curate) on Aug 28, 2008 at 19:52 UTC
There might be a completely different solution. Using Sys::Mmap, you can map the entire file to a single string, and then brute-force through that string with something like `while ($string =~ m/.../g) {...}` (like you hinted in your post). However, there are some caveats: Not all operating systems support mmap; you'd need a 64 bit system to fit those 12GB into your program's address space; I don't know how regular expressions perform with such a huge string. I've just tried it on a mmapped 200MB file (most of which consisted of \0) and it did quite well, but that's much smaller than your file.	[reply] [d/l]
Re^2: Parsing 12GB Entourage database in pieces... by ikegami (Patriarch) on Aug 28, 2008 at 20:06 UTC
I don't know how regular expressions perform with such a huge string. Some uses of "``" are equivalent to "`{0,32767}`", so you might have problems. `>perl -Mre=debug -we"qr/^(.)(\1)\z/" ... 9: CURLYX[1] {0,32767}(14) ...` [download] Be sure to prevent backtracking using `(?>...)` or (in 5.10.0+) the possessive quantifier. Update: "`/\0\0MSrc.{16}((?>[^\0]*))(?=\0)/s`" looks safe.	[reply] [d/l] [select]
Re^3: Parsing 12GB Entourage database in pieces... by betterworld (Curate) on Aug 28, 2008 at 20:40 UTC
Interesting... `perl5.8.8 -wle '$s = "x" x 40_000; $s =~ /^(.)(\1)/ and print length $2' # (segfaults) perl5.10.0 -wle '$s = "x" x 40_000; $s =~ /^(.)(\1)/ and print length $2' Complex regular subexpression recursion limit (32766) exceeded at -e l +ine 1. 32767` [download] Well, at least it seems to warn when this limitation affects the result...	[reply] [d/l]
Re^4: Parsing 12GB Entourage database in pieces... by ikegami (Patriarch) on Aug 29, 2008 at 00:21 UTC