The format of the database file is mostly binary data (which I don't care about) with messages (which I do care about) interspersed. Each message starts with a sequence of two or more null chars followed by the string 'MSrc'. That is followed by 16 bytes of binary data, then the mail headers, and finally the mail message itself; neither headers nor message contains any null chars. After all that come two or more nulls, followed by either another message or more binary data I don't care about.
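To make that layout concrete, here is one hypothetical record built in Perl; the header bytes and message text are invented purely for illustration:

    use strict;
    use warnings;

    # One made-up record matching the description: >=2 nulls, 'MSrc',
    # 16 bytes of binary data, a null-free mail message, trailing nulls.
    my $record = "\0\0\0"                        # two-or-more null separator
               . 'MSrc'                          # message signature
               . ("\x01" x 16)                   # 16 bytes of binary data (invented)
               . "From: someone\@example.com\r\n"
               . "Subject: hello\r\n\r\n"
               . "Body text, no NULs allowed.\r\n"
               . "\0\0";                         # trailing nulls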
Now, if I had all the data as one massive string, I could brute-force through it with a regex like the following and just write each recovered message to an individual text file:

    m/
        \0{2}         # two leading nulls
        MSrc          # signal of a new message
        .{16}         # 16 bytes of binary data
        ([^\0]+?)     # what I want: mail headers and message
        (?= \0{2} )   # up to (but not including) the trailing nulls
    /msx

Trouble is, the database file itself is > 12GB, so I need to read it in and process it in much smaller pieces. Obviously, this would require the use of read and a while loop, but how do I ensure that I don't skip any messages, or cut them off midway, between reads?
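One way to avoid both problems is a carry-over buffer: append each chunk to whatever was left over from the previous pass, extract every complete message currently in the buffer, and keep only the unconsumed tail for the next pass. A minimal sketch of that idea, assuming the format described above; the file name, chunk size, and save_message() helper are all invented for illustration:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $file  = 'Database';          # placeholder name
    my $CHUNK = 8 * 1024 * 1024;     # 8 MB per read; any size works as long
                                     # as it comfortably exceeds one message

    open my $fh, '<:raw', $file or die "Can't open $file: $!";

    my $buf   = '';
    my $count = 0;

    while (read $fh, my $chunk, $CHUNK) {
        $buf .= $chunk;

        my $consumed = 0;
        while (
            $buf =~ m/
                \0{2}         # two leading nulls
                MSrc          # signal of a new message
                .{16}         # 16 bytes of binary data
                ([^\0]+?)     # mail headers and message
                (?= \0{2} )   # only accept a message whose trailing
                              # nulls have already arrived in the buffer
            /gmsx
        ) {
            save_message(++$count, $1);
            $consumed = pos $buf;    # end of the last complete match
        }

        # Discard what we've fully processed; the remainder may be the
        # front half of a message that finishes in the next chunk.
        # (If the file had gigabytes of junk with no messages at all,
        # you'd also want to cap how much tail is retained.)
        substr $buf, 0, $consumed, '' if $consumed;
    }

    sub save_message {
        my ($n, $msg) = @_;
        open my $out, '>', "message_$n.txt"
            or die "Can't write message_$n.txt: $!";
        print {$out} $msg;
    }

Because each byte enters the buffer exactly once and is discarded only after the match that consumed it, nothing gets skipped and nothing gets saved twice; the lookahead keeps a half-arrived message in the buffer until its terminating nulls show up in a later chunk.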
I thought about, for example, reading in N bytes and processing them, then seeking back N/2 bytes and reading N bytes again, and so on. But in that case, how do I ensure I don't save the same messages twice? I can't quite wrap my head around the logic I'd need.
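For what it's worth, the buffered approach above sidesteps the duplicate problem entirely, since nothing is ever read twice. But if you do want the read-N, seek-back-N/2 scheme, one way to avoid duplicates is to track absolute file offsets: only save a message if it starts past the last offset already written. A sketch under the assumption that no message (plus its delimiters) is longer than N/2, so every message lies entirely inside at least one window:

    use strict;
    use warnings;

    my $N = 8 * 1024 * 1024;   # window size (made up); must be at least
                               # twice the size of the largest message
    open my $fh, '<:raw', 'Database' or die "Can't open: $!";

    my $window_start = 0;      # absolute file offset of the current window
    my $last_saved   = -1;     # absolute start offset of last saved message
    my $count        = 0;

    while (1) {
        seek $fh, $window_start, 0 or die "Can't seek: $!";
        my $got = read $fh, my $window, $N;
        last unless $got;

        while ($window =~ m/ \0{2} MSrc .{16} ([^\0]+?) (?= \0{2} ) /gmsx) {
            my $abs = $window_start + $-[0];  # window offset -> file offset
            next if $abs <= $last_saved;      # seen in an earlier window: skip
            save_message(++$count, $1);
            $last_saved = $abs;
        }

        last if $got < $N;               # short read: end of file
        $window_start += int($N / 2);    # advance half a window, as you proposed
    }

    sub save_message {
        my ($n, $msg) = @_;
        open my $out, '>', "message_$n.txt"
            or die "Can't write message_$n.txt: $!";
        print {$out} $msg;
    }

Since message start offsets strictly increase through the file, comparing against the highest offset saved so far is enough; no hash of seen messages is needed.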