The format of the database file is mostly binary data (which I don't care about) with messages (which I do care about) interspersed. Each message starts with a sequence of two or more null chars followed by the string 'MSrc'. That is followed by 16 bytes of binary data, then the mail headers, and finally the mail message itself; neither headers nor message contains any null chars. After all that come two or more nulls, followed by either another message or more binary data I don't care about.
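To make that layout concrete, here is one hypothetical record built in Perl; the header bytes and message text are invented purely for illustration:

    use strict;
    use warnings;

    # One made-up record matching the description: >=2 nulls, 'MSrc',
    # 16 bytes of binary data, a null-free mail message, trailing nulls.
    my $record = "\0\0\0"                        # two-or-more null separator
               . 'MSrc'                          # message signature
               . ("\x01" x 16)                   # 16 bytes of binary data (invented)
               . "From: someone\@example.com\r\n"
               . "Subject: hello\r\n\r\n"
               . "Body text, no NULs allowed.\r\n"
               . "\0\0";                         # trailing nulls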
Now, if I had all the data as one massive string, I could brute-force through it with a regex like the following and just write each recovered message to an individual text file:

    m/
        \0{2}         # two leading nulls
        MSrc          # signal of a new message
        .{16}         # 16 bytes of binary data
        ([^\0]+?)     # what I want: mail headers and message
        (?= \0{2} )   # up to (but not including) the trailing nulls
    /msx

Trouble is, the database file itself is > 12GB, so I need to read it in and process it in much smaller pieces. Obviously, this would require the use of read and a while loop, but how do I ensure that I don't skip any messages, or cut them off midway, between reads?
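One way to avoid both problems is a carry-over buffer: append each chunk to whatever was left over from the previous pass, extract every complete message currently in the buffer, and keep only the unconsumed tail for the next pass. A minimal sketch of that idea, assuming the format described above; the file name, chunk size, and save_message() helper are all invented for illustration:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $file  = 'Database';          # placeholder name
    my $CHUNK = 8 * 1024 * 1024;     # 8 MB per read; any size works as long
                                     # as it comfortably exceeds one message

    open my $fh, '<:raw', $file or die "Can't open $file: $!";

    my $buf   = '';
    my $count = 0;

    while (read $fh, my $chunk, $CHUNK) {
        $buf .= $chunk;

        my $consumed = 0;
        while (
            $buf =~ m/
                \0{2}         # two leading nulls
                MSrc          # signal of a new message
                .{16}         # 16 bytes of binary data
                ([^\0]+?)     # mail headers and message
                (?= \0{2} )   # only accept a message whose trailing
                              # nulls have already arrived in the buffer
            /gmsx
        ) {
            save_message(++$count, $1);
            $consumed = pos $buf;    # end of the last complete match
        }

        # Discard what we've fully processed; the remainder may be the
        # front half of a message that finishes in the next chunk.
        # (If the file had gigabytes of junk with no messages at all,
        # you'd also want to cap how much tail is retained.)
        substr $buf, 0, $consumed, '' if $consumed;
    }

    sub save_message {
        my ($n, $msg) = @_;
        open my $out, '>', "message_$n.txt"
            or die "Can't write message_$n.txt: $!";
        print {$out} $msg;
    }

Because each byte enters the buffer exactly once and is discarded only after the match that consumed it, nothing gets skipped and nothing gets saved twice; the lookahead keeps a half-arrived message in the buffer until its terminating nulls show up in a later chunk.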
I thought about, for example, reading in N bytes and processing them, then seeking back N/2 bytes and reading N bytes again, and so on. But in that case, how do I ensure I don't save the same messages twice? I can't quite wrap my head around the logic I'd need.
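For what it's worth, the buffered approach above sidesteps the duplicate problem entirely, since nothing is ever read twice. But if you do want the read-N, seek-back-N/2 scheme, one way to avoid duplicates is to track absolute file offsets: only save a message if it starts past the last offset already written. A sketch under the assumption that no message (plus its delimiters) is longer than N/2, so every message lies entirely inside at least one window:

    use strict;
    use warnings;

    my $N = 8 * 1024 * 1024;   # window size (made up); must be at least
                               # twice the size of the largest message
    open my $fh, '<:raw', 'Database' or die "Can't open: $!";

    my $window_start = 0;      # absolute file offset of the current window
    my $last_saved   = -1;     # absolute start offset of last saved message
    my $count        = 0;

    while (1) {
        seek $fh, $window_start, 0 or die "Can't seek: $!";
        my $got = read $fh, my $window, $N;
        last unless $got;

        while ($window =~ m/ \0{2} MSrc .{16} ([^\0]+?) (?= \0{2} ) /gmsx) {
            my $abs = $window_start + $-[0];  # window offset -> file offset
            next if $abs <= $last_saved;      # seen in an earlier window: skip
            save_message(++$count, $1);
            $last_saved = $abs;
        }

        last if $got < $N;               # short read: end of file
        $window_start += int($N / 2);    # advance half a window, as you proposed
    }

    sub save_message {
        my ($n, $msg) = @_;
        open my $out, '>', "message_$n.txt"
            or die "Can't write message_$n.txt: $!";
        print {$out} $msg;
    }

Since message start offsets strictly increase through the file, comparing against the highest offset saved so far is enough; no hash of seen messages is needed.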