in reply to How to efficiently compress a Berkeley database?

"The thing is that I compress each article separately, and I'm pretty sure I'd get a much better result if I could compress them with a shared dictionary or something."

This is a classic time vs. space problem. If you compress the articles together, then every time you want to read one you have to (at least) decompress all the others stored before it.

Instead of Zlib compression, you can try Bzip2 or LZMA. They usually compress much better; see, for example, the modules IO::Compress::Bzip2 and IO::Compress::Lzma respectively.
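
For example, here is a minimal sketch of the in-memory interface both modules offer (the article text is just placeholder data; the modules also accept filenames or filehandles):

    use strict;
    use warnings;
    use IO::Compress::Bzip2 qw(bzip2 $Bzip2Error);
    use IO::Compress::Lzma  qw(lzma  $LzmaError);

    my $article = "some article text, repeated to give the compressor something to work on. " x 50;

    # Compress the article in memory with bzip2 ...
    bzip2 \$article => \my $bz2
        or die "bzip2 failed: $Bzip2Error\n";

    # ... and again with LZMA, for comparison.
    lzma \$article => \my $lz
        or die "lzma failed: $LzmaError\n";

    printf "original: %d bytes, bzip2: %d bytes, lzma: %d bytes\n",
        length $article, length $bz2, length $lz;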

You could, in theory, use Archive::Tar to bundle multiple articles into a single compressed file, as long as the optional IO::Zlib module is installed.
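
A rough sketch of how that could look (the article names and contents here are made up; COMPRESS_GZIP is the constant Archive::Tar provides for gzip output):

    use strict;
    use warnings;
    use Archive::Tar;   # gzip output needs the optional IO::Zlib

    # Hypothetical articles, keyed by the name they will get inside the archive.
    my %articles = (
        'article_one.txt' => 'text of the first article ...',
        'article_two.txt' => 'text of the second article ...',
    );

    my $tar = Archive::Tar->new;
    $tar->add_data($_, $articles{$_}) for sort keys %articles;

    # Write the whole batch as one gzip-compressed tarball.
    $tar->write('articles.tar.gz', COMPRESS_GZIP)
        or die "write failed: " . $tar->error . "\n";

Note that reading a single article back out (e.g. with $tar->get_content('article_one.txt')) means reading and decompressing the whole archive, which is exactly the time/space tradeoff mentioned above.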

Does any of this really give you a better compression ratio, and if it does, how much will it affect your loading time? Well, you really have to build a few simple test cases with a few hundred randomly selected articles, I guess. I think using Bzip2 or LZMA could actually improve both, since CPUs are generally very fast at decompressing and you'll use less bandwidth from the hard disk. But generating the data will be very slow.
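
Something along these lines could serve as a starting point for such a test, assuming the articles are already in memory (the placeholder data and the iteration count are arbitrary; IO::Compress::Lzma can be added to the comparison in the same way):

    use strict;
    use warnings;
    use List::Util qw(sum);
    use Benchmark  qw(timethese);
    use IO::Compress::Gzip      qw(gzip);
    use IO::Compress::Bzip2     qw(bzip2);
    use IO::Uncompress::Gunzip  qw(gunzip);
    use IO::Uncompress::Bunzip2 qw(bunzip2);

    # Stand-in data; replace with a few hundred randomly selected articles.
    my @articles = map { "placeholder article text " x (20 + int rand 200) } 1 .. 300;

    # Compare total compressed size per algorithm.
    my (@gz, @bz);
    for my $text (@articles) {
        gzip  \$text => \my $g;
        bzip2 \$text => \my $b;
        push @gz, $g;
        push @bz, $b;
    }
    printf "raw: %d, gzip: %d, bzip2: %d bytes\n",
        sum(map length, @articles), sum(map length, @gz), sum(map length, @bz);

    # Compare decompression speed, which is what matters when serving articles.
    timethese(50, {
        gunzip  => sub { my $out; gunzip(\$_,  \$out) for @gz },
        bunzip2 => sub { my $out; bunzip2(\$_, \$out) for @bz },
    });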

As for Archive::Tar, my guess is that it will slow things down while not saving any significant space compared to your existing solution of using gzip.

But, as I said, you should really test it yourself with a representative (randomly selected) subset of the data you will use in the full project. Only that will give you a realistic view of the space/time tradeoffs relevant to your project.

BREW /very/strong/coffee HTTP/1.1
Host: goodmorning.example.com

418 I'm a teapot

Re^2: How to efficiently compress a Berkeley database?
by grondilu (Friar) on Jan 04, 2012 at 10:40 UTC
    I'm not sure using a better algorithm would help much, as the difference is significant only with large texts, and most Wikipedia articles are rather short. But that's worth trying indeed.