grondilu has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I'm relatively new to Perl, but I recently developed a cool app which I'm dying to show you, and I'd like to ask for ideas for improvement.

I don't have full internet access, so I really wanted an off-line Wikipedia viewer. Some solutions exist, but I wanted something that I could fully customize.

So first, I have a script that turns a Wikipedia dump into a Berkeley DB database:

#!/usr/bin/perl -w
use strict;
use warnings;
use DB_File;
use DBM_Filter;
use Compress::Zlib;

die "please provide a database name" unless $ARGV[0];

# Tie the Berkeley DB file and compress every value on the fly,
# except redirects, which are stored as plain text.
my $db = tie my %db, 'DB_File', "$ARGV[0].zdb", O_RDWR|O_CREAT
    or die "could not open $ARGV[0]: $!";
$db->Filter_Value_Push(
    Fetch => sub { $_ = uncompress $_ unless /^#REDIRECT/ },
    Store => sub { $_ = compress   $_ unless /^#REDIRECT/ },
);

END { undef $db, untie %db and warn "database properly closed" }
$SIG{$_} = sub { die 'caught signal ' . shift } for qw(HUP TERM INT);

# Parse the flattened (xml2) dump: collect the title and the article
# text, then store the article when the page element closes.
my ($title, $text);
while (<STDIN>) {
    if    ( s,^/mediawiki/page/title=,, )         { chomp; $title = $_ }
    elsif ( s,^/mediawiki/page/revision/text=,, ) { $text .= $_ }
    elsif ( m{^/mediawiki/page$} )                { $db{$title} = $text; undef $text }
}

The input is supposed to have already been un-XML-ized with xml2, so the program is run with a command such as:

$ bzcat wikipedia-dump.xml.bz2 | xml2 | perl the_script_above enwiki

Then I use a CGI script to browse the database with my local web server (thttpd). The script uses the Text::MediawikiFormat module. It works fine, so I don't really need to show it to you, but I've added it at the end of this post anyway.

The problem I have is that the database is still quite large: about five times larger than the compressed XML dump. The thing is that I compress each article separately, and I'm pretty sure I'd get a much better result if I could compress them with a shared dictionary or something. I have looked in the Compress::Zlib docs, but I have found no clear tip on how to do that.
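
Digging around, the closest thing I can see is the -Dictionary option to deflateInit/inflateInit in Compress::Zlib, which maps to zlib's preset-dictionary feature. Here is a minimal, untested sketch of what I have in mind; the $dictionary string below is just a made-up example and would have to be built from markup fragments that actually show up in most articles:

use Compress::Zlib;

# Made-up shared dictionary: fragments of wiki markup that occur in
# most articles. zlib only uses it as a source of back-references.
my $dictionary = '[[Category:{{Infobox==References==<ref></ref>';

sub zdict_compress {
    my $data = shift;
    my ($d, $status) = deflateInit(-Dictionary => $dictionary);
    die "deflateInit failed: $status" unless $d;
    my ($out)  = $d->deflate($data);
    my ($tail) = $d->flush();
    return $out . $tail;
}

sub zdict_uncompress {
    my $data = shift;
    my ($i, $status) = inflateInit(-Dictionary => $dictionary);
    die "inflateInit failed: $status" unless $i;
    my ($out) = $i->inflate($data);
    return $out;
}

These two subs could then replace compress/uncompress in the Filter_Value_Push callbacks, as long as every reader uses exactly the same $dictionary string.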

Any idea?

#!/usr/bin/perl -w
use strict;
use warnings;
use DB_File;
use DBM_Filter;
use Compress::Zlib;
use CGI qw(:standard);
use utf8;
use Encode qw(encode decode);
use Text::MediawikiFormat 'wikiformat';

use constant HTML_HOME => a({ -href => $ENV{SCRIPT_NAME} }, 'back home');

sub pretreatment;
sub postreatment;

my %wiki;
END { untie %wiki; }
BEGIN { print header(-charset=>'utf-8'), start_html('offline wiki') }
END { print p($@) if $@; print +HTML_HOME, end_html }

my $project = param('project') || 'frwiki';

if ( param 'search' ) {
    # Search the (possibly compressed) list of titles with grep/zgrep/xzgrep.
    my $search = param('search');
    my $grep =
        -f "$project.titles"    ? [ 'grep', '' ]
      : -f "$project.titles.gz" ? [ qw(zgrep .gz) ]
      : -f "$project.titles.xz" ? [ qw(xzgrep .xz) ]
      : die "No title file in CGI directory for project $project";
    open my $result, '-|', $grep->[0], '-i', $search, $project . ".titles" . $grep->[1];
    my @result = <$result>;
    print p [
        @result
        ? scalar(@result) . " result(s) for \"$search\":"
          . ul( map li(a({ -href=>"?project=$project&title=$_" }, $_)), @result )
        : "No result for '$search' on $project."
    ];
}
elsif ( param 'title' ) {
    # Fetch one article from the Berkeley DB and render it as HTML.
    my $title = param 'title';
    my $wiki = tie %wiki, 'DB_File', "$project.zdb", O_RDONLY
        or print p "could not open wiki $project.zdb" and die;
    $wiki->Filter_Value_Push(
        Fetch => sub { $_ = uncompress $_ unless /^#REDIRECT/ },
        Store => sub { $_ = compress   $_ unless /^#REDIRECT/ },
    );
    our $text = param('modif') || $wiki{$title};
    $wiki{$title} = $text if param('modif') and param('overwrite') eq 'on';
    unless ($text) {
        print p [
            "No article with this exact title in $project",
            "You may try <a href=\"?project=$project&title="
            . encode('UTF-8', ucfirst decode 'UTF-8', $title)
            . "\">this link</a>.",
        ];
        die;
    }
    eval {
        my $text = decode param('encoding') || param('enc') || 'UTF-8', $main::text;
        print h1($title),
            encode('UTF-8',
                postreatment wikiformat pretreatment($text), {},
                { prefix => "?project=$project&title=" }
            ),
            "\n",
            start_form(-action=>"?project=$project&title=$title"),
            hidden(-name=>'title',   -value=>$title),
            hidden(-name=>'project', -value=>$project),
            textarea(
                -name=>'modif', -value=>$main::text,
                -rows=>10, -cols=>80,
            ), br,
            checkbox(-name=>'overwrite', -checked=>0, -label=>'write'), br,
            submit,
            end_form
    };
    print p $@ if $@;
}
else {
    # Front page: pick a project and enter a search term.
    print start_form,
        radio_group(
            -name=>'project',
            -values=>[ map s/\.zdb$//r, glob '*.zdb' ],
            -default=>'frwiki'
        ), br,
        'search:', textfield('search'), br,
        submit,
        end_form;
}

Re: How to efficiently compress a Berkeley database?
by cavac (Prior) on Jan 03, 2012 at 15:12 UTC

    The thing is that I compress each article separately, and I'm pretty sure I'd get a much better result if I could compress them with a shared dictionary or something.

    This is a classic time vs. space problem. If you compress the articles together, then every time you want to read one you have to (at least) decompress all the others that come before it.

    Instead of Zlib compression, you can try BZip2 or LZMA. They usually compress much better; see, for example, the modules IO::Compress::Bzip2 and IO::Compress::Lzma respectively.
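
    The change in your loader script would be small, since IO::Compress::Bzip2 and IO::Uncompress::Bunzip2 have a one-shot functional interface. An untested sketch, assuming $db is the tied DB_File object from your first script:

    use IO::Compress::Bzip2     qw(bzip2   $Bzip2Error);
    use IO::Uncompress::Bunzip2 qw(bunzip2 $Bunzip2Error);

    $db->Filter_Value_Push(
        Fetch => sub {
            return if /^#REDIRECT/;    # redirects are stored as plain text
            my $in = $_;
            bunzip2 \$in => \my $out or die "bunzip2 failed: $Bunzip2Error";
            $_ = $out;
        },
        Store => sub {
            return if /^#REDIRECT/;
            my $in = $_;
            bzip2 \$in => \my $out or die "bzip2 failed: $Bzip2Error";
            $_ = $out;
        },
    );

    The CGI script would of course need the same pair of callbacks.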

    You could, in theory, use Archive::Tar to compress multiple articles into a single file when the optional IO::Zlib is installed.
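
    Roughly like this (untested; the archive member names and the two $text_of_* variables are placeholders):

    use Archive::Tar;

    # Writing: pack several articles into one gzip-compressed tarball.
    my $tar = Archive::Tar->new;
    $tar->add_data('Perl.wiki', $text_of_perl_article);
    $tar->add_data('CPAN.wiki', $text_of_cpan_article);
    $tar->write('articles.tar.gz', COMPRESS_GZIP);   # COMPRESS_GZIP needs IO::Zlib

    # Reading one article back means going through the whole archive again.
    my $archive  = Archive::Tar->new('articles.tar.gz');
    my $wikitext = $archive->get_content('Perl.wiki');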

    Does any of this really give you a better compression ratio at all, and if it does, how much will it affect your loading time? Well, you really have to build a few simple test cases with a few hundred randomly selected articles, I guess. I think using Bzip2 or LZMA could actually improve both, since CPUs are generally very fast at decompressing and you'll use less bandwidth from the hard disk. But generating the data will be very slow.
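
    A rough, untested sketch of such a test (the database name and the sample size of 500 are placeholders) could reuse your own tie-plus-filter code to read a random sample back and compare the ratios:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DB_File;
    use DBM_Filter;
    use Compress::Zlib;
    use IO::Compress::Bzip2 qw(bzip2 $Bzip2Error);
    use List::Util qw(shuffle);

    my $dbfile = shift // 'enwiki.zdb';    # placeholder name
    my $db = tie my %db, 'DB_File', $dbfile, O_RDONLY
        or die "could not open $dbfile: $!";
    $db->Filter_Value_Push(
        Fetch => sub { $_ = uncompress $_ unless /^#REDIRECT/ },
        Store => sub { $_ = compress   $_ unless /^#REDIRECT/ },
    );

    # Crude sampling: reading every key of a huge database is slow,
    # but good enough for a one-off measurement.
    my @sample = grep defined, (shuffle keys %db)[0 .. 499];

    my ($raw, $zlib, $bz2) = (0, 0, 0);
    for my $title (@sample) {
        defined(my $text = $db{$title}) or next;
        next if $text =~ /^#REDIRECT/;
        $raw  += length $text;
        $zlib += length compress($text);
        bzip2 \$text => \my $out or die "bzip2 failed: $Bzip2Error";
        $bz2  += length $out;
    }

    printf "raw %d bytes | zlib %.1f%% | bzip2 %.1f%%\n",
        $raw, 100 * $zlib / $raw, 100 * $bz2 / $raw;

    Timing a few hundred fetches through each filter would answer the loading-time half of the question in the same way.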

    As for Archive::Tar, my guess is that it will slow things down while not saving any relevant space compared to your existing solution of using GZip.

    But, as I said, you should really test it for yourself with a relevant (randomly selected) subset of the data you will use in the full project. Only this will give you the best view of the space/time tradeoffs relevant to your project.

      I'm not sure using a better algorithm would help much, as the difference is significant only with large texts, and most Wikipedia articles are rather short. But it's worth trying indeed.