grondilu has asked for the wisdom of the Perl Monks concerning the following question:
Hello,
I'm relatively new to Perl, but I recently developed a cool app which I'm dying to show you, and I'd like to ask for ideas for improvement.
I don't have full internet access, so I really wanted an off-line Wikipedia viewer. Some solutions exist, but I wanted something that I could fully customize.
So first, I have a script that turns a Wikipedia dump into a Berkeley database:
#!/usr/bin/perl -w
use strict;
use warnings;
use DB_File;
use DBM_Filter;
use Compress::Zlib;

die "please provide a database name" unless $ARGV[0];

# Tie the hash to a Berkeley DB file, keeping the tie object around
# so we can push DBM filters onto it.
my $db = tie my %db, 'DB_File', "$ARGV[0].zdb", O_RDWR|O_CREAT
    or die "could not open $ARGV[0]: $!";

# Transparently compress article text on store and uncompress on fetch;
# redirect stubs are tiny, so they are left as plain text.
$db->Filter_Value_Push(
    Fetch => sub { $_ = uncompress $_ unless /^#REDIRECT/ },
    Store => sub { $_ = compress   $_ unless /^#REDIRECT/ },
);

END { undef $db, untie %db and warn "database properly closed" }
$SIG{$_} = sub { die 'caught signal ' . shift } for qw(HUP TERM INT);

# Parse the line-oriented output of xml2: one path=value pair per line,
# with a bare /mediawiki/page line marking the page boundary.
my ($title, $text);
while (<STDIN>) {
    if    ( s,^/mediawiki/page/title=,, )         { chomp; $title = $_ }
    elsif ( s,^/mediawiki/page/revision/text=,, ) { $text .= $_ }
    elsif ( m{^/mediawiki/page$} ) {
        $db{$title} = $text;
        undef $text;
    }
}
The input is expected to have already been un-XML-ized with xml2, so the program is run with a command such as:
$ bzcat wikipedia-dump.xml.bz2 | xml2 | perl the_script_above enwiki
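For reference, the stream the script parses looks roughly like this (the article text here is invented for illustration): xml2 flattens the XML into one path=value pair per line, and a bare /mediawiki/page line separates consecutive pages, which is what the script uses as the commit point:

/mediawiki/page/title=Perl
/mediawiki/page/revision/text='''Perl''' is a programming language...
/mediawiki/page
/mediawiki/page/title=Camel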
Then I use a CGI script to browse the database with my local web server (thttpd). The script uses the Text::MediawikiFormat module. It works fine, so I don't really need to show it to you, but I've added it at the end of this post anyway.
The problem I have is that the database is still quite large: about five times larger than the compressed XML dump. The thing is that I compress each article separately, and I'm pretty sure I'd get a much better result if I could compress them with a shared dictionary or something similar. I have looked in the Compress::Zlib documentation but found no clear tip on how to do that.
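The closest thing I can see is the -Dictionary option that Compress::Zlib's deflateInit and inflateInit accept, which looks like zlib's preset-dictionary feature. Here is a rough sketch of what I imagine (the file shared.dict is made up, and building a good dictionary out of common wikitext fragments is precisely the part I don't know how to do):

use Compress::Zlib;

# Load the shared dictionary once; the exact same bytes must be
# used for compression and decompression.
my $dict = do {
    open my $fh, '<', 'shared.dict' or die "no dictionary: $!";
    local $/;
    <$fh>;
};

sub zip_with_dict {
    my $in = shift;
    my ($d, $status) = deflateInit( -Dictionary => $dict );
    die "deflateInit failed: $status" unless $status == Z_OK;
    my ($out)  = $d->deflate($in);
    my ($rest) = $d->flush;
    return $out . $rest;
}

sub unzip_with_dict {
    my $in = shift;    # copy: inflate() consumes its buffer
    my ($i, $status) = inflateInit( -Dictionary => $dict );
    die "inflateInit failed: $status" unless $status == Z_OK;
    my ($out) = $i->inflate($in);
    return $out;
}

If that is the right approach, these two subs could simply replace compress and uncompress in the Filter_Value_Push calls above.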
Any idea?
#!/usr/bin/perl -w
use strict;
use warnings;
use DB_File;
use DBM_Filter;
use Compress::Zlib;
use CGI qw(:standard);
use utf8;
use Encode qw(encode decode);
use Text::MediawikiFormat 'wikiformat';

use constant HTML_HOME => a({ -href => $ENV{SCRIPT_NAME} }, 'back home');

# Forward declarations; the definitions are not shown in this post.
sub pretreatment;
sub postreatment;

my %wiki;
END { untie %wiki }
BEGIN { print header(-charset=>'utf-8'), start_html('offline wiki') }
END { print p($@) if $@; print +HTML_HOME, end_html }

my $project = param('project') || 'frwiki';

if ( param 'search' ) {
    # Search the (possibly compressed) list of titles with the
    # appropriate grep flavor.
    my $search = param('search');
    my $grep =
          -f "$project.titles"    ? [ 'grep', ''     ]
        : -f "$project.titles.gz" ? [ qw(zgrep .gz)  ]
        : -f "$project.titles.xz" ? [ qw(xzgrep .xz) ]
        : die "No title file in CGI directory for project $project";
    open my $result, '-|', $grep->[0], '-i', $search,
        "$project.titles" . $grep->[1];
    my @result = <$result>;
    print p [
        @result
        ? scalar(@result) . " result(s) for \"$search\":"
          . ul( map li(a({ -href=>"?project=$project&title=$_" }, $_)), @result )
        : "No result for '$search' on $project."
    ];
}
elsif ( param 'title' ) {
    my $title = param 'title';
    my $wiki = tie %wiki, 'DB_File', "$project.zdb", O_RDONLY
        or print p "could not open wiki $project.zdb" and die;
    # Same value filters as in the conversion script.
    $wiki->Filter_Value_Push(
        Fetch => sub { $_ = uncompress $_ unless /^#REDIRECT/ },
        Store => sub { $_ = compress   $_ unless /^#REDIRECT/ },
    );
    our $text = param('modif') || $wiki{$title};
    $wiki{$title} = $text if param('modif') and param('overwrite') eq 'on';
    unless ($text) {
        print p [
            "No article with this exact title in $project",
            "You may try <a href=\"?project=$project&title="
            . encode('UTF-8', ucfirst decode 'UTF-8', $title)
            . "\">this link</a>.",
        ];
        die;
    }
    eval {
        my $text = decode param('encoding') || param('enc') || 'UTF-8', $main::text;
        print h1($title),
            encode('UTF-8',
                postreatment wikiformat pretreatment($text), {},
                { prefix => "?project=$project&title=" }
            ),
            "\n",
            start_form(-action=>"?project=$project&title=$title"),
            hidden(-name=>'title',   -value=>$title),
            hidden(-name=>'project', -value=>$project),
            textarea(
                -name  => 'modif',
                -value => $main::text,
                -rows  => 10,
                -cols  => 80,
            ), br,
            checkbox(-name=>'overwrite', -checked=>0, -label=>'write'), br,
            submit,
            end_form;
    };
    print p $@ if $@;
}
else {
    # Front page: pick a project among the available databases and
    # enter a search term.
    print start_form,
        radio_group(
            -name    => 'project',
            -values  => [ map s/\.zdb$//r, glob '*.zdb' ],
            -default => 'frwiki',
        ), br,
        'search:', textfield('search'), br,
        submit,
        end_form;
}
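(pretreatment and postreatment are declared in the script above, but their definitions didn't make it into the paste; hypothetical no-op stubs like these would at least let the script compile and run:)

sub pretreatment { $_[0] }    # placeholder: the real definition isn't shown
sub postreatment { $_[0] }    # placeholder: the real definition isn't shown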
Replies are listed 'Best First'.
Re: How to efficiently compress a Berkeley database?
by cavac (Prior) on Jan 03, 2012 at 15:12 UTC
by grondilu (Friar) on Jan 04, 2012 at 10:40 UTC |