
Offline wikipedia using Perl

by grondilu (Friar)
on Mar 08, 2012 at 13:49 UTC ( #958466=CUFP )

I'm finally happy with the code I wrote to browse wikipedia offline.

The trickiest part was keeping the database small. I grouped the articles into blocks of 256; each block is frozen with Storable and then compressed with Bzip2. Built this way, the database is only about 15% larger than the original xml.bz2.
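The block scheme boils down to a freeze-then-compress round trip. A toy sketch (the article strings are made-up stand-ins, not real wikipedia text):

```perl
#!/usr/bin/perl
# Toy sketch of the block scheme: freeze a block of 256 "articles"
# with Storable, compress with Bzip2, then round-trip to check.
use strict;
use warnings;
use Storable qw(freeze thaw);
use IO::Compress::Bzip2     qw(bzip2   $Bzip2Error);
use IO::Uncompress::Bunzip2 qw(bunzip2 $Bunzip2Error);

my @block  = map { "article $_ body text " x 40 } 1 .. 256;
my $frozen = freeze \@block;
bzip2 \$frozen => \my $z or die "bzip2 failed: $Bzip2Error";
printf "frozen: %d bytes, compressed: %d bytes\n",
    length $frozen, length $z;

# round trip: decompress and thaw, then compare one article
bunzip2 \$z => \my $back or die "bunzip2 failed: $Bunzip2Error";
my $copy = thaw $back;
print "round trip ok\n" if $copy->[42] eq $block[42];
```

Compressing a whole frozen block at once is what keeps the ratio close to that of the original dump: bzip2 gets to exploit redundancy across 256 articles instead of compressing each one separately.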

I also use XML::Parser to parse wikipedia's database dump.

Here is the most difficult part: converting the XML database dump into a usable one:

#!/usr/bin/perl -w
use v5.14;
use strict;
use warnings;

die "please provide a database name" unless $ARGV[0];
my $rootname = $ARGV[0] =~ s/\.xml\.bz2$//r =~ s,.*/,,r;

use Encode;
use XML::Parser;
use IO::Uncompress::Bunzip2;
use IO::Compress::Bzip2 qw(bzip2 $Bzip2Error);
use Digest::MD5 qw(md5);
use Storable qw(freeze thaw);

open my $db, "> $rootname.db";     END { close $db }
open my $t,  "> $rootname.titles"; END { close $t }

my ($title, @block, $char);
my %debug;

use DB_File;
tie my %index, 'DB_File', "$rootname.index";
END { untie %index }

$SIG{INT} = sub { die "caught INT signal" };
END { printf "%d entries made\n", scalar keys %index }

# Freeze a block of articles, compress it, and append it to the
# database file behind a 4-byte length prefix.
sub store {
    my $freeze = freeze shift;
    bzip2 \($freeze, my $z);
    my $start = tell $db;
    print $db pack('L', length $z), $z;
    printf "block %d -> %d, compressed ratio is %2.2f%%\n",
        $start, tell($db), 100*length($z)/length($freeze);
}

my $parser = new XML::Parser Handlers => {
    Char  => sub { shift; $char .= shift },
    Start => sub { undef $char },
    End   => sub {
        shift;
        given( $_[0] ) {
            when( 'title' ) {
                $title = encode 'utf8', $char;
                say $t $title;
            }
            when( 'text' ) {
                push @block, $char;
                # Long titles are keyed by their MD5 digest to keep the
                # index small; the value is the offset of the article's
                # block in the db file and its position in the block.
                $index{length($title) > 16 ? md5 $title : $title} =
                    pack 'LC', tell($db), scalar(@block) - 1;
                if (@block == 256) { store \@block; undef @block }
            }
        }
    },
};

$parser->parse( new IO::Uncompress::Bunzip2 $ARGV[0] );

# Flush the last, partial block.
END { store \@block if @block }
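For the read side, something along these lines should work against the files the converter produces. This is a sketch I derived from the index format above (pack 'LC' of block offset and position within the block), not the author's actual CGI script; it builds a tiny one-block demo database first so it is self-contained:

```perl
#!/usr/bin/perl
# Sketch of the lookup side (hypothetical, not the author's CGI).
# Builds a one-block demo database, then looks an article up through
# the DB_File index, mirroring the writer's "pack 'LC'" index format.
use strict;
use warnings;
use v5.14;
use DB_File;
use Digest::MD5 qw(md5);
use Storable qw(freeze thaw);
use IO::Compress::Bzip2     qw(bzip2   $Bzip2Error);
use IO::Uncompress::Bunzip2 qw(bunzip2 $Bunzip2Error);

my $rootname = "demo";   # made-up name for this sketch

# --- write a single block, as the converter does ---
my @block = ("first article", "second article");
tie my %index, 'DB_File', "$rootname.index" or die "tie: $!";
open my $out, '>', "$rootname.db" or die "open: $!";
binmode $out;
$index{"first"}  = pack 'LC', tell($out), 0;
$index{"second"} = pack 'LC', tell($out), 1;
my $frozen = freeze \@block;
bzip2 \$frozen => \my $z or die $Bzip2Error;
print $out pack('L', length $z), $z;
close $out or die "close: $!";

# --- read it back: seek, read the length prefix, decompress, thaw ---
sub lookup {
    my ($root, $title) = @_;
    my $key = length($title) > 16 ? md5 $title : $title;
    return undef unless exists $index{$key};
    my ($offset, $n) = unpack 'LC', $index{$key};
    open my $db, '<', "$root.db" or die "open: $!";
    binmode $db;
    seek $db, $offset, 0;
    read $db, my $len, 4;
    read $db, my $zblock, unpack('L', $len);
    bunzip2 \$zblock => \my $back or die $Bunzip2Error;
    return thaw($back)->[$n];
}

say lookup($rootname, "second");   # prints "second article"
```

The nice property of the length prefix is that a lookup only ever decompresses one block of 256 articles, never the whole database.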

I think it works pretty well, even if the rendering of the Text::Mediawiki module is a bit ugly for some pages; I still need to take care of references, for instance. Still, it does the job, and it's much faster than on-line browsing.

I posted everything (including the CGI script) on my wikipedia userpage, as it also concerns wikipedia users:

EDIT. I also set up a github repo:

Replies are listed 'Best First'.
Re: Offline wikipedia using Perl
by wazoox (Prior) on Mar 09, 2012 at 18:09 UTC

    This looks nice, but I don't really get how it is meant to be used; I suppose I should check your wikipedia page for the missing parts :)

    Just a couple of proposed enhancements:

    • as you're using "warnings", there is no point calling "perl -w"
    • you don't check for errors when opening files and writing. This is worse than a crime, a fault :)
      as you're using "warnings", there is no point calling "perl -w"
      There is a difference, from perldoc warnings:
      The warnings pragma is a replacement for the command line flag -w, but the pragma is limited to the enclosing block, while the flag is global. See perllexwarn for more information.
      -w does everything warnings does, not the other way around. That being said, it is unlikely the OP wants to enable warnings for use'd modules (XML::Parser, etc.).
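      The lexical scoping is easy to demonstrate: the pragma's effect stops at the enclosing block, while -w would enable warnings everywhere. A small illustration, counting warnings via $SIG{__WARN__}:

```perl
#!/usr/bin/perl
# Run WITHOUT -w: only the block with "use warnings" warns.
use strict;

my $count = 0;
$SIG{__WARN__} = sub { $count++ };

{
    use warnings;        # lexical: in effect only inside these braces
    my $x;
    my $y = "" . $x;     # warns: use of uninitialized value
}
{
    my $x;
    my $y = "" . $x;     # same code, no pragma in scope: silent
}
print "warnings seen: $count\n";   # warnings seen: 1
```

      Run the same script with perl -w and both blocks warn, which is exactly the difference the perldoc quote describes.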

      Once the database has been built, it is supposed to be used with a CGI script and a local webserver. The CGI script is indeed on the wikipedia page, but it is kind of ugly, so I didn't post it here as I am not very proud of it :) A CGI is easy to write anyway. Note that it requires Text::Mediawiki in order to turn wiki markup into HTML.

      As for checking errors when opening and writing files, I'll try to correct that.
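      The fix is mostly mechanical: three-argument open with an "or die" that reports $!, plus checks on print and close. A sketch, using a made-up filename:

```perl
#!/usr/bin/perl
# Error-checked file handling (sketch; "demo.db" is a made-up name).
use strict;
use warnings;

my $rootname = "demo";
open my $db, '>', "$rootname.db"
    or die "can't open $rootname.db for writing: $!";
print {$db} "some data"
    or die "write to $rootname.db failed: $!";
close $db
    or die "close of $rootname.db failed: $!";
print "ok\n";
```

      Checking close matters here because the database is written through a buffered handle: a full disk often only shows up when the buffer is flushed at close time.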

Re: Offline wikipedia using Perl
by spx2 (Deacon) on Mar 13, 2012 at 13:25 UTC
    this project looks very interesting, put it up on , maybe some people might want to fork it and add stuff to it

Node Type: CUFP [id://958466]
Approved by marto
Front-paged by Arunbear