I'm trying to extract pages from the Wikipedia dump files using MediaWiki::DumpFile and am having a little trouble post-processing the extracted XML pages.
The general approach is to run my extractor once on the complete dump (or a sub-dump, for testing right now) and create a new XML file that contains just the pages I want but can still be processed by (almost) the same program. That gives me a smaller set of data to work with while I try to automate extraction of the things I want.
The first part works fine: I can extract the pages and produce what I think is the right XML. Then when I run the same program on the file I've generated, it dies when it gets to ampersands in links in the page text (because it thinks they indicate entities). But somehow it didn't die on them on the first pass, using the exact same code, and it makes it through a few records from my dump before it finds something in the text body that it doesn't like.
It also dies on "&ndash" and "<br>" because they aren't valid XML, but I added a couple of lines to replace those. I could probably do the same for the ampersands, but then it will just die on some other thing that it thinks is an unacceptable entity.
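What I'm tempted to try instead of chasing entities one at a time is re-escaping the bare ampersands and angle brackets before I print the text back out, on the theory that the dump stores them escaped and the parser hands them back to me decoded. A minimal sketch of what I mean (xml_escape is just a name I made up, not anything from MediaWiki::DumpFile):

# hypothetical helper: re-escape the three characters XML cares about
# before writing the wikitext back out; the ampersand substitution has
# to run first so it doesn't mangle the output of the other two
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

If that theory is right, it would also turn the "&ndash" and similar HTML entities back into "&amp;ndash", which is how they appear in the original dump, so a strict parser should stop complaining about undefined entities.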
I'm probably missing some instruction in the XML, though I've copied the file header info verbatim and double-checked the structure carefully against the MediaWiki export schema. Somehow I'm not seeing what's missing.
The complete code is here:
#!/usr/bin/perl
# scrape_wiki_anatomy.pl
# extracts all the anatomy elements from a wikimedia dump
# uses presence of "Infobox Anatomy" to identify pages

use strict;
use warnings;
use MediaWiki::DumpFile::Compat;

my $now_string = localtime;
print STDERR "start time: ", $now_string, "\n";

my $file = shift(@ARGV) or die "must specify a MediaWiki dump of the current pages";
my $pmwd = Parse::MediaWikiDump->new;
my $dump = $pmwd->revisions($file);
my $found = 0;

binmode(STDOUT, ':utf8');
binmode(STDERR, ':utf8');

# header copied verbatim from the export-0.4 dump format
print <<EOF;
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.4/ http://www.mediawiki.org/xml/export-0.4.xsd" version="0.4" xml:lang="en">
EOF

print "\n";

print <<EOF;
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <base>http://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.16wmf4</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Media</namespace>
      <namespace key="-1" case="first-letter">Special</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Talk</namespace>
      <namespace key="2" case="first-letter">User</namespace>
      <namespace key="3" case="first-letter">User talk</namespace>
      <namespace key="4" case="first-letter">Wikipedia</namespace>
      <namespace key="5" case="first-letter">Wikipedia talk</namespace>
      <namespace key="6" case="first-letter">File</namespace>
      <namespace key="7" case="first-letter">File talk</namespace>
      <namespace key="8" case="first-letter">MediaWiki</namespace>
      <namespace key="9" case="first-letter">MediaWiki talk</namespace>
      <namespace key="10" case="first-letter">Template</namespace>
      <namespace key="11" case="first-letter">Template talk</namespace>
      <namespace key="12" case="first-letter">Help</namespace>
      <namespace key="13" case="first-letter">Help talk</namespace>
      <namespace key="14" case="first-letter">Category</namespace>
      <namespace key="15" case="first-letter">Category talk</namespace>
      <namespace key="100" case="first-letter">Portal</namespace>
      <namespace key="101" case="first-letter">Portal talk</namespace>
      <namespace key="108" case="first-letter">Book</namespace>
      <namespace key="109" case="first-letter">Book talk</namespace>
    </namespaces>
  </siteinfo>
EOF

# this is the only currently known value but there could be more in the future
if ($dump->case ne 'first-letter') {
    die "unable to handle any case setting besides 'first-letter'";
}

# walk every revision in the dump and keep only pages carrying the anatomy infobox
my $i = 0;
while (my $page = $dump->next) {
    if (1) {
        # print STDERR "Located text for revision ", $page->revision_id, "\n";
        my $text = $page->text;    # reference to the wikitext, hence the $$text below
        if ($$text =~ m/\{\{Infobox Anatomy/) {
            $$text =~ s/&ndash/-/g;
            $$text =~ s/\<br\>/\<br \/\>/g;
            print "\n<page>\n";
            print "<title>", $page->title, "</title>\n";
            print "<id>", $page->id, "</id>\n";
            print "<revision>\n<id>", $page->revision_id, "</id>\n";
            print "<timestamp>", $page->timestamp, "</timestamp>\n";
            print "<contributor>\n";
            if ($page->username) { print "<username>", $page->username, "</username>\n"; }
            if ($page->userid)   { print "<id>", $page->userid, "</id>\n"; }
            if ($page->userip)   { print "<ip>", $page->userip, "</ip>\n"; }
            print "</contributor>\n";
            if ($page->minor) { print "<minor />\n"; }
            print "<text xml:space=\"preserve\">", $$text, "</text>\n</revision>\n</page>\n";
            $i++;
            if (($i % 100) == 0) { print STDERR "."; }
        }
    }
}
print "</mediawiki>";
print STDERR "\n";

$now_string = localtime;
print STDERR "end time: ", $now_string, "\n";
print STDERR $i, " records dumped\n";
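To find out exactly where the generated file stops being well-formed without stepping through the whole extractor again, a quick throwaway checker along these lines would at least tell me which line the second pass is actually choking on (this is a separate little script, nothing to do with MediaWiki::DumpFile; XML::Parser dies with a line and column number at the first thing it can't parse):

#!/usr/bin/perl
# check_xml.pl -- rough well-formedness check for the generated subset file
# (name and script are just for illustration)
use strict;
use warnings;
use XML::Parser;

my $file = shift(@ARGV) or die "usage: check_xml.pl <file.xml>\n";
eval { XML::Parser->new->parsefile($file) };
if ($@) {
    print "not well-formed: $@";
} else {
    print "looks well-formed\n";
}

Run on the output of the first pass, that should point straight at the first raw ampersand or stray entity instead of me guessing from wherever the main script dies.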