I'm trying to extract pages from the Wikipedia dump files using MediaWiki::DumpFile and am having a little trouble post-processing the extracted XML pages.
The general approach is to run my extractor once on the complete dump (or a sub-dump, for testing right now) and create a new XML file that contains just the pages I want but can still be processed by (almost) the same program. That gives me a smaller set of data to work with while I try to automate extraction of the things I want.
The first part works fine: I can extract the pages and produce what I think is the right XML. Then when I run the same program on the file I've generated, it dies when it gets to ampersands in links in the page text (because it thinks they indicate entities). But somehow it didn't die on them on the first pass, using the exact same code, and it makes it through a few records from my dump before it finds something in the text body that it doesn't like.
It also dies on "&ndash" and "<br>" because they aren't valid XML, but I added a couple of lines to replace those. I could probably do the same for the ampersands, but then it will just die on some other thing that it thinks is an unacceptable entity.
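What I'm tempted to try instead of chasing entities one at a time is re-escaping the bare ampersands and angle brackets before I print the text back out, on the theory that the dump stores them escaped and the parser hands them back to me decoded. A minimal sketch of what I mean (xml_escape is just a name I made up, not anything from MediaWiki::DumpFile):

# hypothetical helper: re-escape the three characters XML cares about
# before writing the wikitext back out; the ampersand substitution has
# to run first so it doesn't mangle the output of the other two
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

If that theory is right, it would also turn the "&ndash" and similar HTML entities back into "&amp;ndash", which is how they appear in the original dump, so a strict parser should stop complaining about undefined entities.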
I'm probably missing some instruction in the XML, though I've copied the file header info verbatim and double-checked the structure carefully against the MediaWiki export schema. Somehow I'm not seeing what's missing.
The complete code is here:
#!/usr/bin/perl
# scrape_wiki_anatomy.pl
# extracts all the anatomy elements from a wikimedia dump
# uses presence of "Infobox Anatomy" to identify pages

use strict;
use warnings;
use MediaWiki::DumpFile::Compat;

my $now_string = localtime;
print STDERR "start time: ", $now_string, "\n";

my $file = shift(@ARGV) or die "must specify a MediaWiki dump of the current pages";
my $pmwd = Parse::MediaWikiDump->new;
my $dump = $pmwd->revisions($file);
my $found = 0;

binmode(STDOUT, ':utf8');
binmode(STDERR, ':utf8');

# header copied verbatim from the export-0.4 dump format
print <<EOF;
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.4/ http://www.mediawiki.org/xml/export-0.4.xsd" version="0.4" xml:lang="en">
EOF

print "\n";

print <<EOF;
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <base>http://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.16wmf4</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Media</namespace>
      <namespace key="-1" case="first-letter">Special</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Talk</namespace>
      <namespace key="2" case="first-letter">User</namespace>
      <namespace key="3" case="first-letter">User talk</namespace>
      <namespace key="4" case="first-letter">Wikipedia</namespace>
      <namespace key="5" case="first-letter">Wikipedia talk</namespace>
      <namespace key="6" case="first-letter">File</namespace>
      <namespace key="7" case="first-letter">File talk</namespace>
      <namespace key="8" case="first-letter">MediaWiki</namespace>
      <namespace key="9" case="first-letter">MediaWiki talk</namespace>
      <namespace key="10" case="first-letter">Template</namespace>
      <namespace key="11" case="first-letter">Template talk</namespace>
      <namespace key="12" case="first-letter">Help</namespace>
      <namespace key="13" case="first-letter">Help talk</namespace>
      <namespace key="14" case="first-letter">Category</namespace>
      <namespace key="15" case="first-letter">Category talk</namespace>
      <namespace key="100" case="first-letter">Portal</namespace>
      <namespace key="101" case="first-letter">Portal talk</namespace>
      <namespace key="108" case="first-letter">Book</namespace>
      <namespace key="109" case="first-letter">Book talk</namespace>
    </namespaces>
  </siteinfo>
EOF

# this is the only currently known value but there could be more in the future
if ($dump->case ne 'first-letter') {
    die "unable to handle any case setting besides 'first-letter'";
}

# walk every revision in the dump and keep only pages carrying the anatomy infobox
my $i = 0;
while (my $page = $dump->next) {
    if (1) {
        # print STDERR "Located text for revision ", $page->revision_id, "\n";
        my $text = $page->text;    # reference to the wikitext, hence the $$text below
        if ($$text =~ m/\{\{Infobox Anatomy/) {
            $$text =~ s/&ndash/-/g;
            $$text =~ s/\<br\>/\<br \/\>/g;
            print "\n<page>\n";
            print "<title>", $page->title, "</title>\n";
            print "<id>", $page->id, "</id>\n";
            print "<revision>\n<id>", $page->revision_id, "</id>\n";
            print "<timestamp>", $page->timestamp, "</timestamp>\n";
            print "<contributor>\n";
            if ($page->username) { print "<username>", $page->username, "</username>\n"; }
            if ($page->userid)   { print "<id>", $page->userid, "</id>\n"; }
            if ($page->userip)   { print "<ip>", $page->userip, "</ip>\n"; }
            print "</contributor>\n";
            if ($page->minor) { print "<minor />\n"; }
            print "<text xml:space=\"preserve\">", $$text, "</text>\n</revision>\n</page>\n";
            $i++;
            if (($i % 100) == 0) { print STDERR "."; }
        }
    }
}
print "</mediawiki>";
print STDERR "\n";

$now_string = localtime;
print STDERR "end time: ", $now_string, "\n";
print STDERR $i, " records dumped\n";
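To find out exactly where the generated file stops being well-formed without stepping through the whole extractor again, a quick throwaway checker along these lines would at least tell me which line the second pass is actually choking on (this is a separate little script, nothing to do with MediaWiki::DumpFile; XML::Parser dies with a line and column number at the first thing it can't parse):

#!/usr/bin/perl
# check_xml.pl -- rough well-formedness check for the generated subset file
# (name and script are just for illustration)
use strict;
use warnings;
use XML::Parser;

my $file = shift(@ARGV) or die "usage: check_xml.pl <file.xml>\n";
eval { XML::Parser->new->parsefile($file) };
if ($@) {
    print "not well-formed: $@";
} else {
    print "looks well-formed\n";
}

Run on the output of the first pass, that should point straight at the first raw ampersand or stray entity instead of me guessing from wherever the main script dies.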