in reply to Re: Processing XML with MediaWiki::DumpFile
in thread Processing XML with MediaWiki::DumpFile

There's definitely too much data to build a tree in memory: the "small" version of the wiki dump (only the latest version of each article) that I'm starting from is about 30 GB, and my reduced data file is still about 380 MB.

I'll poke around CPAN for a streaming XML generator.
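XML::Writer looks like a likely candidate. A rough sketch of how the rewrite might go, streaming one record at a time instead of holding a tree (the filenames and the element names in my reduced format are illustrative here, not the real ones):

    #!/usr/bin/perl
    use strict;
    use warnings;

    use MediaWiki::DumpFile::Pages;
    use XML::Writer;
    use IO::File;

    # Read the dump with the streaming pages interface.
    my $pages = MediaWiki::DumpFile::Pages->new('pages-articles.xml');

    # Write the reduced file as we go; nothing accumulates in memory.
    my $out    = IO::File->new('reduced.xml', 'w') or die "open: $!";
    my $writer = XML::Writer->new(
        OUTPUT      => $out,
        DATA_MODE   => 1,
        DATA_INDENT => 2,
    );

    $writer->xmlDecl('UTF-8');
    $writer->startTag('pages');

    while (defined(my $page = $pages->next)) {
        $writer->startTag('page');
        $writer->dataElement('title', $page->title);
        $writer->dataElement('text',  $page->revision->text);
        $writer->endTag('page');
    }

    $writer->endTag('pages');
    $writer->end();
    $out->close();

Since XML::Writer balances the startTag/endTag calls itself, the output should come out well-formed no matter how large the dump is.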


Re^3: Processing XML with MediaWiki::DumpFile
by Anonymous Monk on Feb 12, 2012 at 16:29 UTC
    The recommendation helped a lot: I switched to generating the reduced file with XML::Writer, and now it reads back as valid XML. The Wikimedia parser won't read it, but that's not a problem, since there are plenty of XML parsers available. It only took a couple of lines of code to read the output file back in and dump it to another file.
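    A sketch of that read-back step, using XML::Twig as one of the streaming options so the 380 MB file never has to sit in memory as a tree (the filenames and the <page> element name are assumptions about the reduced format, not from my actual script):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use XML::Twig;

        open my $out, '>', 'copy.xml' or die "open: $!";

        my $twig = XML::Twig->new(
            twig_handlers => {
                # Called each time a <page> element finishes parsing.
                page => sub {
                    my ($t, $elt) = @_;
                    $t->flush($out);   # print what's parsed so far, then free it
                },
            },
        );

        $twig->parsefile('reduced.xml');
        $twig->flush($out);            # emit the closing root tag
        close $out;

    Flushing inside the handler is what keeps memory flat: each <page> is written out and discarded as soon as it is complete.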