in reply to Processing XML with MediaWiki::DumpFile

Even the almighty gods don't believe they can write well-formed XML by string concatenation, except in the most trivial cases. The almighty gods aren't substantially smarter than the rest of us; the one thing they have at their disposal to keep from looking stupid is XML libraries.

Thankfully the god Prometheus brought us XML::LibXML from the heavens. Use it; build your output as an XML tree structure, and then just call print $output->toString at the end.
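The in-memory approach looks roughly like this (a minimal sketch; the element names and data are made up for illustration, not taken from your dump):

    use XML::LibXML;

    # build the document as a tree, not as concatenated strings
    my $doc  = XML::LibXML::Document->new('1.0', 'UTF-8');
    my $root = $doc->createElement('pages');
    $doc->setDocumentElement($root);

    # append one child element per record
    my $page = $doc->createElement('page');
    $page->appendTextChild('title', 'Some article');
    $root->appendChild($page);

    # serialise the whole tree in one go (1 = pretty-print)
    print $doc->toString(1);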

Now, in your particular case, it may be that there's too much data to hold in memory at once, so constructing the entire tree in memory might be too much. However, there are various streaming XML output modules on CPAN too. I've not played with any of those enough to recommend one in particular.

Re^2: Processing XML with MediaWiki::DumpFile
by bitingduck (Deacon) on Feb 11, 2012 at 17:15 UTC

    There's definitely too much data to build a tree in memory-- the "small" version of the wiki dump (only the latest version of each article) that I'm starting from is about 30 GB, and my reduced data file is still about 380 MB.

    I'll poke around CPAN for a streaming XML generator.

      The recommendation helped a lot-- I switched to generating the reduced file using XML::Writer and now it reads back as valid XML. The wikimedia parser won't read it, but that's not a problem, as there are a lot of XML parsers available. It only took a couple lines of code to get it to read back the output file and dump it to another file.
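
      For anyone following along, the streaming XML::Writer approach looks something like this (a rough sketch; the file name and element names are placeholders, not my actual script):

          use XML::Writer;
          use IO::File;

          my $out    = IO::File->new('reduced.xml', 'w') or die "open: $!";
          my $writer = XML::Writer->new(OUTPUT => $out, DATA_MODE => 1, DATA_INDENT => 2);

          $writer->xmlDecl('UTF-8');
          $writer->startTag('pages');

          # emit one <page> element per record as it streams past,
          # so the whole document never has to fit in memory
          $writer->startTag('page');
          $writer->dataElement('title', 'Some article');
          $writer->endTag('page');

          $writer->endTag('pages');
          $writer->end();
          $out->close();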