ethrbunny has asked for the wisdom of the Perl Monks concerning the following question:

I'm looking into different strategies for dividing up a huge MySQL table. It's not indexed properly, and as a result one can't really get anything out of it. One strategy I'm considering is dumping it out as XML and parsing it with Perl.

My ultimate goal is to break it up into a number of smaller tables, each properly indexed, using a column from the original table. It's all medical data, so each record is tagged with a subject ID.

The table in question has 1.77 billion records, so the resultant XML file would probably be in the neighborhood of 200 GB. I have some fairly robust servers to do the work, but I'm more concerned about the process of parsing such a monster.

My question: is there an XML parser in Perl that will load only a small section of a file, parse it, and move on to the next section? I've only worked with XML::Simple in the past, and I'm sure it's not up to this.

Could I read it in a line at a time, look for the appropriate start/stop tags, and parse it that way? I'm trying the MySQL route on a similar table at the moment (400M records), but it will probably take 24 hours. My process for that table is to read 1M records into a memory table and fork 8 processes at a time (it's an 8-way machine) to select out records by subject ID and write them to individual tables. It started out reading each batch of 1M records in about 10 seconds. It's up to record 300M or so, but each query is taking 10 minutes now. I can't imagine how long each query would take on the larger table.

Suggestions?

Replies are listed 'Best First'.
Re: XML parsing - huge file strategy?
by samtregar (Abbot) on Jul 18, 2008 at 15:57 UTC
    Don't use XML for this. Instead, use MySQL's native support for dumping tab-separated files with 'SELECT INTO OUTFILE'. Then use Text::CSV_XS to read the data a line at a time. The file will be much, much smaller and I'm just about certain the parsing will be faster. It might be interesting to do it both ways and benchmark the results, if you've got the time.
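    A minimal sketch of the tab-separated approach. The column layout (subject ID in the first column) and the demo input are made up; in practice the file would come from SELECT ... INTO OUTFILE with FIELDS TERMINATED BY '\t':

    ```perl
    use strict;
    use warnings;
    use Text::CSV_XS;
    use File::Temp qw(tempfile);

    # Demo input standing in for the INTO OUTFILE dump.
    my ($fh_out, $tsv) = tempfile();
    print $fh_out "s001\t2008-07-18\t120\n",
                  "s002\t2008-07-18\t98\n";
    close $fh_out;

    my $csv = Text::CSV_XS->new({ sep_char => "\t", binary => 1 });
    open my $fh, '<', $tsv or die "open: $!";
    my %count;                       # records seen per subject ID
    while (my $row = $csv->getline($fh)) {
        my $subject_id = $row->[0];  # subject ID assumed to be column 0
        $count{$subject_id}++;
        # ... here you would append $row to that subject's table/file ...
    }
    close $fh;
    ```

    Memory stays flat because only one row is in memory at a time, no matter how big the file is.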

    But if I was going to use XML for this I'd use XML::SAX with the XML::SAX::ExpatXS parser. It's a stream-parser so it won't try to load your whole file into memory at once.
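    A self-contained stream-parsing sketch. It uses whatever XML::SAX parser is installed (the bundled PurePerl one will do for the demo; ExpatXS is the fast one), and the <record> element name is an assumption about the dump format:

    ```perl
    use strict;
    use warnings;

    # A SAX handler that counts <record> elements as they stream past.
    package RecordCounter;
    use base 'XML::SAX::Base';

    sub start_element {
        my ($self, $el) = @_;
        $self->{count}++ if $el->{Name} eq 'record';
        # ... collect the record's data here, write it out on end_element ...
    }

    package main;
    use XML::SAX::ParserFactory;

    my $handler = RecordCounter->new;
    my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
    $parser->parse_string('<dump><record/><record/></dump>');
    ```

    The parser never builds a tree, so a 200 GB file costs no more memory than a 2 KB one.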

    -sam

Re: XML parsing - huge file strategy?
by pjotrik (Friar) on Jul 18, 2008 at 14:39 UTC
    XML::Twig is what you seek. Create handlers for the parts you're interested in and purge the data as soon as you've processed it.
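    A minimal sketch of that handler-and-purge pattern; the <record> element and subject_id attribute are invented stand-ins for whatever the real dump uses:

    ```perl
    use strict;
    use warnings;
    use XML::Twig;

    my @ids;
    my $twig = XML::Twig->new(
        twig_handlers => {
            record => sub {
                my ($t, $rec) = @_;
                push @ids, $rec->att('subject_id');
                # ... write the record out, then free the memory:
                $t->purge;
            },
        },
    );
    $twig->parse(
        q{<dump><record subject_id="s001"/><record subject_id="s002"/></dump>}
    );
    ```

    The purge call is the key: without it XML::Twig keeps the whole tree parsed so far, and a 200 GB file would exhaust memory.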
Re: XML parsing - huge file strategy?
by runrig (Abbot) on Jul 18, 2008 at 15:58 UTC
    It's not indexed properly ...

    What is preventing you from indexing it properly? Even if you do decide to split it up, you shouldn't have to dump the table. And if you do go the XML route, then XML::Twig or XML::Rules.

Re: XML parsing - huge file strategy?
by mirod (Canon) on Jul 18, 2008 at 15:15 UTC

    I would think indexing the table is worth it. It's a one-time cost that will speed up all the rest of your processing. That's what databases are for.

    But since I am more of an XML guy... if you decide to go the XML route... oddly enough I'd recommend using XML::Twig for this ;--)

    That said, if XML::Twig is too slow, you could also use the xml_split tool that comes with it to break the file down into smaller pieces. As long as you don't use a complex XPath expression to split the file (i.e. if you only split by level in the tree, size, or number of records), the tool will use XML::Parser directly and will probably be faster than XML::Twig. Plus it already does what you want, splitting the XML file, so why code it yourself!

    You can then process each file individually.

    And if all else fails, and the records are all in elements of the same name, you can set $/ to the closing tag, read the file one record at a time, and spit them out into several files.
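    That fallback needs nothing beyond core Perl. A sketch, with invented tag names and an in-memory "file" so it runs standalone:

    ```perl
    use strict;
    use warnings;

    # Reading one <record>...</record> per iteration by setting the input
    # record separator to the closing tag.
    my $xml = "<records><record id='s001'>a</record>"
            . "<record id='s002'>b</record></records>";
    open my $in, '<', \$xml or die $!;

    local $/ = '</record>';       # each read now ends at a closing tag
    my @ids;
    while (my $chunk = <$in>) {
        next unless $chunk =~ /<record\b/;      # skip trailing wrapper
        my ($id) = $chunk =~ /id='([^']+)'/;
        push @ids, $id;
        # ... here you would print $chunk to the output file for $id ...
    }
    ```

    This only works if the records really are flat and uniformly named, but for a mechanical database dump that is usually the case.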

    But really, it's data, it should live in a database.

Re: XML parsing - huge file strategy?
by salva (Canon) on Jul 18, 2008 at 15:30 UTC
    XML requires a lot of overhead, both in file size and in processing time. If you only want to dump a table, CSV is probably a better format.

    And anyway, why don't you use DBI to query the database directly, processing the table data on the fly?
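    A sketch of that on-the-fly approach; an in-memory SQLite database stands in for the real MySQL server here, and the table layout is invented:

    ```perl
    use strict;
    use warnings;
    use DBI;

    # Stand-in for DBI->connect('dbi:mysql:...', $user, $pass).
    my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                           { RaiseError => 1 });
    $dbh->do('CREATE TABLE big (subject_id TEXT, value INTEGER)');
    $dbh->do(q{INSERT INTO big VALUES ('s001', 1)});
    $dbh->do(q{INSERT INTO big VALUES ('s002', 2)});

    my $sth = $dbh->prepare('SELECT subject_id, value FROM big');
    $sth->execute;
    my %seen;
    while (my ($sid, $val) = $sth->fetchrow_array) {
        $seen{$sid} = $val;   # process each row as it streams in
    }
    ```

    With DBD::mysql you would also want mysql_use_result => 1 on the handle so the driver streams rows instead of buffering the whole result set client-side.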

Re: XML parsing - huge file strategy?
by pc88mxer (Vicar) on Jul 18, 2008 at 14:43 UTC
    Could I read it in a line at a time, look for the appropriate start/stop tags and parse it that way?
    If your XML is formatted in a manner which is conducive to line-by-line parsing then it shouldn't be a problem.

    Alternatively, if your database dump is just a sequence of one kind of element, this should also work:

    open(XML, "<", ...);
    {
        local($/) = '</record>';
        while (<XML>) {
            ... process one <record> element ...
        }
    }

    With regard to why your experiment on the 400M record table is taking so long... are you grabbing 1M records at a time by using a LIMIT clause (i.e. LIMIT n, 1000000)? If you have any index on that table (say on column colX), adding ORDER BY colX should help speed things along.

Re: XML parsing - huge file strategy?
by dHarry (Abbot) on Jul 18, 2008 at 15:21 UTC

    I have done this myself for a proof of concept some years ago, i.e. checking whether it would be possible to parse large XML files. With a little program I generated several simple but big XML files. Then I used an event-based parser. It worked fine and was scalable as well. Parsing the file should not be a problem, and you can try different parsers to see what suits you best.

    I like pc88mxer's suggestion. As it is a dump from a relational database, the structure of the resulting XML files is probably very simple. By exploiting this you'll get the best performance.

    Another option would be to chop up the file/dump into more manageable chunks, but I don't think disk space or a few hundred MBs of memory is a problem for you.

    Saludos,
    dHarry

Re: XML parsing - huge file strategy?
by pileofrogs (Priest) on Jul 18, 2008 at 18:30 UTC

    If I was in your shoes, I'd use YAML. It's moderately readable by humans, taint-safe, and way way simpler than XML.

    This is something of a tangent, but I don't think I've ever seen a situation in which XML didn't suck.

    OK, Actually, I'd query the MySQL server...

      Does Perl have a streaming parser for YAML? If not, this is a non-starter, since your parser would load the entire file into memory and crash long before it finished.

      I don't see YAML or XML bringing much of value here - an SQL table is by definition a flat set of same-sized records. It's perfect for TSV/CSV and MySQL has built-in support for it.

      -sam

        Yeah, I agree, querying the MySQL server is The Way to Go.

        Mostly I have an irrational hatred for XML.