Re: XML parsing - huge file strategy?
by samtregar (Abbot) on Jul 18, 2008 at 15:57 UTC
Don't use XML for this. Instead, use MySQL's native support for dumping tab-separated files with SELECT ... INTO OUTFILE. Then use Text::CSV_XS to read the data a line at a time. The file will be much, much smaller, and I'm just about certain the parsing will be faster. It might be interesting to do it both ways and benchmark the results, if you've got the time.
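A minimal sketch of the Text::CSV_XS side of this, assuming a hypothetical dump file table_dump.txt and made-up column names:

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Tab-separated, as produced by SELECT ... INTO OUTFILE
my $csv = Text::CSV_XS->new({ sep_char => "\t", binary => 1 })
    or die Text::CSV_XS->error_diag;

open my $fh, '<', 'table_dump.txt' or die "open: $!";
while (my $row = $csv->getline($fh)) {
    my ($id, $name, $value) = @$row;   # hypothetical columns
    # ... process one record ...
}
close $fh;
```

Only one line is ever held in memory, so the 400M-row case is no different from a small one.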
But if I was going to use XML for this I'd use XML::SAX with the XML::SAX::ExpatXS parser. It's a stream-parser so it won't try to load your whole file into memory at once.
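If you do go the SAX route, the shape of a handler looks roughly like this; the element name record and the file name are assumptions for the sketch:

```perl
use strict;
use warnings;

package RecordHandler;
use base 'XML::SAX::Base';

sub start_element {
    my ($self, $el) = @_;
    $self->{in_record} = 1 if $el->{Name} eq 'record';
}

sub characters {
    my ($self, $data) = @_;
    $self->{text} .= $data->{Data} if $self->{in_record};
}

sub end_element {
    my ($self, $el) = @_;
    if ($el->{Name} eq 'record') {
        # ... process $self->{text}, then discard it ...
        delete @{$self}{qw(in_record text)};
    }
}

package main;
use XML::SAX::ParserFactory;

$XML::SAX::ParserPackage = 'XML::SAX::ExpatXS';   # prefer the fast C parser
my $parser = XML::SAX::ParserFactory->parser(Handler => RecordHandler->new);
$parser->parse_uri('huge_dump.xml');              # hypothetical file name
```

Memory use stays flat no matter how big the input is, since you throw each record away as soon as end_element fires.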
-sam
Re: XML parsing - huge file strategy?
by pjotrik (Friar) on Jul 18, 2008 at 14:39 UTC
XML::Twig is what you seek. Create handlers for the parts you're interested in and purge the data as soon as you've processed it.
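A short sketch of the handler-plus-purge pattern; the element name record and the file name are assumptions:

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new(
    twig_handlers => {
        record => sub {
            my ($t, $record) = @_;
            # ... extract what you need from $record ...
            $t->purge;   # free everything parsed so far
        },
    },
);
$twig->parsefile('huge_dump.xml');
```

The purge call is the whole trick: without it XML::Twig keeps the tree it has built so far, and a multi-gigabyte file will not fit.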
Re: XML parsing - huge file strategy?
by runrig (Abbot) on Jul 18, 2008 at 15:58 UTC
It's not indexed properly ...
What is preventing you from indexing it properly? Even if you do decide to split it up, you shouldn't have to dump the table. And if you do go the XML route, then XML::Twig or XML::Rules.
Re: XML parsing - huge file strategy?
by mirod (Canon) on Jul 18, 2008 at 15:15 UTC
I would think indexing the table is worth it. It's a one-time cost that will speed up all the rest of your processing. That's what databases are for.
But since I am more of an XML guy... if you decide to go the XML route... oddly enough I'd recommend using XML::Twig for this ;--)
That said, if XML::Twig is too slow, you could also use the xml_split tool that comes with it to break the file down into smaller pieces. As long as you don't use a complex XPath expression to split the file (i.e. if you only split on the level in the tree, the size, or the number of records), the tool will use XML::Parser directly and will probably be faster than XML::Twig. Plus, it already does what you want, splitting the XML file, so why code it yourself?
You can then process each file.
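An invocation sketch for the splitting step; the size option and its argument format are from memory, so check xml_split --help before relying on them:

```perl
use strict;
use warnings;

# Split huge_dump.xml (hypothetical name) into ~100MB pieces
# named huge_dump-01.xml, huge_dump-02.xml, ...
system('xml_split', '-s', '100Mb', 'huge_dump.xml') == 0
    or die "xml_split failed: $?";
```

Each piece is then small enough to hand to XML::Twig, or any other parser, on its own.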
And if all else fails, and the records are all in elements of the same name, you can set $/ to the closing tag, read the file one record at a time, and spit them out into several files.
But really, it's data, it should live in a database.
Re: XML parsing - huge file strategy?
by salva (Canon) on Jul 18, 2008 at 15:30 UTC
XML requires a lot of overhead, both in file size and in processing time. If you only want to dump a table, CSV is probably a better format.
And anyway, why don't you use DBI to query the database directly, processing the table data on the fly?
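A sketch of the direct-query approach, with placeholder connection parameters and column names; the mysql_use_result attribute is the DBD::mysql way to stream rows instead of pulling the whole result set into client memory:

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:database=mydb;host=localhost',
                       'user', 'password',
                       { RaiseError => 1, PrintError => 0 });

# Stream rows from the server one at a time (MySQL-specific attribute)
my $sth = $dbh->prepare('SELECT id, name, value FROM big_table',
                        { mysql_use_result => 1 });
$sth->execute;
while (my @row = $sth->fetchrow_array) {
    # ... process one row on the fly ...
}
$dbh->disconnect;
```

This skips the dump file entirely, which also means no XML or CSV parsing step at all.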
Re: XML parsing - huge file strategy?
by pc88mxer (Vicar) on Jul 18, 2008 at 14:43 UTC
Could I read it in a line at a time, look for the appropriate start/stop tags and parse it that way?
If your XML is formatted in a manner which is conducive to line-by-line parsing then it shouldn't be a problem.
Alternatively, if your database dump is just a sequence of one kind of element, this should also work:
open(my $xml, "<", ...) or die "Can't open: $!";
{
    local $/ = '</record>';   # each read returns up to and including one closing tag
    while (<$xml>) {
        # ... process one <record> element ...
    }
}
With regard to why your experiment on the 400M-record table is taking so long... are you grabbing 1M records at a time by using a LIMIT clause (i.e. LIMIT n, 1000000)? If you have an index on that table (say on column colX), adding ORDER BY colX should help speed things along.
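A sketch of that paging loop, assuming a hypothetical indexed column colX and table big_table; the LIMIT values are interpolated as plain numbers:

```perl
use strict;
use warnings;
use DBI;

my $dbh  = DBI->connect('dbi:mysql:database=mydb', 'user', 'password',
                        { RaiseError => 1 });
my $page = 1_000_000;
for (my $offset = 0; ; $offset += $page) {
    # ORDER BY an indexed column keeps each page cheap to locate
    my $rows = $dbh->selectall_arrayref(
        "SELECT colX, colY FROM big_table ORDER BY colX LIMIT $offset, $page");
    last unless @$rows;
    # ... process this batch of up to 1M rows ...
}
$dbh->disconnect;
```

Without the ORDER BY, MySQL is free to return pages in any order, and deep offsets can get progressively slower.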
Re: XML parsing - huge file strategy?
by dHarry (Abbot) on Jul 18, 2008 at 15:21 UTC
I have done this myself for a proof of concept some years ago, i.e. checking whether it would be possible to parse large XML files. With a little program I generated several simple but big XML files. Then I used an event-based parser. It worked fine and was scalable as well. Parsing the file should not be a problem, and you can try different parsers to see what suits you best.
I like pc88mxer's suggestion. As it is a dump from a relational database, the structure of the resulting XML files is probably very simple. By exploiting this you'll get the best performance.
Another option would be to chop the file/dump up into more manageable chunks, but I don't think disk space or a few hundred MBs of memory is a problem for you.
Saludos,
dHarry
Re: XML parsing - huge file strategy?
by pileofrogs (Priest) on Jul 18, 2008 at 18:30 UTC
If I were in your shoes, I'd use YAML. It's moderately readable by humans, taint-safe, and way, way simpler than XML.
This is something of a tangent, but I don't think I've ever seen a situation in which XML didn't suck.
OK, actually, I'd query the MySQL server...
Yeah, I agree, querying the MySQL server is The Way to Go.
Mostly I have an irrational hatred for XML.