jamesd256 has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I have an XML file of around 10 million lines which I would like to import into a MySQL database.

I have installed XML::RDB and written a script, and it works great on a subset of around 100,000 lines of XML.

When I try to call  $rdb->make_tables(...) on the main file, it fails. The memory footprint climbed to 23 GB (on a box with 24 GB of RAM), then the process went into status 'D' (uninterruptible sleep) and had to be killed.

I have a script for splitting the main file into smaller files, but I can't get this to work either, because of the way XML::RDB seems to work.

Using the split approach, I would call  $rdb->make_tables(..) on the first batch, import the resulting DB schema, then call $rdb->populate_tables(...). On the second and later batches I would call only $rdb->populate_tables(..), thinking that, with the schema already in place, the data would import into the existing tables.
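In outline, the split approach would look something like this. The method names are taken from the question; the constructor arguments and file names are assumptions, so check the XML::RDB documentation before relying on them:

```perl
use strict;
use warnings;
use XML::RDB;

# Config file name is hypothetical.
my $rdb = XML::RDB->new(config_file => 'rdb.config');

# First batch: generate the schema, load it into MySQL
# (e.g. via the mysql CLI), then populate.
$rdb->make_tables('batch_001.xml', 'schema.sql');
$rdb->populate_tables('batch_001.xml');

# Later batches: the schema already exists, so only populate.
$rdb->populate_tables('batch_002.xml');
```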

Instead, there are hundreds of messages like this one:

DBD::mysql::st execute failed: Duplicate entry '3' for key 1 at /usr/local/share/perl/5.10.0/DBIx/Database.pm line 150
So I guess XML::RDB is not suited to splitting the import in this way, owing to the way it maps the XML onto tables.

My questions are:

i) Can anyone suggest a way to get this working via XML::RDB?

ii) Is there a better way to get the data into a form where querying it would give the performance of a MySQL query against an indexed database?

Thanks for reading.

Replies are listed 'Best First'.
Re: XML RDBMS import
by roboticus (Chancellor) on May 19, 2010 at 11:49 UTC

    jamesd256:

    Perhaps your problem is that you're running populate_tables in both runs, giving you some duplicate records. Maybe it would work if you remove the populate_tables from the first pass? On re-reading the node, it appears that I'm mistaken and that you've already separated out the distinct records. However, does your XML split script perhaps include all "dependent" records (as in a foreign-key relationship)? If so, perhaps it's one of the foreign key tables that includes duplicates.

    Alternatively, perhaps you can examine your schema that XML::RDB generated, and use XPath queries to split out the XML into individual tables, and then process the smaller files individually on a per-table basis?

    ...roboticus

    Update: added CPAN link, added italicised section after re-reading OP.

      Yes, I think you are on the right track.

      I think the issue with splitting the XML is that XML::RDB extracts shared values into a one-to-many relationship to avoid redundancy. This is great, but it defeats the possibility of splitting.

      If I'm right, then during the second import it tries to create the shared values as foreign-key entries, without checking whether they already exist.

      I was hoping someone would suggest a way to get it to check for pre-existing fk entries, or some such.
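      One workaround at the database level, if the generated INSERTs can be routed through your own DBI handle, is MySQL's INSERT IGNORE or ON DUPLICATE KEY UPDATE, which make a re-inserted row a no-op rather than an error. A sketch (the table and column names here are hypothetical, not what XML::RDB actually generates):

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('DBI:mysql:database=mydb', 'user', 'pass',
                       { RaiseError => 1 });

# INSERT IGNORE skips rows that would violate a unique key,
# so re-importing a shared value no longer aborts the run.
$dbh->do(q{INSERT IGNORE INTO shared_values (id, value) VALUES (?, ?)},
         undef, 3, 'foo');

# Alternatively, make the insert idempotent:
$dbh->do(q{INSERT INTO shared_values (id, value) VALUES (?, ?)
           ON DUPLICATE KEY UPDATE value = VALUES(value)},
         undef, 3, 'foo');
```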

      Looks like I might have to go down a more manual route, which is a shame as XML::RDB looks pretty elegant otherwise.

Re: XML RDBMS import
by Jenda (Abbot) on May 19, 2010 at 13:54 UTC

    It seems XML::RDB loads the whole XML into memory using XML::DOM. Is the structure of the XML so complicated that you really need the magic of XML::RDB?
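    If it isn't, a streaming parser such as XML::Twig can insert rows as it parses and keep memory flat, since each element is discarded once handled. A minimal sketch, assuming a flat <record> element with two child fields (element, table, and column names are all hypothetical):

```perl
use strict;
use warnings;
use XML::Twig;
use DBI;

my $dbh = DBI->connect('DBI:mysql:database=mydb', 'user', 'pass',
                       { RaiseError => 1 });
my $sth = $dbh->prepare(
    'INSERT INTO records (name, value) VALUES (?, ?)');

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called once per <record> as it is parsed.
        record => sub {
            my ($t, $rec) = @_;
            $sth->execute($rec->first_child_text('name'),
                          $rec->first_child_text('value'));
            $t->purge;   # free the handled element - keeps memory flat
        },
    },
);
$twig->parsefile('big.xml');
```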

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.