Im looking into different strategies for dividing up a huge MySQL table. Its not indexed properly and as a result one can't really get anything out of it. One strategy that Im looking into is dumping it out as XML and parsing it using Perl.

My ultimate goal is to break it up into a number of smaller tables that are properly indexed using a column from the original table. Its all medical data so each record is tagged with a subject ID.

The table in question has 1.77 billion records so the resultant XML file would probably be in the neighborhood of 200Gb. I have some fairly robust servers to do the work but Im more concerned about the process of parsing such a monster.

My question: is there an XML parser in Perl that will load only a small section of a file, parse it and move on to the next section? I've only worked with XML::Simple in the past and Im sure its not up to this.

Could I read it in a line at a time, look for the appropriate start/stop tags and parse it that way? Im trying the MySQL route on a similar table at the moment (400M records) but its probably going to be 24 hours of work. My process for that table is to read 1M records into a memory table and fork 8 processes at a time (its an 8 way machine) to select out records by subject ID and write them to individual tables. It started out reading each 1M records at about 10seconds. Its up to record 300M or so but each query is taking 10 minutes now. I can't imagine how long each query would be for this larger table.

Suggestions?

In reply to XML parsing - huge file strategy? by ethrbunny

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.