MikeEndo:

(Note: I realize that you may have no control over the project requirements. I also realize that the example you gave may be a toy example. But I feel I must comment anyway...)

As I see it, XML is bloated and ugly. However, it's useful because it allows you to make your data descriptive and easier to parse and reuse in new ways. So I suggest that you change your schema, if possible. I don't really see how

<datafield tag="702" ind1="" ind2="">
  <subfield code="a">Thomson, Bryden</subfield>
  <subfield code="b">1928-1991</subfield>
  <subfield code="c">Conductor</subfield>
</datafield>

is any more descriptive than the original file. I feel you would be better served giving descriptive tags to your data. Perhaps something like:

<conductor>
  <Name>
    <Last>Thomson</Last>
    <First>Bryden</First>
  </Name>
  <Born>1928</Born>
  <Died>1991</Died>
</conductor>
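For what it's worth, here is a rough Python sketch of how such a record could be built from the three subfield values. The function name `conductor_element` is made up for illustration, and it assumes the name always arrives as "Last, First" and the dates as "born-died"; real data will almost certainly need more defensive handling than this.

```python
import xml.etree.ElementTree as ET

def conductor_element(name, years):
    """Build a descriptive <conductor> element.

    Assumes (hypothetically) that `name` looks like "Last, First"
    and `years` looks like "born-died"; anything else raises ValueError.
    """
    last, first = [part.strip() for part in name.split(",", 1)]
    born, died = [part.strip() for part in years.split("-", 1)]

    conductor = ET.Element("conductor")
    name_el = ET.SubElement(conductor, "Name")
    ET.SubElement(name_el, "Last").text = last
    ET.SubElement(name_el, "First").text = first
    ET.SubElement(conductor, "Born").text = born
    ET.SubElement(conductor, "Died").text = died
    return conductor

# Serialize the sample record from the post to a string.
xml_text = ET.tostring(
    conductor_element("Thomson, Bryden", "1928-1991"),
    encoding="unicode",
)
print(xml_text)
```

The point is only that the splitting happens once, at conversion time, so that whoever reads the XML later does not have to know what subfield "b" means.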

In my job, I *frequently* have to reverse engineer file formats, and I would much rather reverse engineer the original file format than the XML version, unless the tags were meaningful. Without meaningful field names, the markup is just visual clutter that makes patterns in the data harder to spot.

Just my $0.02.

...roboticus


In reply to Re: Converting text to XML; Millions of records. by roboticus
in thread Converting text to XML; Millions of records. by MikeEndo
