in reply to Converting text to XML; Millions of records.

I have a rough idea how to do this through shell scripting, but given the size of the file to be parsed, I thought it might be best to use Perl?

Why? Not that I would like to discourage using Perl, of course ;) If you have a rough idea how to do it with a shell script, surely you can do it in Perl. Give it a try and ask for help if you get stuck. You might want to take a look at XML::XMLWriter.
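
For illustration, a dead simple streaming approach needs nothing beyond core Perl. The following sketch assumes a tab-separated input with two made-up fields; the real record layout will of course differ:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Escape the characters that are unsafe in XML text and
    # attribute values.
    sub xml_escape {
        my $s = shift;
        $s =~ s/&/&amp;/g;
        $s =~ s/</&lt;/g;
        $s =~ s/>/&gt;/g;
        $s =~ s/"/&quot;/g;
        return $s;
    }

    print qq{<?xml version="1.0" encoding="UTF-8"?>\n<records>\n};
    # Read the input line by line, writing each record as soon as
    # it is read, so memory use stays constant no matter how many
    # millions of records there are.
    while (my $line = <STDIN>) {
        chomp $line;
        my ($id, $title) = split /\t/, $line;  # assumed input layout
        printf qq{  <record id="%s"><title>%s</title></record>\n},
            xml_escape($id), xml_escape($title);
    }
    print "</records>\n";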


Re^2: Converting text to XML; Millions of records.
by mzedeler (Pilgrim) on Jul 07, 2009 at 18:47 UTC

    First off, I agree with roboticus: the proposed XML schema is obscure and doesn't add any value, so if you have any say, please try to get it changed. I'd also suggest trying the marc2xml tools, but given that they aren't suitable here, read on...

    As far as I can see, XML::XMLWriter doesn't stream its output, which means you'll be buffering a data structure representing the entire XML document in memory. Given the expected output size, that approach is only workable if the output is generated in chunks (one per record described in the question) - and XML::XMLWriter insists on inserting processing instructions, which makes chunk generation infeasible without nasty hacks.

    For generating chunks, I'd suggest XML::Generator - a wonderfully simple and flexible module.
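
    To give a rough idea, here's how a single record chunk could be built with XML::Generator; the element and attribute names are invented for the example:

        use strict;
        use warnings;
        use XML::Generator;

        # escape => 'always' makes the module escape text content and
        # attribute values for us; pretty => 2 indents nested elements.
        my $gen = XML::Generator->new(escape => 'always', pretty => 2);

        # Tag methods are generated on the fly: a leading hashref
        # becomes attributes, the remaining arguments become content.
        my $chunk = $gen->record(
            { id => 42 },
            $gen->title('An example title'),
            $gen->author('An example author'),
        );

        print $chunk, "\n";   # <record id="42">...</record>

    Each chunk stringifies to a self-contained fragment with no XML declaration attached, so you can print the open tag of a hand-written root element, emit one chunk per record, and close the root tag at the end.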

    If the chunk approach is undesirable, I'd look for a module that can serialize SAX events.
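
    XML::SAX::Writer is one such module: it is itself a SAX handler, so firing events at it by hand serializes them straight to the output with nothing buffered. A minimal sketch, with illustrative element names:

        use strict;
        use warnings;
        use XML::SAX::Writer;

        # The writer consumes SAX events and writes XML as they arrive.
        my $writer = XML::SAX::Writer->new(Output => \*STDOUT);

        # SAX2 element structures; Attributes is left empty here.
        my $root = { Name => 'records', LocalName => 'records',
                     Prefix => '', NamespaceURI => '', Attributes => {} };
        my $rec  = { Name => 'record', LocalName => 'record',
                     Prefix => '', NamespaceURI => '', Attributes => {} };

        $writer->start_document({});
        $writer->start_element($root);

        # In the real script this block would run once per input record.
        $writer->start_element($rec);
        $writer->characters({ Data => 'field data goes here' });
        $writer->end_element($rec);

        $writer->end_element($root);
        $writer->end_document({});

    This keeps the whole document well-formed without any manual tag bookkeeping, at the cost of somewhat verbose event structures.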