MikeEndo has asked for the wisdom of the Perl Monks concerning the following question:
I need to write a script that converts a text file containing several million records into an XML (MARCXML) file. I have a rough idea of how to do this with shell scripting, but given the size of the file to be parsed, I thought it might be better to do it in Perl?
The basic text record is as follows:

    *** DOCUMENT BOUNDARY ***
    .000. |aam 0c --> This can be ignored
    .001. |aa1292700
    .003. |aSIRSI
    .299. |aSymphonies, no.7/Vaughan Williams
    .702. |aThomson, Bryden,|b1928-1991|cConductor
    .702. |aBott, Catherine|b1952|cSoprano
    .702. |aLondon Symphony Chorus
    .702. |aLondon Symphony Orchestra
    .315. |aS
    .021. |aND 7382902
    .301. |a83'31"
    .551. |aSt Jude's Kilburn London
    .260. |c1989.06.21/22
    .509. |a1989 Original recording (P) date
    .971. |ade
    .976. |aND
    .087. |a1CD0027302
    .087. |a1CD0043184
    .001. |aCKEY1292700 --> This can be ignored
    *** DOCUMENT BOUNDARY ***

This is then converted to XML as follows:

    <record>
      <controlfield tag="001">aa1292700</controlfield>
      <controlfield tag="003">aSIRSI</controlfield>
      <datafield tag="299" ind1=" " ind2=" ">
        <subfield code="a">Symphonies, no.7/Vaughan Williams</subfield>
      </datafield>
      <datafield tag="702" ind1="" ind2="">
        <subfield code="a">Thomson, Bryden</subfield>
        <subfield code="b">1928-1991</subfield>
        <subfield code="c">Conductor</subfield>
      </datafield>
      <datafield tag="702" ind1="" ind2="">
        <subfield code="a">Bott, Catherine</subfield>
        <subfield code="b">1952</subfield>
        <subfield code="c">Soprano</subfield>
      </datafield>
      <datafield tag="702" ind1="" ind2="">
        <subfield code="a">London Symphony Chorus</subfield>
      </datafield>
      <datafield tag="702" ind1="" ind2="">
        <subfield code="a">London Symphony Orchestra</subfield>
      </datafield>
      <datafield tag="315" ind1="" ind2="">
        <subfield code="a">S</subfield>
      </datafield>
      <datafield tag="021" ind1="" ind2="">
        <subfield code="a">ND 7382902</subfield>
      </datafield>
      <datafield tag="301" ind1="" ind2="">
        <subfield code="a">83'31"</subfield>
      </datafield>
      <datafield tag="551" ind1="" ind2="">
        <subfield code="a">St Jude's Kilburn London</subfield>
      </datafield>
      <datafield tag="260" ind1="" ind2="">
        <subfield code="c">1989.06.21/22</subfield>
      </datafield>
      <datafield tag="509" ind1="" ind2="">
        <subfield code="a">1989 Original recording (P) date</subfield>
      </datafield>
      <datafield tag="971" ind1="" ind2="">
        <subfield code="a">de</subfield>
      </datafield>
      <datafield tag="976" ind1="" ind2="">
        <subfield code="a">ND</subfield>
      </datafield>
      <datafield tag="087" ind1="" ind2="">
        <subfield code="a">1CD0027302</subfield>
      </datafield>
      <datafield tag="087" ind1="" ind2="">
        <subfield code="a">1CD0043184</subfield>
      </datafield>
    </record>

Note that numbers 001 to 009 are controlfields (only 001 and 003 in the records), whilst all other numbers are datafields.

    <datafield tag="702" ind1="" ind2="">
      <subfield code="a">Thomson, Bryden</subfield>
      <subfield code="b">1928-1991</subfield>
      <subfield code="c">Conductor</subfield>
    </datafield>

The records run sequentially, i.e.

    record
    *** DOCUMENT BOUNDARY ***
    record
    *** DOCUMENT BOUNDARY ***
    record
    *** DOCUMENT BOUNDARY ***
    record
    *** DOCUMENT BOUNDARY ***
I need a routine that converts the flat file into the XML file using the rules above. Each record may have a varying number of datafields, and each datafield a varying number of subfields.
Any initial ideas would be greatly appreciated.
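One way the rules above might be sketched in Perl is to read the file line by line, so that memory use stays flat however many records there are. Everything here is an assumption drawn from the single sample record, not a definitive implementation: the sub names (`xml_escape`, `field_to_xml`, `convert`) are made up for illustration; the `.000.` line and any second `.001.` line in a record are skipped; a trailing comma is dropped from subfield values; both indicators are written as a single space, as in the first datafield of the sample; and the records are wrapped in a `<collection>` root element in the MARCXML namespace. The input is assumed to be ASCII or UTF-8 already, and the "--> This can be ignored" notes are assumed not to be present in the real data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Escape the characters that must not appear raw in XML text.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Turn one field (tag plus raw data starting at the first pipe) into XML.
sub field_to_xml {
    my ($tag, $data) = @_;

    # Tags 001-009 are controlfields. As in the sample, the value is
    # everything after the leading pipe, subfield code included.
    if ($tag lt '010') {
        (my $value = $data) =~ s/^\|//;
        return sprintf(qq{  <controlfield tag="%s">%s</controlfield>\n},
                       $tag, xml_escape($value));
    }

    # Everything else is a datafield; each "|x..." chunk is one subfield.
    my $xml = qq{  <datafield tag="$tag" ind1=" " ind2=" ">\n};
    while ($data =~ /\|([^|])([^|]*)/g) {
        my ($code, $value) = ($1, $2);
        $value =~ s/[,\s]+$//;            # assumption: drop trailing comma
        $code = xml_escape($code);
        $code =~ s/"/&quot;/g;            # code sits inside an attribute
        $xml .= sprintf(qq{    <subfield code="%s">%s</subfield>\n},
                        $code, xml_escape($value));
    }
    return $xml . "  </datafield>\n";
}

# Stream records from $in to $out, one line at a time.
sub convert {
    my ($in, $out) = @_;
    print {$out} qq{<?xml version="1.0"?>\n};
    print {$out} qq{<collection xmlns="http://www.loc.gov/MARC21/slim">\n};

    my $open     = 0;   # true while a <record> element is open
    my $seen_001 = 0;   # number of 001 fields seen in the current record
    while (my $line = <$in>) {
        $line =~ s/\s+$//;
        if ($line =~ /^\*\*\* DOCUMENT BOUNDARY \*\*\*/) {
            print {$out} "</record>\n" if $open;
            ($open, $seen_001) = (0, 0);
            next;
        }
        # Lines that do not look like ".NNN. |..." are silently skipped.
        next unless $line =~ /^\.(\d{3})\.\s*(\|.*)$/;
        my ($tag, $data) = ($1, $2);
        next if $tag eq '000';                     # assumption: ignore leader
        next if $tag eq '001' && $seen_001++;      # assumption: keep first 001
        print {$out} "<record>\n" unless $open;
        $open = 1;
        print {$out} field_to_xml($tag, $data);
    }
    print {$out} "</record>\n" if $open;
    print {$out} "</collection>\n";
    return;
}

# Small demonstration on an in-memory copy of part of the sample record.
my $sample = <<'END';
*** DOCUMENT BOUNDARY ***
.000. |aam 0c
.001. |aa1292700
.003. |aSIRSI
.702. |aThomson, Bryden,|b1928-1991|cConductor
.001. |aCKEY1292700
*** DOCUMENT BOUNDARY ***
END
open my $in, '<', \$sample or die "Cannot open sample: $!";
convert($in, \*STDOUT);
```

For the real file, `convert` would be given a filehandle opened on the input file and another opened for writing the output, in place of the in-memory string and `STDOUT`. Because nothing is held beyond the current line, the same loop should cope with several million records.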
Re: Converting text to XML; Millions of records.
by dHarry (Abbot) on Jul 07, 2009 at 08:14 UTC
by mzedeler (Pilgrim) on Jul 07, 2009 at 18:47 UTC

Re: Converting text to XML; Millions of records.
by roboticus (Chancellor) on Jul 07, 2009 at 16:01 UTC
by superfrink (Curate) on Jul 07, 2009 at 18:35 UTC

Re: Converting text to XML; Millions of records.
by Anonymous Monk on Jul 07, 2009 at 08:14 UTC

Re: Converting text to XML; Millions of records.
by Anonymous Monk on Jul 08, 2009 at 08:58 UTC