Hi Monks,

I need to write a script that converts a text file containing several million records into an XML (MARCXML) file. I have a rough idea of how to do this with shell scripting, but given the size of the file to be parsed, I thought it might be better to use Perl.

The basic text record is as follows:

*** DOCUMENT BOUNDARY ***
.000. |aam  0c --> This can be ignored
.001. |aa1292700
.003. |aSIRSI
.299.   |aSymphonies, no.7/Vaughan Williams
.702.   |aThomson, Bryden,|b1928-1991|cConductor
.702.   |aBott, Catherine|b1952|cSoprano
.702.   |aLondon Symphony Chorus
.702.   |aLondon Symphony Orchestra
.315.   |aS
.021.   |aND 7382902
.301.   |a83'31"
.551.   |aSt Jude's Kilburn London
.260.   |c1989.06.21/22
.509.   |a1989 Original recording (P) date
.971.   |ade
.976.   |aND
.087.   |a1CD0027302
.087.   |a1CD0043184
.001. |aCKEY1292700 --> This can be ignored
*** DOCUMENT BOUNDARY ***

This is then converted to XML as follows:

<record>
<controlfield tag="001">aa1292700</controlfield>
<controlfield tag="003">aSIRSI</controlfield>
<datafield tag="299" ind1=" " ind2=" ">
<subfield code="a">Symphonies, no.7/Vaughan Williams</subfield>
</datafield>
<datafield tag="702" ind1="" ind2="">
<subfield code="a">Thomson, Bryden</subfield>
<subfield code="b">1928-1991</subfield>
<subfield code="c">Conductor</subfield>
</datafield>
<datafield tag="702" ind1="" ind2="">
<subfield code="a">Bott, Catherine</subfield>
<subfield code="b">1952</subfield>
<subfield code="c">Soprano</subfield>
</datafield>
<datafield tag="702" ind1="" ind2="">
<subfield code="a">London Symphony Chorus</subfield>
</datafield>
<datafield tag="702" ind1="" ind2="">
<subfield code="a">London Symphony Orchestra</subfield>
</datafield>
<datafield tag="315" ind1="" ind2="">
<subfield code="a">S</subfield>
</datafield>
<datafield tag="021" ind1="" ind2="">
<subfield code="a">ND 7382902</subfield>
</datafield>
<datafield tag="301" ind1="" ind2="">
<subfield code="a">83'31"</subfield>
</datafield>
<datafield tag="551" ind1="" ind2="">
<subfield code="a">St Jude's Kilburn London</subfield>
</datafield>
<datafield tag="260" ind1="" ind2="">
<subfield code="c">1989.06.21/22</subfield>
</datafield>
<datafield tag="509" ind1="" ind2="">
<subfield code="a">1989 Original recording (P) date</subfield>
</datafield>
<datafield tag="971" ind1="" ind2="">
<subfield code="a">de</subfield>
</datafield>
<datafield tag="976" ind1="" ind2="">
<subfield code="a">ND</subfield>
</datafield>
<datafield tag="087" ind1="" ind2="">
<subfield code="a">1CD0027302</subfield>
</datafield>
<datafield tag="087" ind1="" ind2="">
<subfield code="a">1CD0043184</subfield>
</datafield>
</record>

Note that tags 001 to 009 are controlfields (only 001 and 003 occur in my records), whilst all other tags are datafields.
Subfields (within datafields) are introduced by a pipe followed by a one-letter subfield code (a, b, c). For example, the line
.702.   |aThomson, Bryden,|b1928-1991|cConductor
becomes:

<datafield tag="702" ind1="" ind2="">
<subfield code="a">Thomson, Bryden</subfield>
<subfield code="b">1928-1991</subfield>
<subfield code="c">Conductor</subfield>
</datafield>
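The subfield rule above can be sketched roughly as follows. This is only an illustration in Python (the same regex-and-split approach carries over to Perl); the name `parse_line` is hypothetical, and it assumes every field line has the shape `.TAG.` followed by pipe-introduced subfields, as in the sample record.

```python
import re

# Assumed line shape: ".TAG." then optional whitespace, then "|"-introduced subfields.
LINE_RE = re.compile(r"^\.(\d{3})\.\s*(.*)$")


def parse_line(line):
    """Split one flat-file line into (tag, [(code, value), ...]).

    Returns None for lines that do not look like a field line,
    such as the document boundary marker.
    """
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    tag, body = match.groups()
    subfields = []
    # Everything before the first pipe is discarded; each later chunk
    # starts with the one-letter subfield code.
    for chunk in body.split("|")[1:]:
        if not chunk:
            continue
        code, value = chunk[0], chunk[1:].strip()
        # Assumption: a trailing comma is dropped, as in the
        # "Thomson, Bryden" sample output.
        subfields.append((code, value.rstrip(",")))
    return tag, subfields
```

A value that itself contains a pipe would be split incorrectly by this sketch; the sample data does not show how such a case is represented.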

The records run sequentially, i.e.

record
*** DOCUMENT BOUNDARY ***
record
*** DOCUMENT BOUNDARY ***
record
*** DOCUMENT BOUNDARY ***
record
*** DOCUMENT BOUNDARY ***
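Because the records simply follow one another, the file can be read line by line and never needs to be held in memory as a whole, which matters with several million records. A minimal sketch in Python, assuming the boundary text is exactly as shown above (`iter_records` is a hypothetical name):

```python
BOUNDARY = "*** DOCUMENT BOUNDARY ***"


def iter_records(lines):
    """Yield each record as a list of its raw lines, one record at a time.

    `lines` is any iterable of strings, e.g. an open file handle, so only
    the current record is kept in memory.
    """
    current = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.strip() == BOUNDARY:
            # A boundary closes the record collected so far, if any.
            if current:
                yield current
            current = []
        elif line.strip():
            current.append(line)
    # A final record without a trailing boundary is still emitted.
    if current:
        yield current
```

Passing an open file object (for example `iter_records(open(path, encoding="utf-8"))`, where the encoding is a guess) would stream the input one record at a time.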

I need a routine that converts the flat file into the XML file using the rules above. The number of datafields varies from record to record, as does the number of subfields within each datafield.
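To make the output rules concrete, here is a rough sketch of the rendering step in Python. It assumes each record has already been parsed into `(tag, [(code, value), ...])` tuples; `record_to_xml` is a hypothetical name. Several choices are guesses based on the sample: tag 000 and any second 001 are skipped, controlfields keep the code letter in their text as the sample output does, and blank indicators are written as a single space as in the first datafield of the sample.

```python
from xml.sax.saxutils import escape, quoteattr


def record_to_xml(fields):
    """Render parsed fields as one <record> element.

    `fields` is a list of (tag, [(code, value), ...]) tuples.
    """
    out = ["<record>"]
    seen_001 = False
    for tag, subfields in fields:
        if tag == "000":
            continue  # Marked "can be ignored" in the sample.
        if tag == "001":
            # Assumption: only the first 001 is kept; the trailing
            # CKEY line is the one marked "can be ignored".
            if seen_001:
                continue
            seen_001 = True
        if "001" <= tag <= "009":
            # Mirrors the sample output, which keeps the code letter
            # in the controlfield text (e.g. "aa1292700").
            text = "".join(code + value for code, value in subfields)
            out.append("<controlfield tag=%s>%s</controlfield>"
                       % (quoteattr(tag), escape(text)))
        else:
            out.append('<datafield tag=%s ind1=" " ind2=" ">' % quoteattr(tag))
            for code, value in subfields:
                # escape() protects &, < and > inside the values.
                out.append("<subfield code=%s>%s</subfield>"
                           % (quoteattr(code), escape(value)))
            out.append("</datafield>")
    out.append("</record>")
    return "\n".join(out)
```

Writing each record's string to the output file as soon as it is built, inside a single enclosing root element, would keep memory use flat regardless of the number of records.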

Any initial ideas would be greatly appreciated.
Thanks
MikeE

In reply to Converting text to XML; Millions of records. by MikeEndo
