XML to CSV

Blue_eyed_son has asked for the wisdom of the Perl Monks concerning the following question:

Hi Everyone--I'm converting an XML file to CSV, and have tried AnyData and DBD::AnyData to no avail, though I am generally unfamiliar with these modules. Here's what the xml looks like:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Protocol >
<Protocol id="CDR0000256224" LegacyPDQID="16746"><ProtocolIDs><Primary
+ID><IDString>EORTC-06011</IDString></PrimaryID><OtherID><IDType>Alter
+nate</IDType><IDString>SUPERGEN-EORTC-06011</IDString></OtherID><Othe
+rID><IDType>Alternate</IDType><IDString>GMDSG-EORTC-06011</IDString><
+/OtherID><OtherID><IDType>ClinicalTrials.gov ID</IDType><IDString>NCT
+00043134</IDString></OtherID><OtherID><IDType>Alternate</IDType><IDSt
+ring>EudraCT-2005-002830</IDString></OtherID></ProtocolIDs><ProtocolS
+ponsor>Other</ProtocolSponsor><ProtocolSponsor>Pharmaceutical/Industr
+y</ProtocolSponsor><PDQProtocolTitle Audience="Professional">Phase II
+I Randomized Study of Low-Dose Decitabine Versus Standard Supportive 
+Care in Elderly Patients With Myelodysplastic Syndromes </PDQProtocol
+Title><PDQProtocolTitle Audience="Patient">Low-Dose Decitabine Compar
+ed With Standard Supportive Care in
Treating Older Patients With Myelodysplastic Syndrome</PDQProtocolTitl
+e><ProtocolAbstract><Professional><Objectives><OrderedList Style="URo
+man" id="_7"><ListItem>Compare the efficacy of low-dose decitabine vs
+ standard supportive care, in terms of overall survival, of elderly p
+atients with myelodysplastic syndromes.</ListItem><ListItem>Compare t
+he response rate and progression-free survival of patients treated wi
+th these regimens.</ListItem><ListItem>Determine the toxicity of deci
+tabine in these patients.</ListItem><ListItem>Assess the duration of 
+hospitalization and number of blood transfusions in patients treated 
+with these regimens.</ListItem><ListItem>Assess the quality of life o
+f patients treated with these regimens.</ListItem></OrderedList></Obj
+ectives><Outcome OutcomeType="Primary">Duration of overall survival</
+Outcome><Outcome OutcomeType="Secondary">Best response rate as measur
+ed by Cheson response criteria </Outcome><Outcome OutcomeType="Second
+ary">Overall progression-free survival </Outcome><Outcome OutcomeType
+="Secondary">Toxicity as assessed by CTC v2.0</Outcome><Outcome Outco
+meType="Secondary">Quality of life as assessed by EORTC QLQ30</Outcom
+e><Outcome OutcomeType="Secondary">Days in Hospital</Outcome><Outline
+><Para id="_8">This is a randomized, open-label, multicenter study.  
+Patients are stratified according to cytogenetic risk factors (good v
+s poor vs intermediate vs unknown), disease (primary myelodysplastic 
+syndrome (MDS) vs secondary MDS), and participating center.  Patients
+ with a successful cytogenetic exam are also stratified according to 
+overall International Prognostic Scoring System score (intermediate 1
+ vs intermediate 2 vs high risk).  Patients are randomized to 1 of 2 
+treatment arms.</Para><ItemizedList Compact="No" Style="bullet" id="_
+40"><ListItem><Emphasis><Strong>Arm I:  </Strong></Emphasis>Patients 
+receive decitabine IV over 4 hours every 8 hours for 3 days.  Treatme
+nt repeats every 6 weeks for 4-8 courses in the absence of disease pr
+ogression or unacceptable toxicity.</ListItem><ListItem><Emphasis><St
+rong>Arm II:</Strong></Emphasis>  Patients receive standard supportiv
+e care.</ListItem></ItemizedList><Para id="_11">Quality of life is as
+sessed at baseline, every 6 weeks during therapy, every 2 months for 
+1 year, and then every 3 months thereafter.</Para><Para id="_12">Pati
+ents are followed every 2 months for 1 year and then every 3 months t
+hereafter.</Para></Outline><EntryCriteria><DiseaseCharacteristics><It
+emizedList Compact="No" Style="bullet" id="_33"><ListItem>Diagnosis o
+f primary or secondary myelodysplastic syndromes (MDS)<ItemizedList S
+tyle="dash" id="_34"><ListItem>Any FAB or WHO criteria cellular type 
+allowed</ListItem></ItemizedList></ListItem><ListItem>Bone marrow bla
+st count on aspiration or biopsy of 1 of the following:<ItemizedList 
+Style="dash" id="_36"><ListItem>No more than 10% with poor cytogeneti
+c risk factors (defined as any numerical
  or structural abnormality of chromosome 7 and/or complex abnormaliti
+es)</ListItem>
[download]

Here's the command I'm using in DBI:

my $dbh = DBI->connect('dbi:AnyData:(RaiseError=>1)');
$dbh->func( 'XML', 'CDR256224.xml', 'CSV', 'testdbh.csv', 'ad_convert'
+);
[download]

Or in plain AnyData:

adConvert('XML', "$dir/CDR256224.xml", 'CSV',"$dir/testout.csv");
[download]

And here's is all of the output I get:

IDString,IDType,OtherID_IDString
,,
[download]

Does anyone know what I need to fix so that it converts the whole file? Thanks!

Comment on XML to CSV Select or Download Code

Replies are listed 'Best First'.
Re: XML to CSV by Trizor (Pilgrim) on Jan 29, 2007 at 22:12 UTC
It would be a lot easier to tell if the XML file could even become a table if it was formatted nicer. However it looks as if the document is a little complex for CSV, especially if this apparent header information is important: `<ProtocolIDs><Primary +ID><IDString>EORTC-06011</IDString></PrimaryID><OtherID><IDType>Alter +nate</IDType><IDString>SUPERGEN-EORTC-06011</IDString></OtherID><Othe +rID><IDType>Alternate</IDType><IDString>GMDSG-EORTC-06011</IDString>< +/OtherID><OtherID><IDType>ClinicalTrials.gov ID</IDType><IDString>NCT +00043134</IDString></OtherID><OtherID><IDType>Alternate</IDType><IDSt +ring>EudraCT-2005-002830</IDString></OtherID></ProtocolIDs>` [download] Also the records seem to have a large ammount of metadata. This XML document really doesn't seem to be easily flattened.	[reply] [d/l]
Re^2: XML to CSV by Blue_eyed_son (Sexton) on Jan 29, 2007 at 22:49 UTC
Hi Trizor--Thanks for your response. When I use XML::Simple, it successfully reads it in. Is there an easy way to transform the object I get from `$data = $xml->XMLin("CDR256224.xml");` [download] Into a tab-delimited or csv file?	[reply] [d/l]
Re^3: XML to CSV by Trizor (Pilgrim) on Jan 29, 2007 at 23:01 UTC
No, its not easy because the dificulty is in the nature of the data. The record doesn't easily break down into simple columns: you seem to be able to have several variable number things. CSV files don't work well with variable numbers of data points per record. To take your object and turn it into a csv you would first have to remove the header component of the document, then for each record figure out which feilds are variable, which record has the greatest number, and assemble your CSV that way. If you only need the header data then I reccomend something like XML::XPath to select the header nodes and extract their values, then assemble an AoA and dump that to your CSV. Basically the data is too complex for a csv to handle easily, and you'll need to make some representation and implementation choices.	[reply]
Re: XML to CSV by jZed (Prior) on Jan 29, 2007 at 23:19 UTC
I'm afraid there may be a bug in the AnyData convert(). Try to just do it with SQL: `my $dbh = DBI->connect('dbi:AnyData(RaiseError=>1):'); $dbh->ad_catalog('myXml','XML','old.xml'); $dbh->ad_catalog('myCsv','CSV','new.csv'); $dbh->do("CREATE TABLE myCsv AS SELECT * FROM myXml");` [download] Trizor is correct that you need to select a portion of the XML that fits into a table-like structure. AnyData uses XML::Twig under the hood so it does allow you to pick and choose (to a certain extent) what part of the XML file to convert to CSV. You may have to experiment with the col_names and other attributes, see AnyData::Format::XML.	[reply] [d/l]
Re: XML to CSV by graff (Chancellor) on Jan 30, 2007 at 06:08 UTC
As others have said, the complexity of this xml structure doesn't really support a direct conversion to CSV. Indeed, I'd guess there are a variety of ways in which CSV output might be constructed from this xml data. Apart from that, it appears that a lot of the xml structure is involved with typesetting/formatting of the content: tags like "OrderedList", "ItemizedList", "ListItem", "Style", "Emphasis", "Strong", etc, indicate stuff that is intended for CSS handling, rather than database construction. Some of these tags are only vaguely "structural" (in the sense of describing the logical/semantic organization of the data), while others may be purely "cosmetic". I don't see any obvious way to transform this xml into a coherent csv. So you need a more informed specification of your goal: what exactly should the csv file contain (i.e. how many fields are needed, and what are their names)? And with that in mind, how can the csv rows and columns be filled in, based on the clues available in the xml data? Some manual analysis of the data will be needed in order to write the code. Actually, it's possible (likely?) that a proper solution will involve two or more relational tables rather than just one table. And while you're at it, you'll need to worry about making sure your csv output is well formed: quote fields where necessary, escape quotes and apostrophes within fields as needed, and watch out for newlines embedded within field values (maybe normalize these to spaces).	[reply]