XML parsing question

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello! I am new to xml and have been tasked with formatting some xml input into a csv file. This is a snippet of the xml file:

<?xml version="1.0" encoding="UTF-8"?>
<results>
<document id="\2006\200601\20060125\20060125_18.txt"
            datetime="2006/01/25"
            sourcecategory="News Archive"
            schemeversion="1.1">


  </document>
  <document id="\2006\200601\20060125\20060125_19.txt"
            datetime="2006/01/25"
            sourcecategory="News Archive"
            schemeversion="1.1">

  <record>
    <sentence-number>3</sentence-number>
    <data-class>Target</data-class>
    <group>P6</group>
  </record>

  <record>
    <sentence-number>12</sentence-number>
    <data-class>Good</data-class>
    <group>P6</group>
  </record>

  </document>
</results>

Using this code I am able to print out the data listed WITHIN the <document> tag:


#!/usr/bin/perl

# open an output file
unless (open (OUTFILE, ">testoutput.xml")){
die ("Cannot open output file testoutput.xml\n");
}

use XML::Simple;

my $file = "infile.xml";
my $xs1 = XML::Simple->new();

my $doc = $xs1->XMLin($file);

foreach my $key (keys (%{$doc->{document}})){
    print $doc->{document}->{$key}->{'datetime'}, ",",
          '(' . $key . ')', ",",
          $doc->{document}->{$key}->{sourcecategory}, ",",
          $doc->{document}->{$key}->{schemeversion}, ",", "\n";
}

I get something like: 2006/01/25,\2006\200601\20060125\20060125_18.txt,News Archive,1.1 which is perfect, but I have NO IDEA how to get at the second level of data held inside the <record> tags. Some of the <document>s have them and some don't, and the ones that do could have any number of them (1, 2, 7, 14, etc.). Can anyone help me access them? I'd like to print a line for each <record> that also lists the information contained in the corresponding <document>. Ideally it would look like this:

2006/01/25,\2006\200601\20060125\20060125_18.txt,News Archive,1.1
2006/01/25,\2006\200601\20060125\20060125_19.txt,News Archive,1.1,3,Target,P6
2006/01/25,\2006\200601\20060125\20060125_19.txt,News Archive,1.1,12,Good,P6

I hope someone can help! Thanks, Hans


Replies are listed 'Best First'.
Re: XML parsing question by ikegami (Patriarch) on Sep 01, 2009 at 03:23 UTC
ug. The problem with XML::Simple is that it doesn't return a consistent format unless you disable everything it considers an advantage. The code below uses XML::LibXML instead, and it's much faster. For good measure, I used Text::CSV_XS for output.

use strict;
use warnings;

use Text::CSV_XS qw( );
use XML::LibXML  qw( );

my $csv = Text::CSV_XS->new({ binary => 1, eol => $/ });

my $parser = XML::LibXML->new();
my $doc    = $parser->parse_file('infile.xml');
my $root   = $doc->documentElement();

for my $doc_node ( $root->findnodes('/results/document') ) {
   # The four <document> attributes, in output order
   my @doc_fields = map $doc_node->getAttribute($_),
      qw( datetime id sourcecategory schemeversion );

   # A document without records still gets one line of its own
   my @rec_nodes = $doc_node->findnodes('record');
   if (!@rec_nodes) {
      $csv->print(STDOUT, \@doc_fields);
      next;
   }

   # Otherwise one line per record, prefixed with the document fields
   for my $rec_node ( @rec_nodes ) {
      $csv->print(STDOUT, [
         @doc_fields,
         map $rec_node->findvalue("$_/text()"),
            qw( sentence-number data-class group )
      ]);
   }
}

Re: XML parsing question by toolic (Bishop) on Sep 01, 2009 at 03:22 UTC
Using XML::Twig:

use strict;
use warnings;
use XML::Twig;

my $xfile = <<'EOF';
<?xml version="1.0" encoding="UTF-8"?>
<results>
  <document id="\2006\200601\20060125\20060125_18.txt"
            datetime="2006/01/25"
            sourcecategory="News Archive"
            schemeversion="1.1">
  </document>
  <document id="\2006\200601\20060125\20060125_19.txt"
            datetime="2006/01/25"
            sourcecategory="News Archive"
            schemeversion="1.1">
  <record>
    <sentence-number>3</sentence-number>
    <data-class>Target</data-class>
    <group>P6</group>
  </record>
  <record>
    <sentence-number>12</sentence-number>
    <data-class>Good</data-class>
    <group>P6</group>
  </record>
  </document>
</results>
EOF

my $t = XML::Twig->new( twig_handlers => {document => \&doc} );
$t->parse($xfile);

sub doc {
    my ($twig, $doc) = @_;
    my $doc_csv = join ',',
        $doc->att('datetime'),
        $doc->att('id'),
        $doc->att('sourcecategory'),
        $doc->att('schemeversion');
    print "$doc_csv\n" unless $doc->children('record');
    for my $rec ($doc->children('record')) {
        print join ',',
            $doc_csv,
            $rec->first_child('sentence-number')->text(),
            $rec->first_child('data-class'     )->text(),
            $rec->first_child('group'          )->text();
        print "\n";
    }
}

__END__

2006/01/25,\2006\200601\20060125\20060125_18.txt,News Archive,1.1
2006/01/25,\2006\200601\20060125\20060125_19.txt,News Archive,1.1,3,Target,P6
2006/01/25,\2006\200601\20060125\20060125_19.txt,News Archive,1.1,12,Good,P6

Re: XML parsing question by astroboy (Chaplain) on Sep 01, 2009 at 03:16 UTC
Hi, you should dump $doc with something like Data::Dumper:

$VAR1 = {
  'document' => {
    '\\2006\\200601\\20060125\\20060125_19.txt' => {
      'record' => [
        {
          'group' => 'P6',
          'data-class' => 'Target',
          'sentence-number' => '3'
        },
        {
          'group' => 'P6',
          'data-class' => 'Good',
          'sentence-number' => '12'
        }
      ],
      'sourcecategory' => 'News Archive',
      'datetime' => '2006/01/25',
      'schemeversion' => '1.1'
    },
    '\\2006\\200601\\20060125\\20060125_18.txt' => {
      'sourcecategory' => 'News Archive',
      'datetime' => '2006/01/25',
      'schemeversion' => '1.1'
    }
  }
};

As you can see, the records are an array ref of hashrefs at the same level as the document's attributes ($doc->{document}->{$key}->{record}). You can iterate over the records at this point (i.e. check if record is defined and is an arrayref).

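The "check if record is defined and is an arrayref" step could be sketched roughly as follows. This is only an illustration, not code from the thread: records_of and csv_lines are hypothetical helper names, and the hand-written $doc hash stands in for what XMLin() returned in the dump above, so the sketch runs without any XML module installed. It also assumes that a document with exactly one <record> comes back from XML::Simple (with default options) as a plain hashref rather than a one-element array.

```perl
use strict;
use warnings;

# Hypothetical helper: return the records of one document entry as a
# plain list, whatever shape the 'record' slot happens to have.
sub records_of {
    my ($entry) = @_;
    my $rec = $entry->{record};
    return ()    unless defined $rec;       # document without <record>
    return @$rec if ref $rec eq 'ARRAY';    # several <record> elements
    return ($rec);                          # assumed: a single <record> is a hashref
}

# Hypothetical helper: build one CSV line per record, or one line for
# the document alone when it has no records.
sub csv_lines {
    my ($doc) = @_;
    my @lines;
    for my $key (sort keys %{ $doc->{document} }) {
        my $entry = $doc->{document}{$key};
        my @head  = ($entry->{datetime}, $key,
                     $entry->{sourcecategory}, $entry->{schemeversion});
        my @recs  = records_of($entry);
        push @lines, join(',', @head) unless @recs;
        for my $rec (@recs) {
            push @lines, join(',', @head,
                @{$rec}{qw(sentence-number data-class group)});
        }
    }
    return @lines;
}

# Stand-in for the structure shown in the Data::Dumper output.
my $doc = {
    document => {
        '\2006\200601\20060125\20060125_18.txt' => {
            datetime => '2006/01/25', sourcecategory => 'News Archive',
            schemeversion => '1.1',
        },
        '\2006\200601\20060125\20060125_19.txt' => {
            datetime => '2006/01/25', sourcecategory => 'News Archive',
            schemeversion => '1.1',
            record => [
                { 'sentence-number' => '3',  'data-class' => 'Target', group => 'P6' },
                { 'sentence-number' => '12', 'data-class' => 'Good',   group => 'P6' },
            ],
        },
    },
};

print "$_\n" for csv_lines($doc);
```

If the shape check feels fragile, XML::Simple also has a ForceArray option (for example ForceArray => ['record']) that should make the record slot an array ref whenever it is present, which would leave only the "defined" test. The join here does no quoting, so fields containing commas would still need a CSV module.
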
Re: XML parsing question by Anonymous Monk on Sep 01, 2009 at 03:23 UTC
# Needs "use Data::Dumper;" and runs inside the loop over $key
foreach (@{$doc->{document}->{$key}->{record}}){
    print Dumper($_);
    # Parse the contents as reqd
}

Re: XML parsing question by Jenda (Abbot) on Sep 03, 2009 at 21:04 UTC
use strict;
use XML::Rules;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ();

my $parser = XML::Rules->new(
    stripspaces => 7,
    rules => {
        _default => 'content',
        record => sub {
            my ($tag,$attr,$context,$parent) = @_;
            $csv->combine (
                $parent->[-1]{datetime},
                $parent->[-1]{id},
                $parent->[-1]{sourcecategory},
                $parent->[-1]{schemeversion},
                $attr->{'sentence-number'},
                $attr->{'data-class'},
                $attr->{'group'},
            ) and print $csv->string(),"\n"
              or die "Error building the CSV line: ".$csv->error_input()."\n";
            return;
        },
        '^document' => sub {
            my ($tag,$attr) = @_;
            $csv->combine (
                $attr->{datetime},
                $attr->{id},
                $attr->{sourcecategory},
                $attr->{schemeversion},
            ) and print $csv->string(),"\n"
              or die "Error building the CSV line: ".$csv->error_input()."\n";
            return 1;
        },
        'document' => '', # do not want to remember any data
    }
);

$parser->parse(\*DATA);

__DATA__
<?xml version="1.0" encoding="UTF-8"?>
<results>
<document id="\2006\200601\20060125\20060125_18.txt"
            datetime="2006/01/25"
            sourcecategory="News Archive"
            schemeversion="1.1">
...

The nice thing is that this works even if the XML is huge as it doesn't keep the whole document in memory. Rather it only remembers the attributes of a single <document> and the contents of one <record>.

Jenda
Enoch was right!* Enjoy the last years of Rome.