Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello! I am new to xml and have been tasked with formatting some xml input into a csv file. This is a snippet of the xml file:
<?xml version="1.0" encoding="UTF-8"?>
<results>
  <document id="\2006\200601\20060125\20060125_18.txt" datetime="2006/01/25" sourcecategory="News Archive" schemeversion="1.1">
  </document>
  <document id="\2006\200601\20060125\20060125_19.txt" datetime="2006/01/25" sourcecategory="News Archive" schemeversion="1.1">
    <record>
      <sentence-number>3</sentence-number>
      <data-class>Target</data-class>
      <group>P6</group>
    </record>
    <record>
      <sentence-number>12</sentence-number>
      <data-class>Good</data-class>
      <group>P6</group>
    </record>
  </document>
</results>
Using this code I am able to print out the data listed WITHIN the <document> tag:
#!/usr/bin/perl

# open an output file
unless (open (OUTFILE, ">testoutput.xml")){
    die ("Cannot open output file testoutput.xml\n");
}

use XML::Simple;

my $file = "infile.xml";
my $xs1 = XML::Simple->new();
my $doc = $xs1->XMLin($file);

foreach my $key (keys (%{$doc->{document}})){
    print $doc->{document}->{$key}->{'datetime'}, ",",
          '(' . $key . ')', ",",
          $doc->{document}->{$key}->{sourcecategory}, ",",
          $doc->{document}->{$key}->{schemeversion}, ",",
          "\n";
}
I get something like:
2006/01/25,\2006\200601\20060125\20060125_18.txt,News Archive,1.1
which is perfect. But I have NO IDEA how to get at the second level of data held inside the <record> tags. Some of the <document>s have them and some don't, and the ones that do have <record> tags could have any number of them (1, 2, 7, 14, etc.). Can anyone help me access them? I'd like to print a line for each <record> that also lists the information contained in the corresponding <document>. In theory I'd like it to look like this:
2006/01/25,\2006\200601\20060125\20060125_18.txt,News Archive,1.1
2006/01/25,\2006\200601\20060125\20060125_19.txt,News Archive,1.1,3,Target,P6
2006/01/25,\2006\200601\20060125\20060125_19.txt,News Archive,1.1,12,Good,P6
I hope someone can help! Thanks, Hans

Re: XML parsing question
by ikegami (Patriarch) on Sep 01, 2009 at 03:23 UTC

    Ugh. The problem with XML::Simple is that it doesn't return a consistent format unless you disable everything it considers an advantage. The code is no longer when written with XML::LibXML, and XML::LibXML is much faster.

    For good measure, I used Text::CSV_XS for output.

    use strict;
    use warnings;

    use Text::CSV_XS qw( );
    use XML::LibXML  qw( );

    my $csv = Text::CSV_XS->new({ binary => 1, eol => $/ });

    my $parser = XML::LibXML->new();
    my $doc    = $parser->parse_file('infile.xml');
    my $root   = $doc->documentElement();

    for my $doc_node ( $root->findnodes('/results/document') ) {
        my @doc_fields = map $doc_node->getAttribute($_),
            qw( datetime id sourcecategory schemeversion );

        my @rec_nodes = $doc_node->findnodes('record');
        if (!@rec_nodes) {
            $csv->print(*STDOUT, \@doc_fields);
            next;
        }

        for my $rec_node ( @rec_nodes ) {
            $csv->print(*STDOUT, [
                @doc_fields,
                map $rec_node->findvalue("$_/text()"),
                    qw( sentence-number data-class group )
            ]);
        }
    }
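The inconsistency described in this reply can be seen in a small sketch. The two XML strings below are made-up, cut-down inputs (not the poster's file), and the variable names are only illustrative. With default options, XML::Simple returns a hash ref for a lone <record> and an array ref for several; passing ForceArray and an empty KeyAttr is one way to get a stable shape, at the cost of the "simple" access paths.

```perl
use strict;
use warnings;
use XML::Simple qw( XMLin );

# Hypothetical inputs: document "a" has one record in $one and two in $two.
my $one = '<results>'
        . '<document id="a"><record><group>P6</group></record></document>'
        . '<document id="b"/>'
        . '</results>';
my $two = '<results>'
        . '<document id="a"><record><group>P6</group></record>'
        . '<record><group>P7</group></record></document>'
        . '<document id="b"/>'
        . '</results>';

# Default options: documents are folded into a hash keyed on "id",
# and the type of {record} depends on how many records there are.
my $loose_one = XMLin($one);
my $loose_two = XMLin($two);
print ref( $loose_one->{document}{a}{record} ), "\n";   # HASH
print ref( $loose_two->{document}{a}{record} ), "\n";   # ARRAY

# Forcing arrays and disabling key folding gives the same shape every time.
my %opts = ( ForceArray => [ 'document', 'record' ], KeyAttr => [] );
my $fixed_one = XMLin( $one, %opts );
my $fixed_two = XMLin( $two, %opts );
print ref( $fixed_one->{document}[0]{record} ), "\n";   # ARRAY
print ref( $fixed_two->{document}[0]{record} ), "\n";   # ARRAY
```

With the options set, every document is an element of an array and every {record} is an array ref, so one loop handles all cases.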
Re: XML parsing question
by toolic (Bishop) on Sep 01, 2009 at 03:22 UTC
    Using XML::Twig:
    use strict;
    use warnings;
    use XML::Twig;

    my $xfile = <<'EOF';
    <?xml version="1.0" encoding="UTF-8"?>
    <results>
      <document id="\2006\200601\20060125\20060125_18.txt" datetime="2006/01/25" sourcecategory="News Archive" schemeversion="1.1">
      </document>
      <document id="\2006\200601\20060125\20060125_19.txt" datetime="2006/01/25" sourcecategory="News Archive" schemeversion="1.1">
        <record>
          <sentence-number>3</sentence-number>
          <data-class>Target</data-class>
          <group>P6</group>
        </record>
        <record>
          <sentence-number>12</sentence-number>
          <data-class>Good</data-class>
          <group>P6</group>
        </record>
      </document>
    </results>
    EOF

    my $t= new XML::Twig( twig_handlers => {document => \&doc} );
    $t->parse($xfile);

    sub doc {
        my ($twig, $doc) = @_;
        my $doc_csv = join ',',
            $doc->att('datetime'),
            $doc->att('id'),
            $doc->att('sourcecategory'),
            $doc->att('schemeversion');
        print "$doc_csv\n" unless $doc->children('record');
        for my $rec ($doc->children('record')) {
            print join ',',
                $doc_csv,
                $rec->first_child('sentence-number')->text(),
                $rec->first_child('data-class'     )->text(),
                $rec->first_child('group'          )->text();
            print "\n";
        }
    }

    __END__
    2006/01/25,\2006\200601\20060125\20060125_18.txt,News Archive,1.1
    2006/01/25,\2006\200601\20060125\20060125_19.txt,News Archive,1.1,3,Target,P6
    2006/01/25,\2006\200601\20060125\20060125_19.txt,News Archive,1.1,12,Good,P6
Re: XML parsing question
by astroboy (Chaplain) on Sep 01, 2009 at 03:16 UTC
    Hi, you should dump $doc with something like Data::Dumper:
    $VAR1 = {
      'document' => {
        '\\2006\\200601\\20060125\\20060125_19.txt' => {
          'record' => [
            {
              'group' => 'P6',
              'data-class' => 'Target',
              'sentence-number' => '3'
            },
            {
              'group' => 'P6',
              'data-class' => 'Good',
              'sentence-number' => '12'
            }
          ],
          'sourcecategory' => 'News Archive',
          'datetime' => '2006/01/25',
          'schemeversion' => '1.1'
        },
        '\\2006\\200601\\20060125\\20060125_18.txt' => {
          'sourcecategory' => 'News Archive',
          'datetime' => '2006/01/25',
          'schemeversion' => '1.1'
        }
      }
    };

    As you can see, the records are an array ref of hash refs at the same level as the document's attributes ($doc->{document}->{$key}->{record}). You can iterate over the records at this point, after checking that record is defined and is an array ref.
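A minimal sketch of that iteration, assuming the structure shown in the dump above (hard-coded here so the example runs without the input file). The helper name lines_for is hypothetical, and a plain join is used instead of a CSV module, so fields containing commas or quotes would not be escaped. It also accepts a lone hash ref for record, which is what XML::Simple returns by default when a document has exactly one <record>.

```perl
use strict;
use warnings;

# Structure copied from the Data::Dumper output above.
my $doc = {
    document => {
        '\\2006\\200601\\20060125\\20060125_19.txt' => {
            record => [
                { group => 'P6', 'data-class' => 'Target', 'sentence-number' => '3' },
                { group => 'P6', 'data-class' => 'Good',   'sentence-number' => '12' },
            ],
            sourcecategory => 'News Archive',
            datetime       => '2006/01/25',
            schemeversion  => '1.1',
        },
        '\\2006\\200601\\20060125\\20060125_18.txt' => {
            sourcecategory => 'News Archive',
            datetime       => '2006/01/25',
            schemeversion  => '1.1',
        },
    },
};

# Hypothetical helper: one line per record, or a single line for a
# document that has no records at all.
sub lines_for {
    my ($data) = @_;
    my @lines;
    for my $id ( sort keys %{ $data->{document} } ) {
        my $d          = $data->{document}{$id};
        my @doc_fields = ( $d->{datetime}, $id, @{$d}{qw( sourcecategory schemeversion )} );

        # record may be missing, a single hash ref, or an array ref.
        my $rec     = $d->{record};
        my @records = !defined $rec       ? ()
                    : ref $rec eq 'ARRAY' ? @$rec
                    :                       ($rec);

        if ( !@records ) {
            push @lines, join ',', @doc_fields;
            next;
        }
        for my $r (@records) {
            push @lines, join ',', @doc_fields,
                @{$r}{ 'sentence-number', 'data-class', 'group' };
        }
    }
    return @lines;
}

print "$_\n" for lines_for($doc);
```

The keys are sorted only to make the output order predictable; hash order is otherwise arbitrary, so the original file order is not preserved by this approach.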

Re: XML parsing question
by Anonymous Monk on Sep 01, 2009 at 03:23 UTC
    foreach (@{$doc->{document}->{$key}->{record}}){
            print Dumper($_);
    # Parse the contents as required
    }
    
    
Re: XML parsing question
by Jenda (Abbot) on Sep 03, 2009 at 21:04 UTC
    use strict;
    use XML::Rules;
    use Text::CSV_XS;

    my $csv = Text::CSV_XS->new ();

    my $parser = XML::Rules->new(
        stripspaces => 7,
        rules => {
            _default => 'content',
            record => sub {
                my ($tag,$attr,$context,$parent) = @_;
                $csv->combine (
                    $parent->[-1]{datetime},
                    $parent->[-1]{id},
                    $parent->[-1]{sourcecategory},
                    $parent->[-1]{schemeversion},
                    $attr->{'sentence-number'},
                    $attr->{'data-class'},
                    $attr->{'group'},
                ) and print $csv->string(),"\n"
                  or die "Error building the CSV line: ".$csv->error_input()."\n";
                return;
            },
            '^document' => sub {
                my ($tag,$attr) = @_;
                $csv->combine (
                    $attr->{datetime},
                    $attr->{id},
                    $attr->{sourcecategory},
                    $attr->{schemeversion},
                ) and print $csv->string(),"\n"
                  or die "Error building the CSV line: ".$csv->error_input()."\n";
                return 1;
            },
            'document' => '', # do not want to remember any data
        }
    );

    $parser->parse(\*DATA);

    __DATA__
    <?xml version="1.0" encoding="UTF-8"?>
    <results>
    <document id="\2006\200601\20060125\20060125_18.txt" datetime="2006/01/25" sourcecategory="News Archive" schemeversion="1.1">
    ...

    The nice thing is that this works even if the XML is huge as it doesn't keep the whole document in memory. Rather it only remembers the attributes of a single <document> and the contents of one <record>.

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.