ohm.kazhbu has asked for the wisdom of the Perl Monks concerning the following question:

I have a problem I could sure use some help with. First, be gentle: I am new to both Perl and LibXML. I have been parsing a document and placing elements into an array that is then written to a spreadsheet column. During testing it was discovered that some nodes have more than one child node of the same name. I need to combine the text from each of these child nodes into one element of the array. The format of the XML is:

    <Group id="V-3021">
      <title>blah blah blah</title>
      <description>blah blah blah</description>
      <Rule id="SV-41507r1_rule" severity="medium" weight="10.0">
        <version>blah blah blah</version>
        <title>blah blah blah</title>
        <description>blah blah blah</description>
        <reference>
          <dc:title>blah blah blah</dc:title>
          <dc:publisher>blah blah blahO</dc:publisher>
          <dc:type>blah blah blah</dc:type>
          <dc:subject>blah blah blah</dc:subject>
          <dc:identifier>blah blah blah</dc:identifier>
        </reference>
        <fixtext fixref="F-3046r3_fix">blah blah blah</fixtext>
        <check system="C-39986r2_chk">
          <check-content-ref name="M" href="VMS_XCCDF_Benchmark_Network - Firewall - Cisco.xml"/>
          <check-content>This is the text I want</check-content>
        </check>
      </Rule>
    </Group>

But occasionally it is like this:

    <Group id="V-3021">
      <title>blah blah blah</title>
      <description>blah blah blah</description>
      <Rule id="SV-41507r1_rule" severity="medium" weight="10.0">
        <version>blah blah blah</version>
        <title>blah blah blah</title>
        <description>blah blah blah</description>
        <reference>
          <dc:title>blah blah blah</dc:title>
          <dc:publisher>blah blah blahO</dc:publisher>
          <dc:type>blah blah blah</dc:type>
          <dc:subject>blah blah blah</dc:subject>
          <dc:identifier>blah blah blah</dc:identifier>
        </reference>
        <fixtext fixref="F-3046r3_fix">blah blah blah</fixtext>
        <check system="C-39986r2_chk">
          <check-content-ref name="M" href="VMS_XCCDF_Benchmark_Network - Firewall - Cisco.xml"/>
          <check-content>This is the text I want</check-content>
          <check-content>This is more text that I wantto grab and add to the end of the above text</check-content>
        </check>
      </Rule>
    </Group>

I can pull all the text from "check-content", but if there is more than one it throws off the row of data in the spreadsheet. I need to be able to say something like: if there are two or more <check-content> elements, join the data and push it into the array; if not, just push the data into the array.

Now here is where the rub comes in. I am trying to pull everything below "Rule" and then pull the "check-content" from each of those sections of XML. By doing this I should be able to join the two "check-content" sections together before pushing the data into an array. The problem is that there is a namespace prefix used under the "reference" node (dc:). I have tried registering this namespace with no luck. I actually don't care about that section of data at all, but when I try to pull this section I get an error message that states ":1: namespace error : Namespace prefix dc on title is not defined s>ECAT-1, ECAT-2, ECSC-1</IAControls></description><reference><dc:title"

If I could somehow instruct LibXML to pull everything below "Rule" regardless of what namespace is defined, that would be great. My latest attempt at this looks like this:

    my $parser = XML::LibXML->new() or die $!;
    my $doc1 = $parser->parse_file($filename1);
    my $xc1 = XML::LibXML::XPathContext->new($doc1->documentElement() );
    $xc1->registerNs(x => 'http://checklists.nist.gov/xccdf/1.1');
    $xc1->registerNs(dc => 'http://purl.org/dc/elements/1.1');
    for $Check ( $xc1->findnodes('//x:Rule') ) {
        my $doc2 = $parser->parse_string($Check);
        my $xc2 = XML::LibXML::XPathContext->new($doc2->documentElement() );
        $xc2->registerNs(x => 'http://checklists.nist.gov/xccdf/1.1');
        foreach $Check_Content ( $xc2->findvalue('check-content') ) {
            push (@Check_Content1, $Check_Content);
        }
        @Check_Content1 = ();
        $result_string = $Check_Content1[0] . $Check_Content1[1];
        push (@Check_Content, $result_string);
    }
    }

Replies are listed 'Best First'.
Re: LibXML Namespace issue
by choroba (Cardinal) on Jan 02, 2014 at 20:09 UTC
    In XML, every namespace prefix must be declared before it can be used. If your XML data do not contain the namespace declarations, they are not well-formed XML and cannot be parsed by libxml (or by XML::LibXML).

    Are you sure your XML documents do not contain something like the following?

    xmlns:dc="http://purl.org/dc/elements/1.1"
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      The XML does have the namespace defined: xmlns:dc="http://purl.org/dc/elements/1.1/". The problem I am having is that I need to grab a chunk of the XML, and it contains both the default namespace and the above namespace. I can use the default one like //x:Group/x:Rule etc., but under Rule there is also the "dc" namespace. I either need to be able to grab that part as well, or tell LibXML to ignore that section and give me everything else. The portion that uses the "dc" namespace is of no interest to me. I hope this makes sense.

        An example would probably help more. Anyway, have you tried the wildcards in XPath (Rule/*/text()) or the local-name() function?
        لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
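
        To make the wildcard / local-name() suggestion concrete, here is a minimal sketch, not a definitive fix. The Benchmark root element, the second Group and the single-space separator are assumptions made up for the example; only the namespace URIs and element names come from the thread. The idea is to query each Rule node in place, passing it as the context node to findnodes, instead of re-parsing it as a separate document. A re-parsed fragment can lose namespace declarations that live on an ancestor element, which would explain the "prefix dc ... is not defined" message.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical, cut-down sample. The root element and the second Group
# are assumptions; the namespace URIs are the ones quoted in the thread.
my $xml = <<'XML';
<Benchmark xmlns="http://checklists.nist.gov/xccdf/1.1"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <Group id="V-3021">
    <Rule id="SV-41507r1_rule">
      <reference><dc:title>blah blah blah</dc:title></reference>
      <check system="C-39986r2_chk">
        <check-content>This is the text I want</check-content>
        <check-content>This is more text</check-content>
      </check>
    </Rule>
  </Group>
  <Group id="V-3022">
    <Rule id="SV-2">
      <check><check-content>Only one</check-content></check>
    </Rule>
  </Group>
</Benchmark>
XML

my $doc = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($doc);

# Only the default namespace needs a prefix here; dc: is never queried.
$xpc->registerNs(x => 'http://checklists.nist.gov/xccdf/1.1');

my @check_content;
for my $rule ($xpc->findnodes('//x:Rule')) {
    # Second argument = context node, so the path is relative to this Rule.
    my @texts = map { $_->textContent }
                $xpc->findnodes('x:check/x:check-content', $rule);

    # One array element per Rule, whether it has one check-content or several.
    push @check_content, join ' ', @texts;
}

# Alternative that ignores namespaces entirely by matching on local-name().
my @all_by_local_name = map { $_->textContent }
    $doc->findnodes('//*[local-name()="check-content"]');

print "$_\n" for @check_content;
```

        Each Rule contributes exactly one array element, so the spreadsheet rows should stay aligned. Note also that namespace URIs are compared as exact strings: the registerNs call in the question uses 'http://purl.org/dc/elements/1.1' while the document declares the URI with a trailing slash, so that prefix would not match even where it was needed.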