spstansbury has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to query an XML record based on an attribute, and then return child elements of that record.

This is the code. Eventually it will become a subroutine that the query string is fed to.

use strict;
use warnings;
use XML::XPath;
use XML::XPath::XMLParser;

my $data_file = "./nvdcve-2.0-2008.xml";
my $cve_id = 'CVE-2008-3763';
my $query = "/nvd/entry[@id = $cve_id]";
my $xp = XML::XPath->new(filename => "$data_file");

foreach my $cve_entry ($xp->findnodes($query)) {
    my $AV = $cve_entry->find('//entry/vuln:cvss/cvss:base_metrics/cvss:access-vector');
    my $AC = $cve_entry->find('//entry/vuln:cvss/cvss:base_metrics/cvss:access-complexity');
    # etc...
}

The chunk of the source file looks like this:

<?xml version='1.0' encoding='UTF-8'?>
<nvd xmlns:cpe-lang="http://cpe.mitre.org/language/2.0"
     xmlns:vuln="http://scap.nist.gov/schema/vulnerability/0.4"
     xmlns:cvss="http://scap.nist.gov/schema/cvss-v2/0.2"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xmlns="http://scap.nist.gov/schema/feed/vulnerability/2.0"
     nvd_xml_version="2.0"
     pub_date="2009-01-28T03:10:00"
     xsi:schemaLocation="http://scap.nist.gov/schema/feed/vulnerability/2.0 http://nvd.nist.gov/schema/nvd-cve-feed_2.0.xsd">
  <entry id="CVE-2008-3763">
    <vuln:cvss>
      <cvss:base_metrics>
        <cvss:score>6.8</cvss:score>
        <cvss:access-vector>NETWORK</cvss:access-vector>
        <cvss:access-complexity>MEDIUM</cvss:access-complexity>
        <cvss:authentication>NONE</cvss:authentication>
        <cvss:confidentiality-impact>PARTIAL</cvss:confidentiality-impact>
        <cvss:integrity-impact>PARTIAL</cvss:integrity-impact>
        <cvss:availability-impact>PARTIAL</cvss:availability-impact>
        <cvss:source>http://nvd.nist.gov</cvss:source>
        <cvss:generated-on-datetime>2008-08-21T14:37:00.000-04:00</cvss:generated-on-datetime>
      </cvss:base_metrics>
    </vuln:cvss>
  </entry>

I get "invalid attribute" or other errors - the "@" is not being recognized as a descriptor for the attribute, it seems...

In addition, am I OK not setting the namespace? From the XML::XPath page at CPAN:

set_namespace($prefix, $uri)

Sets the namespace prefix mapping to the uri.

Normally in XML::XPath the prefixes in XPath node tests take their context from the current node. This means that foo:bar will always match an element <foo:bar> regardless of the namespace that the prefix foo is mapped to (which might even change within the document, resulting in unexpected results). In order to make prefixes in XPath node tests actually map to a real URI, you need to enable that via a call to the set_namespace method of your XML::XPath object.

As always, thanks for your time... Scott

Re: XPath query issue...
by ikegami (Patriarch) on Sep 02, 2009 at 19:22 UTC

    I don't know how XML::XPath (mis)handles namespaces, so I'll assume no fixes are needed in that area.

    sub to_xpath_str_literal {
        my ($s) = @_;
        return qq{"$s"} if $s !~ /"/;
        return qq{'$s'} if $s !~ /'/;
        $s =~ s/'/',"'",'/g;
        return qq{concat('$s')};
    }

    my $cve_id_lit = to_xpath_str_literal($cve_id);
    for my $entry ($xp->findnodes("/nvd/entry[\@id=$cve_id_lit]")) {
        my ($metrics) = $entry->findnodes('vuln:cvss/cvss:base_metrics');
        my $av = $metrics->find('cvss:access-vector');
        my $ac = $metrics->find('cvss:access-complexity');
        ...
    }

    Three fixes:

    • Corrected the string literal to avoid interpolating @id.
    • Properly converted the id string into an XPath literal.
    • Adjusted the inner XPath to be relative to the topic node rather than the root ("/").
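
To make the first two fixes concrete, here is a minimal, self-contained sketch that reuses the to_xpath_str_literal helper from the reply above and only prints the query strings it builds. The build_query wrapper and every sample id other than CVE-2008-3763 are made up for illustration; nothing here touches XML::XPath itself.

```perl
use strict;
use warnings;

# Helper from the reply above: quote a Perl string as an XPath 1.0 literal.
# XPath 1.0 has no escape character, so a string containing both quote
# styles has to be assembled with concat().
sub to_xpath_str_literal {
    my ($s) = @_;
    return qq{"$s"} if $s !~ /"/;
    return qq{'$s'} if $s !~ /'/;
    $s =~ s/'/',"'",'/g;
    return qq{concat('$s')};
}

# Hypothetical wrapper: \@ stops Perl from interpolating an array named @id
# inside the double-quoted string.
sub build_query {
    my ($id) = @_;
    return "/nvd/entry[\@id=" . to_xpath_str_literal($id) . "]";
}

# The last three ids are invented to exercise each branch of the helper.
print build_query($_), "\n"
    for 'CVE-2008-3763', q{it's}, q{say "hi"}, q{a'b"c};
```

Writing the query in single quotes and concatenating the literal would avoid the backslash entirely; which form reads better is a matter of taste.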

      Thank you!

      That worked as I had intended, but could not make happen...

      Question: The process is very slow. The source files range from 12 MB to 30 MB. Is there another approach that would search through the file faster?

      Thanks again for your help...

      Scott...

        I don't know how fast a parser you are using. XML::LibXML is extremely fast.

        Each XPath expression is probably parsed again on every pass through the loop. You could avoid using XPath inside the loop.

        Each find searches through all of the child nodes, and since it seems you want most of them, you could replace the XPath with a loop that populates a hash.
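
The suggestions above could be combined roughly as follows. This is a sketch under stated assumptions, not the poster's code: it swaps XML::XPath for XML::LibXML, runs on a trimmed inline sample instead of the real feed file, and the metrics_for name and the feed prefix are invented for the example. One difference to be aware of is that XML::LibXML matches unprefixed names in XPath only against elements in no namespace, so the feed's default namespace has to be registered under a prefix.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Trimmed-down inline sample standing in for the real NVD feed file.
my $xml = <<'XML';
<?xml version='1.0' encoding='UTF-8'?>
<nvd xmlns:vuln="http://scap.nist.gov/schema/vulnerability/0.4"
     xmlns:cvss="http://scap.nist.gov/schema/cvss-v2/0.2"
     xmlns="http://scap.nist.gov/schema/feed/vulnerability/2.0">
  <entry id="CVE-2008-3763">
    <vuln:cvss>
      <cvss:base_metrics>
        <cvss:score>6.8</cvss:score>
        <cvss:access-vector>NETWORK</cvss:access-vector>
        <cvss:access-complexity>MEDIUM</cvss:access-complexity>
      </cvss:base_metrics>
    </vuln:cvss>
  </entry>
</nvd>
XML

my $doc = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($doc);

# "feed" is an arbitrary prefix chosen here for the default namespace.
$xpc->registerNs(feed => 'http://scap.nist.gov/schema/feed/vulnerability/2.0');
$xpc->registerNs(vuln => 'http://scap.nist.gov/schema/vulnerability/0.4');
$xpc->registerNs(cvss => 'http://scap.nist.gov/schema/cvss-v2/0.2');

# Hypothetical helper: one XPath call locates the metrics element, then a
# plain loop copies every child element into a hash keyed by local name.
# Assumes $cve_id contains no double quote; otherwise quote it properly.
sub metrics_for {
    my ($xpc, $cve_id) = @_;
    my ($metrics) = $xpc->findnodes(
        qq{/feed:nvd/feed:entry[\@id="$cve_id"]/vuln:cvss/cvss:base_metrics});
    return unless $metrics;

    my %metric;
    for my $child ($metrics->childNodes) {
        next unless $child->nodeType == XML_ELEMENT_NODE;   # skip whitespace text
        $metric{ $child->localname } = $child->textContent;
    }
    return \%metric;
}

my $m = metrics_for($xpc, 'CVE-2008-3763');
print "$_ = $m->{$_}\n" for sort keys %$m;
```

For the real file, load_xml(location => $data_file) would replace the inline string.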

Re: XPath query issue...
by Corion (Patriarch) on Sep 02, 2009 at 17:09 UTC

    I don't believe you:

    use strict;
    ...
    my $cve_id = 'CVE-2008-3763';
    my $query = "/nvd/entry[@id = $cve_id]";

    This code gives me:

    Possible unintended interpolation of @id in string at tmp.pl line 3.
    Global symbol "@id" requires explicit package name at tmp.pl line 3.
    Execution of tmp.pl aborted due to compilation errors.

    So you're either not using strict and warnings, or you're not posting the code you're running. Either way, I don't see how you expect us to help you, if you're not forthcoming about what you're actually doing.

      Did you try changing
      my $query = "/nvd/entry[@id = $cve_id]";
      to:
      my $query = "/nvd/entry[@id = '$cve_id']";
      As for a better parser, it depends on how much information you want to read. XML::LibXML uses a DOM parser, which builds a tree model of the document that you can then access. The other standard parsing model is SAX, which is event-based and is the fastest approach. SAX parsers are much more complicated to use, but in my experience switching resulted in an improvement of 500%.
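
As a rough illustration of the streaming idea, the sketch below does not use SAX proper; it substitutes XML::LibXML::Reader, a pull-style interface that also avoids building a tree for the whole document. It runs on a small inline sample, and the find_metrics name and the second entry (CVE-2008-0001, with made-up values) are invented for the example. Only the matching entry is expanded into a DOM fragment.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

my $FEED_NS = 'http://scap.nist.gov/schema/feed/vulnerability/2.0';
my $CVSS_NS = 'http://scap.nist.gov/schema/cvss-v2/0.2';

# Inline sample; the first entry is invented so there is something to skip.
my $xml = <<'XML';
<nvd xmlns:vuln="http://scap.nist.gov/schema/vulnerability/0.4"
     xmlns:cvss="http://scap.nist.gov/schema/cvss-v2/0.2"
     xmlns="http://scap.nist.gov/schema/feed/vulnerability/2.0">
  <entry id="CVE-2008-0001">
    <vuln:cvss><cvss:base_metrics>
      <cvss:access-vector>LOCAL</cvss:access-vector>
      <cvss:access-complexity>LOW</cvss:access-complexity>
    </cvss:base_metrics></vuln:cvss>
  </entry>
  <entry id="CVE-2008-3763">
    <vuln:cvss><cvss:base_metrics>
      <cvss:access-vector>NETWORK</cvss:access-vector>
      <cvss:access-complexity>MEDIUM</cvss:access-complexity>
    </cvss:base_metrics></vuln:cvss>
  </entry>
</nvd>
XML

# Hypothetical helper: walk the stream one <entry> at a time and stop at
# the first entry whose id attribute matches.
sub find_metrics {
    my ($reader, $wanted_id) = @_;
    while ($reader->nextElement('entry', $FEED_NS) == 1) {
        next unless ($reader->getAttribute('id') // '') eq $wanted_id;

        # Expand only this entry into a DOM subtree.
        my $entry = $reader->copyCurrentNode(1);
        my %metric;
        for my $name ('access-vector', 'access-complexity') {
            my ($el) = $entry->getElementsByTagNameNS($CVSS_NS, $name);
            $metric{$name} = $el->textContent if $el;
        }
        return \%metric;
    }
    return;   # no matching entry found
}

my $reader = XML::LibXML::Reader->new(string => $xml);
my $found  = find_metrics($reader, 'CVE-2008-3763');
print "$_ = $found->{$_}\n" for sort keys %$found;
```

For a file on disk, XML::LibXML::Reader->new(location => $data_file) would replace the string argument. Whether this is actually faster than the DOM approach for a 12 MB to 30 MB file would need to be measured.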

      Umm, that's the point - the code doesn't work - if it worked I wouldn't be asking how to fix it..

        The issue is just that: the @ is being interpreted as the sigil of a global Perl array, when it should behave as the XPath symbol signifying an attribute.

        my $query = "/nvd/entry[@id = $cve_id]";
        my $xp = XML::XPath->new(filename => "$data_file");
        foreach my $cve_entry ($xp->findnodes($query)) {
        }