in reply to Re^2: Validating an XML file with multiple schemas
in thread Validating an XML file with multiple schemas

It's unclear to me whether by "multiple schemas" you mean validating one XML file against multiple different schemas, or whether it's one Schema file that includes other Schema files. Could you show a short, complete example, with simple XSD files that represent what you're trying to do? Please see Short, Self-Contained, Correct Example.

The following works for me.

schema.xsd:

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        targetNamespace="http://www.example.com"
        xmlns:foo="http://www.example.com"
        elementFormDefault="qualified">
  <include schemaLocation="included.xsd" />
  <element name="hello">
    <complexType>
      <sequence>
        <element name="world" type="foo:worldType" />
      </sequence>
    </complexType>
  </element>
</schema>

included.xsd:

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        targetNamespace="http://www.example.com"
        xmlns:xhtml="http://www.w3.org/1999/xhtml"
        xmlns:foo="http://www.example.com"
        elementFormDefault="qualified">
  <import namespace="http://www.w3.org/1999/xhtml"
          schemaLocation="http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd" />
  <complexType name="worldType">
    <complexContent>
      <extension base="xhtml:Flow">
        <attribute name="foo" type="string" use="required" />
      </extension>
    </complexContent>
  </complexType>
</schema>

Code - Note it was necessary to use XML::LibXML::externalEntityLoader() instead of $parser->input_callbacks(), because I didn't see another way for the callbacks to affect XML::LibXML::Schema.

use warnings;
use strict;
use utf8;
use XML::LibXML;
use URI;
use HTTP::Tiny;

my $http = HTTP::Tiny->new;
my %cache;
XML::LibXML::externalEntityLoader(sub {
    my ($url, $id) = @_;
    die "Can't handle ID '$id'" if length $id;
    my $uri = URI->new($url);
    my $file;
    if (!$uri->scheme) { $file = $url }
    elsif ($uri->scheme eq 'file') { $file = $uri->path }
    if (defined $file) {
        warn "'$uri' => Loading '$file' from disk\n"; #Debug
        open my $fh, '<', $file or die "$file: $!";
        my $data = do { local $/; <$fh> };
        close $fh;
        return $data;
    } # else
    die "Can't handle URL scheme: ".$uri->scheme
        unless $uri->scheme=~/\Ahttps?\z/i;
    if (!defined $cache{$uri}) {
        warn "'$uri' => Fetching...\n"; #Debug
        my $resp = $http->get($uri);
        die "$uri: $resp->{status} $resp->{reason}\n"
            unless $resp->{success};
        $cache{$uri} = $resp->{content};
    } else { warn "'$uri' => Cached\n"; } #Debug
    return $cache{$uri};
});

print "Loading schema...\n";
my $xsd = XML::LibXML::Schema->new( location => 'schema.xsd' );

my @xmls = (<<'END_XML_ONE',<<'END_XML_TWO',<<'END_XML_THREE');
<?xml version="1.0" encoding="UTF-8"?>
<hello xmlns="http://www.example.com">
  <world foo="bar">
    <p xmlns="http://www.w3.org/1999/xhtml">
      <i>x</i>
    </p>
  </world>
</hello>
END_XML_ONE
<?xml version="1.0" encoding="UTF-8"?>
<hello xmlns="http://www.example.com">
  <world>
    <p xmlns="http://www.w3.org/1999/xhtml">
      <i>x</i>
    </p>
  </world>
</hello>
END_XML_TWO
<?xml version="1.0" encoding="UTF-8"?>
<hello xmlns="http://www.example.com">
  <world foo="bar">
    <p xmlns="http://www.w3.org/1999/xhtml">
      <foo>x</foo>
    </p>
  </world>
</hello>
END_XML_THREE

my $i = 1;
for my $xml (@xmls) {
    print "Validating XML #$i...\n";
    my $doc = XML::LibXML->load_xml( string => $xml );
    if ( eval { $xsd->validate($doc); 1 } ) {
        print "=> Valid!\n"
    }
    else {
        print "=> Invalid! $@"
    }
}
continue { $i++ }

Output:

Loading schema...
'schema.xsd' => Loading 'schema.xsd' from disk
'included.xsd' => Loading 'included.xsd' from disk
'http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd' => Fetching...
'http://www.w3.org/2001/xml.xsd' => Fetching...
Validating XML #1...
=> Valid!
Validating XML #2...
=> Invalid! unknown-137e570:0: Schemas validity error : Element '{http://www.example.com}world': The attribute 'foo' is required but missing.
Validating XML #3...
=> Invalid! unknown-137e570:0: Schemas validity error : Element '{http://www.w3.org/1999/xhtml}foo': This element is not expected. Expected is one of ( {http://www.w3.org/1999/xhtml}a, {http://www.w3.org/1999/xhtml}br, {http://www.w3.org/1999/xhtml}span, {http://www.w3.org/1999/xhtml}bdo, {http://www.w3.org/1999/xhtml}object, {http://www.w3.org/1999/xhtml}applet, {http://www.w3.org/1999/xhtml}img, {http://www.w3.org/1999/xhtml}map, {http://www.w3.org/1999/xhtml}iframe, {http://www.w3.org/1999/xhtml}tt ).

And just for the sake of completeness, here's the original code I posted on StackOverflow that uses an XML::LibXML::InputCallback:

use warnings;
use strict;
use XML::LibXML;
use HTTP::Tiny;
use URI;

my $parser = XML::LibXML->new;
my $cb = XML::LibXML::InputCallback->new;
my $http = HTTP::Tiny->new;
my %cache;
$cb->register_callbacks([
    sub { 1 }, # match (URI), returns Bool
    sub { # open (URI), returns Handle
        my $uri = URI->new($_[0]);
        my $file;
        #warn "Handling <<$uri>>\n"; #Debug
        if (!$uri->scheme) { $file = $_[0] }
        elsif ($uri->scheme eq 'file') { $file = $uri->path }
        elsif ($uri->scheme=~/\Ahttps?\z/i) {
            if (!defined $cache{$uri}) {
                my $resp = $http->get($uri);
                die "$uri: $resp->{status} $resp->{reason}\n"
                    unless $resp->{success};
                $cache{$uri} = $resp->{content};
            }
            $file = \$cache{$uri};
        }
        else { die "unsupported URL scheme: ".$uri->scheme }
        open my $fh, '<', $file or die "$file: $!";
        return $fh;
    },
    sub { # read (Handle,Length), returns Data
        my ($fh,$len) = @_;
        read($fh, my $buf, $len);
        return $buf;
    },
    sub { close shift } # close (Handle)
]);
$parser->input_callbacks($cb);

my $doc = $parser->load_xml( IO => \*DATA );
print "Is valid: ", $doc->is_valid ? "yes" : "no", "\n";

__DATA__
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE LinkSet PUBLIC "-//NLM//DTD LinkOut 1.0//EN"
    "https://www.ncbi.nlm.nih.gov/projects/linkout/doc/LinkOut.dtd" [
    <!ENTITY base.url "https://some.domain.com">
    <!ENTITY icon.url "https://some.domain.com/logo.png">
]>
<LinkSet>
  <Link>
    <LinkId>1</LinkId>
    <ProviderId>XXXX</ProviderId>
    <IconUrl>&icon.url;</IconUrl>
    <ObjectSelector>
      <Database>PubMed</Database>
      <ObjectList>
        <ObjId>1234567890</ObjId>
      </ObjectList>
    </ObjectSelector>
    <ObjectUrl>
      <Base>&base.url;</Base>
      <Rule>/1/</Rule>
    </ObjectUrl>
  </Link>
</LinkSet>

And finally, here's a variation of the caching code that uses an on-disk cache (Update: It's not perfect, because there's a tiny chance of filename collisions if clean_fragment happens to map two URLs to the same filename, but this is meant to be more of a proof of concept; there are plenty of other caching mechanisms available. As just one example, note how I used Memoize::Storable to cache the return values of the get_deps function here.):

my $CACHE_DIR = '/tmp/xmlcache';
use File::Path qw/make_path/;
make_path($CACHE_DIR, {verbose=>1});
use URI;
use HTTP::Tiny;
use Text::CleanFragment qw/clean_fragment/;
use File::Spec::Functions qw/catfile/;

my $http = HTTP::Tiny->new;
XML::LibXML::externalEntityLoader(sub {
    my ($url, $id) = @_;
    die "Can't handle ID '$id'" if length $id;
    my $uri = URI->new($url);
    my $file;
    if (!$uri->scheme) { $file = $url }
    elsif ($uri->scheme eq 'file') { $file = $uri->path }
    elsif ($uri->scheme=~/\Ahttps?\z/i) {
        # Note there is a (tiny) chance of filename collisions here!
        $file = catfile($CACHE_DIR, clean_fragment("$uri"));
        if (!-e $file) {
            warn "'$uri' => Mirroring to '$file'...\n"; #Debug
            my $resp = $http->mirror($uri, "$file");
            die "$uri: $resp->{status} $resp->{reason}\n"
                unless $resp->{success};
        }
    }
    else { die "Can't handle URL scheme: ".$uri->scheme }
    warn "'$uri' => Loading '$file' from disk\n"; #Debug
    open my $fh, '<', $file or die "$file: $!";
    my $data = do { local $/; <$fh> };
    close $fh;
    return $data;
});
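One way to sidestep the filename-collision caveat mentioned above would be to name the cache files after a digest of the URL instead of a cleaned-up version of it. The following is only a minimal sketch of that idea, not something from the code above: cache_file_for is a hypothetical helper name, the .xml suffix is an arbitrary choice, and it uses the core modules Digest::SHA and Encode.

```perl
use warnings;
use strict;
use Digest::SHA qw/sha256_hex/;
use Encode qw/encode_utf8/;
use File::Spec::Functions qw/catfile/;

my $CACHE_DIR = '/tmp/xmlcache';

# Hypothetical helper: map a URL to a cache filename via its SHA-256 digest.
# The URL is encoded to bytes first, because sha256_hex dies on wide characters.
sub cache_file_for {
    my ($url) = @_;
    return catfile($CACHE_DIR, sha256_hex(encode_utf8("$url")) . '.xml');
}

# The same URL always maps to the same file; different URLs map to
# different files, barring a SHA-256 collision.
print cache_file_for('http://www.w3.org/2001/xml.xsd'), "\n";
```

In the loader above, this would stand in for the catfile($CACHE_DIR, clean_fragment("$uri")) expression. The trade-off is that the filenames are no longer human-readable.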

Re^4: Validating an XML file with multiple schemas
by mart0000 (Initiate) on Jan 08, 2019 at 16:30 UTC

    Here's an exaggerated example where a Personal Information schema (captured as personal.xsd) uses a flexible Contact schema. A contact can be an address, an email, a specific online ID, a phone number, etc.; I've provided a sample address.xsd and email.xsd. The Contact section of the Personal Information schema allows such content extension using the broader "any" element, but still keeps the validation strict on purpose.

    To test, create a temporary folder for the 5 files (2 .xml and 3 .xsd) I've provided below. I used C:\temp1 in my example, but alter the attached Perl code to point to your path.

    personal.xsd

    <?xml version="1.0" encoding="UTF-8"?>
    <schema xmlns="http://www.w3.org/2001/XMLSchema"
            xmlns:per="urn:tempuri:Personal"
            targetNamespace="urn:tempuri:Personal"
            elementFormDefault="unqualified">
      <element name="PersonalInfo">
        <complexType>
          <sequence>
            <element name="FirstName" type="string"/>
            <element name="LastName" type="string"/>
            <element name="Contact" type="per:ContactType"/>
          </sequence>
        </complexType>
      </element>
      <complexType name="ContactType">
        <sequence>
          <any namespace="##other" processContents="strict" maxOccurs="unbounded"/>
        </sequence>
      </complexType>
    </schema>

    address.xsd

    <?xml version="1.0" encoding="utf-8"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               xmlns:Contact="urn:tempuri:Contact"
               targetNamespace="urn:tempuri:Contact"
               elementFormDefault="unqualified">
      <xs:element name="Address">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="Street" type="xs:string"/>
            <xs:element name="City" type="xs:string"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:schema>

    email.xsd

    <?xml version="1.0" encoding="utf-8"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               xmlns:Contact="urn:tempuri:Contact"
               targetNamespace="urn:tempuri:Contact"
               elementFormDefault="unqualified">
      <xs:element name="Email">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="EmailAddress" type="xs:string"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:schema>

    example1.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <pinfo:PersonalInfo xmlns:pinfo="urn:tempuri:Personal"
                        xmlns:cinfo="urn:tempuri:Contact"
                        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                        xsi:schemaLocation="urn:tempuri:Personal personal.xsd">
      <FirstName>First Name</FirstName>
      <LastName>Last Name</LastName>
      <Contact>
        <cinfo:Address xsi:schemaLocation="urn:tempuri:Contact address.xsd">
          <Street>Main Street</Street>
          <City>Main City</City>
        </cinfo:Address>
      </Contact>
    </pinfo:PersonalInfo>

    example2.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <pinfo:PersonalInfo xmlns:pinfo="urn:tempuri:Personal"
                        xmlns:cinfo="urn:tempuri:Contact"
                        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                        xsi:schemaLocation="urn:tempuri:Personal personal.xsd">
      <FirstName>First Name</FirstName>
      <LastName>Last Name</LastName>
      <Contact>
        <cinfo:Email xsi:schemaLocation="urn:tempuri:Contact email.xsd">
          <EmailAddress>email1@test.org</EmailAddress>
        </cinfo:Email>
      </Contact>
    </pinfo:PersonalInfo>

    And finally the Perl code:

    testExample.pl

    #!/usr/bin/perl
    package example;
    use XML::LibXML;
    use strict;
    use warnings;

    testExample1();
    testExample2();

    sub testExample1 {
        my $schema = XML::LibXML::Schema->new( location => "C:/temp1/personal.xsd" );
        my $document = XML::LibXML->load_xml( location => "C:/temp1/example1.xml" );
        $schema->validate( $document );
    }

    sub testExample2 {
        my $schema = XML::LibXML::Schema->new( location => "C:/temp1/personal.xsd" );
        my $document = XML::LibXML->load_xml( location => "C:/temp1/example2.xml" );
        $schema->validate( $document );
    }

    Hopefully, you'll see an error similar to the following for the 1st example:

    C:/temp1/example1.xml:0: Schemas validity error :
        Element '{urn:tempuri:Contact}Address':
        No matching global element declaration available, but demanded
        by the strict wildcard.

    It's possible I'm missing an appropriate way to reference the contact namespace for the address and email schemas within the xml. There are other ways to successfully achieve validation, such as altering the personal.xsd file to statically import the other 2 schemas. Unfortunately, that won't be an option, unless I've mistyped/overlooked a schema definition nuance while creating the example.

    Running the test outside Perl works correctly with strict validation turned on. I did have to add those schemas programmatically, though (which was easy). If there's a similar way to import the Contact namespace of either schema in Perl, right before the XML validation, it should solve the problem too.

      element Address: Schemas validity error : Element '{urn:tempuri:Contact}Address': No matching global element declaration available, but demanded by the strict wildcard.

      I get the same error when I run xmllint on these files from the command line. It seems to me this is more of a libxml2/Schema question than a Perl question. Although I haven't yet found a good description of the issue, it may be a limitation of libxml2, and therefore of XML::LibXML, that it does not respect the xsi:schemaLocation attribute; see e.g. this bug report.

      As for the design of these Schemas, I'm not sure that having both address.xsd and email.xsd provide potentially conflicting definitions for the namespace urn:tempuri:Contact is the best solution. You might want to consider one namespace per toplevel element?

      Running the test outside Perl works correctly with strict validation turned on. I did have to add (with ease) those schemas programmatically though.

      What validator are you using here? Could you share more information on how you achieved this?

      There are other ways to successfully achieve validation by altering the personal.xsd file to statically import the other 2 schemas.

      Could you explain why that's not an option? E.g. which of the files in your example can't you modify and why? On the one hand, I understand the need to just be able to plug various schemas in and have them imported automatically, on the other, being able to plug any other schema into the current one kind of defeats the purpose of validation ;-) If it were me, I might set up a workaround in which I write a script that modifies personal.xsd and adds the appropriate <import> statements to pull in the other schemas, giving me control over which Schemas I want to allow. It's all XML after all, and programmatic modification isn't a problem.
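A minimal sketch of that workaround, assuming XML::LibXML is used for the modification as well. add_imports is a hypothetical helper, the schema below is an abbreviated stand-in for personal.xsd, and contact.xsd is an assumed wrapper file that does not exist in the example above:

```perl
use warnings;
use strict;
use XML::LibXML;

my $XSD_NS = 'http://www.w3.org/2001/XMLSchema';

# Hypothetical helper (not part of XML::LibXML): add one <import> per
# [namespace, schemaLocation] pair ahead of the existing children of
# <schema>, since XSD requires imports to precede the other declarations.
sub add_imports {
    my ($doc, @imports) = @_;
    my $root  = $doc->documentElement;
    my $first = $root->firstChild;
    for my $import (@imports) {
        my ($namespace, $location) = @$import;
        my $el = $doc->createElementNS($XSD_NS, 'import');
        $el->setAttribute('namespace',      $namespace);
        $el->setAttribute('schemaLocation', $location);
        $root->insertBefore($el, $first);
    }
    return $doc;
}

# Abbreviated stand-in for personal.xsd
my $doc = XML::LibXML->load_xml(string => <<'END_XSD');
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:per="urn:tempuri:Personal"
        targetNamespace="urn:tempuri:Personal"
        elementFormDefault="unqualified">
  <complexType name="ContactType">
    <sequence>
      <any namespace="##other" processContents="strict"
           maxOccurs="unbounded"/>
    </sequence>
  </complexType>
</schema>
END_XSD

# contact.xsd is an assumed wrapper schema, see the note below
add_imports($doc, ['urn:tempuri:Contact', 'contact.xsd']);
print $doc->toString;
```

The reason for the assumed contact.xsd wrapper: as far as I know, libxml2 skips a second <import> for a namespace it has already imported, so importing address.xsd and email.xsd separately would likely load only one of them. A wrapper with targetNamespace urn:tempuri:Contact that pulls in both files via <include> avoids that. The patched schema could then be written to a file next to the other schemas and loaded with XML::LibXML::Schema->new( location => ... ), which keeps the relative schemaLocation simple.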

        I think I suspected some limitation around libxml2 myself. It was hard to tell without enough experience with it. As for the schema examples, they were crafted to demonstrate the condition. So I could have defined the namespace either way - shared/unique, with consistent results. Having said that, when designing schemas with high reuse and extensibility, the shared namespace will start to make sense, given the right context and utilization. Very useful in larger, shared projects.

        The external validator was Java-based. There are a few other commercial tools out there that would have worked just as well. The original schema on which I modeled personal.xsd is part of a larger set managed by a vendor. The set has been in use for several years, by us and other clients, so alteration was never in scope. Besides, clients using C++ and Java processors have no trouble consuming (and generating) XML based on these schemas. I don't think I would have, either. It's just that my particular effort required the use of Perl.

        I believe I will take a different approach to validating the XML, at the expense of labor :-(. I will also attempt to get in touch with xmlsoft, when time permits, to see if they'll view this as something to be solved (or already solved) in a future release. You appear to be very knowledgeable on this subject as well! I thank you for your willingness and overall attitude.

Re^4: Validating an XML file with multiple schemas
by mart0000 (Initiate) on Jan 07, 2019 at 03:53 UTC

    I really appreciate you taking the effort to try things out and provide examples. It means a lot! I will attempt to provide a more concrete example.