
I wonder why you are convinced you'd need regular expressions to do so.

Maybe "need" was the wrong word. I'm convinced it will be easier than building my own backtracking machine. If you squint and turn your head, XML Schema's content models look a lot like primitive regular expressions. A simple sequence:

   <sequence>
      <element name="foo" minOccurs="1" maxOccurs="unbounded"/>
      <element name="bar" minOccurs="0" maxOccurs="1"/>
      <element name="bif" minOccurs="3" maxOccurs="5"/>
   </sequence>

Looks to me like it could be easily checked against something like this:

    /(foo)+
     (bar)?
     (bif){3,5}/x
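A minimal sketch of how that check might look in practice. The helper names (`encode_children`, `children_ok`) and the trick of appending a space to each element name are assumptions for illustration, not how XML::Validator::Schema works today. A space is a safe terminator because XML element names cannot contain one.

```perl
use strict;
use warnings;

# Hypothetical helper: flatten the child element names into one string,
# ending each name with a space so "foo" can never match inside "foobar".
sub encode_children {
    my @names = @_;
    return join '', map { "$_ " } @names;
}

# Content model: foo{1,unbounded} bar{0,1} bif{3,5}, anchored at both ends.
# Under /x a literal space must be written as "\ ".
my $model = qr/\A
    (?:foo\ )+
    (?:bar\ )?
    (?:bif\ ){3,5}
\z/x;

# Hypothetical helper: returns 1 if the children satisfy the model, else 0.
sub children_ok {
    my @names = @_;
    return encode_children(@names) =~ $model ? 1 : 0;
}

print children_ok(qw(foo foo bar bif bif bif)) ? "valid\n" : "invalid\n";  # valid
print children_ok(qw(foo bif bif))             ? "valid\n" : "invalid\n";  # invalid: only two bif
print children_ok(qw(foobar bif bif bif))      ? "valid\n" : "invalid\n";  # invalid: foobar is not foo
```

The `\A` and `\z` anchors matter here: without them the pattern would happily match a valid run buried in the middle of an invalid child list.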

(Ignoring problems with elements named "foobar" for the moment.) Now, that looks easy enough to do by hand; in fact it is, and XML::Validator::Schema handles that now with a simple loop. But XML Schema lets you do much more complicated stuff, equivalent to a regex like:

    
    /(
       ((foo){2,3}
        (bar)*)
       |
       ((foo)
        (bar)*
        (bif){1,3})
      )+/x

Now it seems to me that to match something like that I've either got to invent my own backtracking engine or co-opt an existing one. Does that make sense?
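To make the "co-opt an existing one" option concrete, here is a rough sketch that hands the more complicated model straight to Perl's regex engine. The `matches_model` helper and the space-terminated encoding of element names are assumptions made up for this example.

```perl
use strict;
use warnings;

# Assumed encoding: each child element name is followed by one space,
# so names are matched whole. Under /x a literal space is written "\ ".
my $model = qr/\A
    (?:
        (?: (?:foo\ ){2,3} (?:bar\ )* )
      |
        (?: foo\  (?:bar\ )* (?:bif\ ){1,3} )
    )+
\z/x;

# Hypothetical helper: returns 1 if the child names satisfy the model, else 0.
sub matches_model {
    my @names   = @_;
    my $encoded = join '', map { "$_ " } @names;
    return $encoded =~ $model ? 1 : 0;
}

for my $children ([qw(foo foo bar)], [qw(foo foo foo bif)], [qw(foo bar)]) {
    printf "%-20s %s\n", "@$children",
        matches_model(@$children) ? 'valid' : 'invalid';
}
```

The second case is the interesting one: the engine first grabs all three `foo` for the first branch, fails on `bif`, then backs off to two `foo` and matches `foo bif` with the second branch. That is exactly the backtracking I'd rather not write by hand.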

And if the application is serious, that is, your schemas are large, you have lots of them, and time is important

The application is relatively serious (work related, millions of mega bucks on the line, etc) but the schemas are small and time is relatively unimportant. I think that's true for a lot of XML Schema usage, actually. XML processing, even when it's done on very large files, is usually concerned with relatively simple structures in my experience. And anyone doing XML processing in a time-critical application probably chose the wrong tool for the job. All my XML applications are batch-processing systems.

I'm sure someday we'll have a 100% feature-complete C-based XML Schema validator and we won't need my dirty little Perl hack anymore. Until then, I'm just looking for the quickest way to drop Xerces/C++ like the steaming bag it is.

-sam

In reply to Re: Re: Abusing Regular Expressions by samtregar
in thread Abusing Regular Expressions by samtregar
