Abusing Regular Expressions

samtregar has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Abusing Regular Expressions by Abigail-II (Bishop) on Sep 26, 2003 at 22:20 UTC
Shalom, monks. I'm working on XML::Validator::Schema and I've become convinced that I need to use the regular expression engine to check XML Schema content models. I wonder why you are convinced you'd need regular expressions to do so. While it might be possible to do so with regular expressions, and, if I were into XML, I might do it with regular expressions just for the hack of it, I wouldn't use it for anything serious. Checking XML schema just screams "parse me, parse me". And if the application is serious, that is, your schemas are large, you have lots of them, and time is important, you wouldn't use a recursive decent parser, but a parser that's fast (limited lookahead). Having said that, I've posted here regular expressions to solve the 8 queens problem, and to find Hamiltonian paths in graphs. (I don't have the links). And on Dominus site (perl.plover.com, IIRC) there's an explanation of a regex I made that solves 3-SAT, proving that deciding whether regular expressions (even with (?{ }) and (??{ }) constructs) match, is NP-complete. Abigail	[reply]
Re: Re: Abusing Regular Expressions by samtregar (Abbot) on Sep 27, 2003 at 03:42 UTC
I wonder why you are convinced you'd need regular expressions to do so. Maybe need was the wrong word. I'm convinced it will be easier than building my own backtracking machine. If you squint and turn your head XML Schema's content models look a lot like primitive regular expressions. A simple sequence: <sequence> <element name="foo" minOccurs="1" maxOccurs="unbounded"> <element name="bar" minOccurs="0" maxOccurs="1"> <element name="bif" minOccurs="3" maxOccurs="5"> </sequence> Looks to me like it could be easily checked against something like this: `/(foo) (bar)? (bif){3,5}/x` [download] (ignoring problems with elements named "foobar" for the moment) Now, that looks easy enough to do by hand (in fact it is, XML::Validator::Schema handles that now with a simple loop). But XML Schema lets you do much more complicated stuff, equivalent to a regex like: `/( ((foo){2,3} (bar)) \| ((foo) (bar) (bif){1,3}) )+/x` [download] Now it seems to me that to match something like that I've either got to invent my own backtracking engine or co-opt an existing one. Does that make sense? And if the application is serious, that is, your schemas are large, you have lots of them, and time is important The application is relatively serious (work related, millions of mega bucks on the line, etc) but the schemas are small and time is relatively unimportant. I think that's true for a lot of XML Schema usage, actually. XML processing, even when it's done on very large files, is usually concerned with relatively simple structures in my experience. And anyone doing XML processing in a time-critical application probably chose the wrong tool for the job. All my XML applications are batch-processing systems. I'm sure someday we'll have a 100% feature-complete C-based XML Schema validator and we won't need my dirty little Perl hack anymore. Until then, I'm just looking for the quickest way to drop Xerces/C++ like the steaming bag it is. -sam	[reply] [d/l] [select]
Re: Re: Re: Abusing Regular Expressions by BrowserUk (Patriarch) on Sep 27, 2003 at 10:46 UTC
Reading your description, I tend to agree with you, but then I'm a known regex abuser anyway:) If you posted a small example of the task you want to solve, it would be fun to compare solutions. Minor nit: Shouldn't your first example be `/(foo)+ #<<? (bar)? (bif){3,5} /x;` [download] Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller If I understand your problem, I can solve it! Of course, the same can be said for you.	[reply] [d/l]
Re: Re: Re: Re: Abusing Regular Expressions by samtregar (Abbot) on Sep 27, 2003 at 20:23 UTC
Re: Abusing Regular Expressions by Abigail-II (Bishop) on Sep 27, 2003 at 10:55 UTC
Maybe need was the wrong word. I'm convinced it will be easier than building my own backtracking machine. Oh, it's probably easier than building your own backtracking machine. However, I'm not at all convinced you need backtracking to parse XML Schema. Heck, even Perl is parsable without backtracking - it's parsed by yacc, which uses limited lookahead. But then, I don't know too much about XML schema, so I might be wrong. Abigail	[reply]
Re: Re: Abusing Regular Expressions by samtregar (Abbot) on Sep 27, 2003 at 15:23 UTC
Re: Re: Re: Abusing Regular Expressions by qq (Hermit) on Sep 28, 2003 at 15:33 UTC
*If you squint and turn your head XML Schema's content models look a lot like primitive regular expressions.* Which, if I understand it correctly, is the approach taken by Relax NG. A RELAX NG schema specifies a pattern for the structure and content of an XML document. A RELAX NG schema thus identifies a class of XML documents consisting of those documents that match the pattern. http://xml.coverpages.org/relax.html (caveat: my knowledge of this is limited to light reading) Check out: Relax NG home xml.com compact syntax tutorial - which looks like regexen	[reply]
Re: Re: Re: Re: Abusing Regular Expressions by samtregar (Abbot) on Sep 29, 2003 at 05:00 UTC
Re: Abusing Regular Expressions by Enlil (Parson) on Sep 26, 2003 at 21:33 UTC
I have half a memory of seeing something written about using regular expressions to solve problems other than string parsing. Can anyone help me out with a link or two? Abigail(-II)? tends to do things with regex's that aren't plain string parsing (abuse them as you say). Super Search will undoubtedly turn up more, but here are a couple: Backtracking through the regex world Re: permuation algorithm -enlil	[reply]
Re: Re: Abusing Regular Expressions by samtregar (Abbot) on Sep 26, 2003 at 22:09 UTC
here are a couple Interesting. Both those examples show ways to use the regex engine to solve problems by creating a regex and then matching it against an empty string just to get the ball rolling. I think my problem will more likely result in a solution like: Transform schema into a regular expression which will match a correct set of nodes. Transform document stream into a delimited string form of some kind. Match schema regex from 1. against string from 2. If the match fails then the document is invalid. Can you remember seeing anything like that? Super Search will undoubtedly turn up more I did try SuperSearch before posting but I didn't find anything useful. What would you suggest I use for search terms? -sam	[reply]
Re: Abusing Regular Expressions by bsb (Priest) on Oct 03, 2003 at 03:15 UTC
Some regex abuse of my own Generating regex strings with a regex I bumped into Perl optimizations and found ways around. Might be handy. (I didn't go blind)	[reply]