dalbaranster has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm working with an application that reads some plain-text configuration files like this:

(event "LinkUp " (match "" (property_1 "" (option_1 "192.168.0.1") ) (property_2 "" (option_2 "6") ) (property_3 "" (option_1 "") (value "1234") ) ) )

The main issue is syntax errors in this kind of file: a missing parenthesis, bracket, or quote, or a tag with the wrong name (the application is case-sensitive). There are also some rules inside the configuration file, like:

property_1 can be an empty value "" or include the option_1 with a value

property_2: can be a number 1,2,3,4,5,6,7

property_3: has 3 options: an empty value "", a fixed value inside the double quotes "1", or the option_1 tag

I have basic experience with Perl and have used some CPAN modules in the past, but honestly I don't know how to write a syntax checker. Do you know of a CPAN module that can help me, or can you please give me some ideas? Thanks in advance.

Replies are listed 'Best First'.
Re: Make a Syntax checker
by GrandFather (Saint) on Feb 27, 2012 at 23:29 UTC

    The full-on module you need to help write a syntax checker is Parse::RecDescent (or maybe Marpa), although it's a lot to get your head around! An easier approach may be Parse::RecDescent::Simple, although I have no personal experience with it. If your grammar is very simple (it seems to be), you may find that Text::Balanced does the tricky bit of matching parentheses and that the rest of the checking is fairly simple to code yourself.
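    A minimal sketch of the Text::Balanced idea, assuming the whole configuration is one parenthesised group. The helper name first_balanced_group is made up for illustration; extract_bracketed is the real Text::Balanced function, and the '"' in its delimiter string asks it to ignore parentheses that appear inside double-quoted strings.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Hypothetical helper: return the first balanced "( ... )" group in $text,
# or undef when the parentheses do not balance.
sub first_balanced_group {
    my ($text) = @_;
    # List context: (extracted, remainder, prefix); extracted is undef on failure.
    my ($group, $rest) = extract_bracketed($text, '(")');
    return $group;
}

my $good = '(event "LinkUp " (match "" (property_2 "" (option_2 "6") ) ) )';
my $bad  = '(event "LinkUp " (match "" (property_2 "" (option_2 "6") ) )';

for my $config ($good, $bad) {
    print defined first_balanced_group($config) ? "balanced\n" : "unbalanced\n";
}
```

    This only answers "do the parentheses match?"; the tag names and values still have to be checked separately.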

    True laziness is hard work
Re: Make a Syntax checker
by pemungkah (Priest) on Feb 28, 2012 at 00:23 UTC
    Let me make a suggestion. It looks like what you need is, for lack of a better term, a pragmatic parser.

    Such a parser expects its input to be bad: missing stuff, extra crud, bad case... Unfortunately, the standard parsing mechanisms and modules don't handle this well. In general they expect a set of tokens (easily distinguished parts of the input) arranged in a specific order defined by a grammar. Most standard parsing algorithms are big on "input will be perfect and I will complain if not", so they handle errors badly.

    In your case, you have the likelihood of both errors in your tokens (casing and quoting) and grammatical errors in the input (missing parens, misordered tokens, etc.). There are several approaches to this:

    First, redefine the input language as something that is easier to parse. You have reasonable rules as to what's allowed for each property, so define the simplest possible input for it:

    Event: Linkup
    1: 192.168.0.1
    2: 3
    3: empty
    I'm not sure about #3, as I didn't quite get what values option1 is allowed to have, nor whether they have a different meaning than the value does. If so, you could simply make "3" be "3v" or "3o" instead. Notice that we have eliminated all the stuff that is hard for humans: no quoting, no paren matching, no case-sensitive stuff. This is also really easy to parse: read the line, split on /\s*:\s*/ (a colon with or without spaces around it on either side), and slap the pairs into a hash. You can then write a file in the expected, paren-filled syntax by substituting the values into a template. I've specifically added a keyword to denote that you wanted to have an empty value: "empty".
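    As a rough sketch of that read-split-hash step, assuming one "key: value" pair per line. The names parse_simple and render are made up for illustration, and the output layout in render (in particular, writing item 3 as a value tag) is only a guess at what the application wants.

```perl
use strict;
use warnings;

# Parse the simplified "key: value" format into a hash, one pair per line.
# Keys are lowercased so "Event" and "event" are treated alike.
sub parse_simple {
    my ($text) = @_;
    my %config;
    for my $line (split /\n/, $text) {
        next unless $line =~ /\S/;          # skip blank lines
        $line =~ s/^\s+|\s+$//g;            # trim both ends
        my ($key, $value) = split /\s*:\s*/, $line, 2;
        $config{lc $key} = defined $value ? $value : '';
    }
    return \%config;
}

# Substitute the values into a template of the paren syntax.
# The keyword "empty" is written back out as "".
sub render {
    my ($c) = @_;
    my %v = map { $_ => ($c->{$_} eq 'empty' ? '' : $c->{$_}) } keys %$c;
    return sprintf
        '(event "%s" (match "" (property_1 "" (option_1 "%s") ) '
      . '(property_2 "" (option_2 "%s") ) (property_3 "" (value "%s") ) ) )',
        @v{qw(event 1 2 3)};
}

my $config = parse_simple("Event: Linkup\n1: 192.168.0.1\n2: 3\n3: empty\n");
print render($config), "\n";
```

    For anything bigger than this, a real templating module would be a better fit than sprintf.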

    Since this strips off all the crud, it's easy to diagnose errors too: "You forgot item 2, which should be a number from 1 to 7." "You didn't supply a value for 3, which needs to be <whatever> or 'empty'". It's ultrasimplified, so the user should have little trouble writing it.
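    Those diagnostics could be produced along these lines. This is a sketch only: validate is a hypothetical name, it takes a hash keyed by item number as in the simplified format, and it checks just the two rules quoted above.

```perl
use strict;
use warnings;

# Return a list of human-readable complaints; an empty list means "looks OK".
sub validate {
    my ($c) = @_;
    my @errors;

    if (!defined $c->{2} || $c->{2} eq '') {
        push @errors, "You forgot item 2, which should be a number from 1 to 7.";
    }
    elsif ($c->{2} !~ /\A[1-7]\z/) {
        push @errors, "Item 2 is '$c->{2}', but it should be a number from 1 to 7.";
    }

    if (!defined $c->{3} || $c->{3} eq '') {
        push @errors, "You didn't supply a value for 3; give a value or 'empty'.";
    }

    return @errors;
}

print "$_\n" for validate({ 1 => '192.168.0.1', 2 => '9' });
```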

    The over-engineered approach involves creating a parser for the original language by adapting one of the standard parsing algorithms to add enough backtracking to handle the most common errors. You'll need to make the tokenization smarter (I expected a quote but see a "1". Since I'm at the position where I'd expect a quoted string, I'll insert a quote and continue), and the grammar will actually need to "parse" the most common errors and note them for a final set of diagnostics. For example, I've just received an open paren; this might be the start of a nested list in this property, or the previous property might have a missing close paren. I'll assume this is a nested property for now, and mark the input token string here. Oh look! The next token is "property2", so my guess was wrong. I'll push the tokens back to the mark onto the input stream again, with a close paren in front of it, and then rewind the computation to that point.
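    To give a flavour of that kind of recovery without writing the full backtracking parser, here is a much simpler stand-in: a single pass that tracks paren depth and resynchronises when a known tag shows up deeper than it should. The depth table is an assumption read off the sample file, and check_nesting and expected_depth are made-up names.

```perl
use strict;
use warnings;

# Expected nesting depth per tag family (assumed from the sample file).
my @LEVELS = (
    [ qr/\Aevent\z/,                 1 ],
    [ qr/\Amatch\z/,                 2 ],
    [ qr/\Aproperty_\d+\z/,          3 ],
    [ qr/\A(?:option_\d+|value)\z/,  4 ],
);

sub expected_depth {
    my ($tag) = @_;
    for my $level (@LEVELS) {
        return $level->[1] if $tag =~ $level->[0];
    }
    return;    # unknown tag
}

# Walk the tokens, tracking depth. When a known tag opens deeper than
# expected, assume close parens were dropped before it, note that, and
# reset the depth so checking can continue.
sub check_nesting {
    my ($text) = @_;
    my ($depth, @notes) = (0);
    while ($text =~ /\G\s*(?:\(\s*(\w+)|(\))|"[^"]*"|([^\s()"]+|\S))/gc) {
        my ($tag, $close, $other) = ($1, $2, $3);
        if (defined $tag) {
            $depth++;
            my $want = expected_depth($tag);
            if (!defined $want) {
                push @notes, "unknown tag '$tag'";
            }
            elsif ($depth > $want) {
                push @notes, sprintf "missing %d ')' before '%s'",
                    $depth - $want, $tag;
                $depth = $want;
            }
        }
        elsif (defined $close) {
            if ($depth > 0) { $depth-- }
            else            { push @notes, "extra ')'" }
        }
        elsif (defined $other) {
            push @notes, "unexpected token '$other'";
        }
    }
    push @notes, "missing $depth ')' at end of input" if $depth > 0;
    return @notes;
}
```

    Because the tag lookup is case-sensitive, a miscased tag such as Property_1 is reported as unknown, which covers the casing errors as well.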

    You can use your existing files to train your parser by starting out assuming everything is perfect, and throwing errors that you'll add to the pragmatic checking and assumptions that the backtracking will make. Obviously this is stupid hard. However, you would learn boatloads in the process of writing the second approach; certainly a YAPC-talk worth. This, however, isn't pragmatic: it assumes you have huge time resources to throw at this problem.

    Let's combine these strategies to come up with something that's the best of both. Define another grammar to be used with a standard parsing algorithm, but in this one, you use the same keyword names as you did in the simplified language. Now you'll run multiple passes over the input, cleaning up the input a little more each time.

    Pass 1: remove all parens, colons, and quotes. If you find two quotes together, a quote and a close paren, or a quote and a newline, change that to the keyword "empty" (and make a note of that). Now write a parser for this language, which can actually be pretty strict -- very close to the language in option 1, in fact, since you've reduced the input variation a lot; validate that the data is good; then rewrite the file from a template using the values you got! If you're careful about your wording, you should be able to relate the values back to the original input (e.g., "for item 2, you're missing a close quote or close paren") without having to be ultra-precise or letting the user know that you stripped out all those parens and quotes and stuff. You might note during the first pass if you got any funny-looking tokens, and emit warnings ("missing a quote on line 5: '198.134.2.2<-- HERE; inserting and continuing") as appropriate.
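    Pass 1 might look something like the following. It is a sketch under a couple of assumptions: values contain no spaces or parens of their own, and "a quote and a close paren" means a lone quote sitting right before the paren. The names cleanup_pass and string_token are invented for the example.

```perl
use strict;
use warnings;

# Decide what a quoted value turns into. $content is undef when the match
# was a lone quote rather than a complete "..." string.
sub string_token {
    my ($content, $notes) = @_;
    return " $content " if defined $content && $content =~ /\S/;
    push @$notes, 'lone quote treated as an empty value' unless defined $content;
    return ' empty ';
}

# Pass 1: flatten the paren syntax into a bare list of tokens and collect
# notes about anything that had to be repaired along the way.
sub cleanup_pass {
    my ($text) = @_;
    my @notes;

    # Complete strings first; then a lone quote followed by ")" or end of line.
    $text =~ s{"([^"()\n]*)"|"(?=[ \t]*(?:\)|\n|\z))}{ string_token($1, \@notes) }ge;

    # Any quote still present has no partner.
    my $stray = ($text =~ tr/"//);
    push @notes, "$stray unmatched quote(s) removed" if $stray;

    $text =~ tr/():"/ /;    # parens, colons and leftover quotes become spaces
    my @tokens = split ' ', $text;
    return (\@tokens, \@notes);
}

my ($tokens, $notes) = cleanup_pass('(property_3 "" (option_1 ") )');
print "tokens: @$tokens\n";
print "note: $_\n" for @$notes;
```

    The token list is what the strict second-pass parser would consume, and the notes are what gets printed after the first pass.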

    I'm handwaving details, but it's really not all that bad. The other suggestions (particularly Marpa) should be quite helpful in implementing this option. You'll need to write a little cleanup loop, with the appropriate code to push cleanup notes onto an array, which you'll print after the first pass; a second pass which will take the tokenized, simplified input and hand it to one of the standard parsing algorithms; and a final bit of code that takes the parsed input, validates that the data we got is OK, and then rewrites the input file using Text::Template or Template::Toolkit to create a definitely-OK, syntactically-valid final file. Obviously, if either of the first two loops gets to a point where it can't continue (pass one gets no input or finds illegal characters, pass two can't parse what it gets), we don't rewrite the output file but flag it as needing human attention.

    This should solve a good 80-90% of your problem, and show you the rest that it can't handle. Minor edit: duplicated word.

Re: Make a Syntax checker
by JavaFan (Canon) on Feb 27, 2012 at 22:37 UTC
    You say you want a syntax checker, but in your description it looks like you want to check for particular values.

    Your syntax looks like it's Lisp or Scheme. You may want to search for a suitable module on CPAN. Or perhaps you can tokenize your configuration, and turn it into XML, and then use an XML parser. In XML, your configuration could look like:

    <event arg1 = "LinkUp ">
      <match arg1 = "">
        <property_1 arg1 = "">
          <option_1 arg1 = "192.168.0.1"></option_1>
        </property_1>
        <property_2 arg1 = "">
          <option_2 arg1 = "6"></option_2>
        </property_2>
        <property_3 arg1 = "">
          <option_1 arg1 = ""></option_1>
          <value arg1 = "1234"></value>
        </property_3>
      </match>
    </event>
    (XML is just very verbose LISP)
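    The tokenize-and-convert step could be sketched roughly as below, assuming every tag is followed by exactly one double-quoted argument, as in the sample. sexp_to_xml and xml_escape are made-up names, and the arg1 attribute just follows the naming used above.

```perl
use strict;
use warnings;

# Escape the characters that are special inside an XML attribute value.
# (A double quote cannot occur: the tokenizer below never captures one.)
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Convert '(tag "arg" ...children... )' into nested XML elements,
# dying with a short message when the input cannot be tokenized.
sub sexp_to_xml {
    my ($text) = @_;
    my @open;          # stack of currently open tag names
    my $xml = '';
    while ($text =~ /\G\s*(?:\(\s*(\w+)\s*"([^"]*)"|(\)))/gc) {
        my ($tag, $arg, $close) = ($1, $2, $3);
        if (defined $tag) {
            $xml .= sprintf '<%s arg1="%s">', $tag, xml_escape($arg);
            push @open, $tag;
        }
        else {
            die "unexpected ')'\n" unless @open;
            $xml .= '</' . pop(@open) . '>';
        }
    }
    my $pos = pos($text) || 0;
    die "cannot parse near: " . substr($text, $pos, 20) . "\n"
        if substr($text, $pos) =~ /\S/;
    die "missing ')' for <$open[-1]>\n" if @open;
    return $xml;
}

print sexp_to_xml('(event "LinkUp " (match "" (property_2 "" (option_2 "6") ) ) )'), "\n";
```

    The result can then be handed to whichever XML parser or schema validator you prefer for the value checks.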
Re: Make a Syntax checker
by Anonymous Monk on Feb 28, 2012 at 02:19 UTC