I'm sorry to say this, but you're at least partially mistaken. (And yes, I have been reading The Dragon Book recently... quite a lot, in fact.)
Your mistake lies in your definition of the term context. In all the posts of yours I've seen so far, you're using that word to mean what language developers call state, not context in the Chomskian sense, which differentiates:
- Type 3 Grammars aka: Regular Expressions
- Type 2 Grammars aka: Context-Free Grammars
- Type 1 Grammars aka: N-Character Lookahead Grammars (more standardly known as Context-Sensitive Grammars)
- and Type 0 Grammars aka: Recursively Enumerable Languages
A Regular Expression has state in the sense that the lexer expects to see a given set of characters in a given order. The key fact about state is that the current state-value tells the lexer everything it needs to know about the sequence of input it's seen so far.
For example, the regular expression a+b+ has three states:
- State 0: no input has been read at all.
- State 1: the input so far has been an unbroken sequence of 'a's (a+)
- State 2: the input started with a sequence of 'a's, but is now an unbroken sequence of 'b's (a+b+)
From state 0, if the lexer sees an 'a', it moves to state 1. If the lexer sees another 'a' in state 1, it stays in state 1. If the lexer sees a 'b' in state 1, it advances to state 2. If the lexer sees another 'b' in state 2, it stays in state 2.
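Those transitions are easy to sketch in code. Here's a minimal hand-rolled DFA in Python (my own illustration, using exactly the three states described above):

```python
# A hand-rolled DFA for the regular expression a+b+, using the three
# states described above. Illustrative sketch only.

def matches_a_plus_b_plus(s):
    state = 0  # State 0: no input has been read at all
    for ch in s:
        if state == 0 and ch == 'a':
            state = 1          # first 'a' seen: now inside a+
        elif state == 1 and ch == 'a':
            state = 1          # still reading the unbroken run of 'a's
        elif state == 1 and ch == 'b':
            state = 2          # switched over to b+
        elif state == 2 and ch == 'b':
            state = 2          # still reading the unbroken run of 'b's
        else:
            return False       # any other transition is a dead end
    return state == 2          # must end having seen at least one 'b'

print(matches_a_plus_b_plus("aaabb"))  # True
print(matches_a_plus_b_plus("ba"))     # False
```

Notice that the single integer `state` really is everything the lexer knows about the input so far, which is the point.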
Clearly you have to feed the lexer the right input in the right order. That is stateful behavior, but it is not contextual behavior in the Chomskian sense.
A Context-Free Grammar is defined in terms of productions, each of which amounts to its own state machine. The trick here is that the productions can refer to each other recursively, which makes it impossible to recognize a well-formed production from the immediate state alone, the way a regular expression can.
For example, the Context-Free Grammar that represents nested parentheses looks like so:
group -> (group)
| e
where the 'e' stands for 'epsilon', which equals a null string.
We can't match that grammar with a regular expression, because just knowing that we've seen an unbroken sequence of left-parens doesn't tell us how many right-parens we need to see to balance the production. We can recognize this grammar by pushing state 1 (saw a left-paren) on a stack every time we come to the 'group' part, though. Then we start back at state 0 (no input so far), until we come to a null string between parens. Then we start popping our way out through the stack, advancing to state 2 every time we see a right paren.
This is still stateful behavior, though. The parser still carries all the information it needs in the current state. We just have to add a few more rules to carry state information across the pushes and pops, i.e.: every push takes the system to state 0 of the upcoming production, and every pop takes the system to the next state in the outer production.
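The push-and-pop recognizer described above can be sketched in a few lines of Python (again, my own illustration; note that this grammar generates only fully nested strings like '((()))', not sequences like '()()'):

```python
# A stack-based recognizer for the grammar: group -> (group) | e
# Every left paren pushes "state 1" onto a stack; every right paren
# pops back out through the enclosing productions.

def is_nested_group(s):
    stack = []
    i = 0
    # every '(' pushes us into state 0 of a new 'group' production
    while i < len(s) and s[i] == '(':
        stack.append(1)        # state 1: saw a left paren
        i += 1
    # at the null string between the parens, start popping our way out
    while i < len(s) and s[i] == ')' and stack:
        stack.pop()            # state 2: matched a right paren
        i += 1
    # well-formed iff we consumed all input and emptied the stack
    return i == len(s) and not stack

print(is_nested_group("((()))"))  # True
print(is_nested_group("(()"))     # False
```

The stack is the extra machinery that a plain finite-state machine lacks, and it is exactly what lets the parser remember how many right parens it still owes.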
In an N-Character Lookahead Grammar, the productions are ambiguous. You end up with some combination of symbols that can match more than one production. However, that ambiguity is always guaranteed to resolve itself within N characters of the ambiguous sequence.
For example, the grammar:
foo -> A+
bar -> B+
baz -> foo bar
quux -> foo B C
contains three terminal symbols, A, B, and C, and two ambiguous productions: 'baz' and 'quux'. When the parser sees a series of 'A's followed by one B, it doesn't know whether it's seeing a complete 'baz', or a partial 'quux'.
The parser can solve the ambiguity by looking at the next input symbol, though. If the next symbol is a B, the parser can be certain that it's looking at a 'baz'. If the next symbol is a C, the parser can be certain that it's looking at a 'quux'. Therefore, the above grammar requires 1 character of lookahead for the parser to completely determine the validity (and identity) of an input string.
For Chomsky's meaning of the word 'context', the above grammar is contextual within a one-character linear bound.
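Here's that one-character-lookahead decision written out in Python (a toy classifier of my own; tokens are single characters 'A', 'B', 'C'):

```python
# A toy parser for the grammar:
#   foo -> A+    bar -> B+    baz -> foo bar    quux -> foo B C
# After matching A+ followed by one B, 'baz' and 'quux' are ambiguous;
# one token of lookahead resolves it.

def classify(tokens):
    i = 0
    if i >= len(tokens) or tokens[i] != 'A':
        return None
    while i < len(tokens) and tokens[i] == 'A':   # foo -> A+
        i += 1
    if i >= len(tokens) or tokens[i] != 'B':
        return None
    i += 1   # consumed one B; at this point baz and quux are both possible
    # the one-character lookahead: a C means quux, anything else means baz
    if i < len(tokens) and tokens[i] == 'C':
        return 'quux' if i + 1 == len(tokens) else None   # quux -> foo B C
    while i < len(tokens) and tokens[i] == 'B':           # bar -> B+
        i += 1
    return 'baz' if i == len(tokens) else None

print(classify("AAABBB"))  # baz
print(classify("AABC"))    # quux
```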
In a Recursively Enumerable Grammar almost all bets are off. The most general description for an REG is "the set of all strings on which a given Turing machine will halt."
Please note that this definition does not contain the 'well-formed algorithm' condition, 'and halts in a reasonable (finite) amount of time.' That's the halting problem in grammatical form: you will eventually get a list of all strings on which the machine halts... you just have to let some of them run not-quite-forever.
-----
In other matters, the requirement that a symbol be defined before it can be used is a semantic rule, not a syntactic rule. The parser won't catch that kind of error; the compiler will catch it during syntax-directed analysis of the parse tree.
To flip that idea on its head, the fact that you failed to define a symbol before using it doesn't make the statement which uses that symbol syntactically incorrect. It's semantically wrong, yes, but the statement itself can still be well-formed.
Grammar and semantics have only a nodding acquaintance in most programming languages.
-----
To get back to the original question, no, Perl is not Context-Free. It has a set of operator-precedence rules that amount to N-character lookahead, thus making it a Type-1 grammar.
One example of those rules would be the expression: $x + $y ** 2. The parser has to look ahead to see the exponentiation operator before it can decide whether to interpret the expression as ($x + $y)**2 or as $x + ($y ** 2). That makes it context-sensitive, and thus not a CFG.
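To make that lookahead concrete, here is a minimal precedence-climbing sketch (my own toy code, not Perl's actual parser) that peeks at the next operator token to decide the grouping:

```python
# A minimal precedence-climbing parser. It has to peek at the next
# operator (tokens[0]) before deciding whether the right operand binds
# tighter, which is exactly the lookahead described above.

PREC = {'+': 1, '**': 2}

def parse_expr(tokens, min_prec=0):
    # tokens is a list like ['x', '+', 'y', '**', '2']
    lhs = tokens.pop(0)
    while tokens and PREC[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # '**' is right-associative in Perl; '+' is left-associative
        next_min = PREC[op] if op == '**' else PREC[op] + 1
        rhs = parse_expr(tokens, next_min)
        lhs = (op, lhs, rhs)
    return lhs

print(parse_expr(['x', '+', 'y', '**', '2']))
# ('+', 'x', ('**', 'y', '2'))  -- i.e., x + (y ** 2), not ($x + $y) ** 2
```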