I don't want to sound pedantic, but the very fact that a file such as perly.y exists proves that Perl is a CFL. Tools such as yacc, Bison, and the like are used to create parsers for context-free languages. This means that Perl's grammar defines a CFL.
With all due respect, this is utterly and completely wrong. Please review the Red Dragon book for a thorough explanation.
A CFL/CFG cannot express the requirement that a variable must be declared before it is used. Furthermore, a CFL/CFG cannot express arbitrarily nested constructs. All of this can be definitively proved using proper regular expressions (not Perl's irregular expressions), which are actually context-free.
A yacc/Bison grammar expresses the context-free components of a language specification. The symbol-table manipulation and other such code attached to the various grammatical rules handle the non-context-free parts of the language.
The fact is that most languages are primarily context-free, but they almost always have some aspect of context involved. Tools like Bison/yacc are used to simplify (and optimize) the process of constructing the context-free part of the grammar. (And to be honest, hand-constructing LR parsers is very, very difficult. LR parsers are preferred because they don't suffer from the problem that most recursive-descent parsers do: not being able to handle left recursion.)
To the OP: There are very few useful GPPLs (general-purpose programming languages) that are truly context-free. In fact, to my knowledge there are none, but I expect I'm wrong on that. Nevertheless, ANY language that uses arbitrarily nested parentheses (as in, to any level of nesting), or that requires that identifiers be declared before being used (both of which Perl requires), is NOT context-free. Thus all of the following are not context-free: Perl, C, C++, Java, Turing, Ada, Pascal, Fortran, Cobol, Modula-2...
UPDATED: As thraxil and thelenm corrected me, the struck-out text is wrong. I got carried away and mixed up proper regular expressions and context-free grammars. Thanks, and sorry.
--- demerphq
my friends call me, usually because I'm late....
I think I need a vacation
| [reply] |
...ANY language that uses arbitrarily nested parentheses (as in, to any level of nesting)...
The rest of your post is spot on (++), but this part I believe is wrong. Arbitrarily nested parentheses are a canonical example of a construct that can't be matched by a finite acceptor (DFA|NFA) but can easily be matched by a pushdown automaton (just push every time you see an open paren, pop every time you see a close paren), and hence are valid in context-free languages. Almost every textbook on automata and formal languages I've seen uses that as the transition between the Regular Languages chapter and the Context-Free Languages chapter.
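The push/pop recipe described here can be sketched in a few lines of Python (my illustration, not anything from the thread); since there is only one kind of bracket, the stack degenerates into a simple counter:

```python
def balanced(s: str) -> bool:
    """Recognize balanced parentheses pushdown-style: 'push' (increment)
    on every '(' and 'pop' (decrement) on every ')'."""
    depth = 0
    for ch in s:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:       # a ')' with no matching '(' -- reject early
                return False
    return depth == 0           # accept only if every '(' was closed
```

No finite amount of state can do this job, because `depth` has no upper bound: that is exactly why the language is context-free but not regular.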
anders pearson
| [reply] |
Hmm. OK, it's quite possible I got that wrong. I'll double-check in the Dragon when I get home (hopefully soon).
But surely the stack itself is a form of context? I mean, without _somehow_ counting how many parens you have gone through, how do you determine whether all the parens have been properly balanced? And counting would seem to me to be a form of context. Also, how does such an approach determine that ([)] is an error without context? (Hmm, is that last one a good point? I'm not really sure.)
BTW, I'm really not offering the above as a definitive statement (just in case anybody missed the ?), just my gut reaction.
Thanks for the reply. I had a feeling after I had posted it that someone would call me on this one, and that I wouldn't be able to properly respond until I get home. *sigh*
--- demerphq
my friends call me, usually because I'm late....
| [reply] [d/l] |
Oooh, FLASHBACK.
Dragon book, compiler course, writing basic interpreter, writing P compiler, much drinking, hangovers, expressions accidentally evaluating right-to-left, claiming to prof that this was an homage to APL, fun fun... :)
--
Mike
| [reply] |
my @sorted = map  { pop @$_ }
             sort { $a->[0] <=> $b->[0] }
             map  { [ some_funky_key_calculator($_), $_ ] } @objects;
Although people don't notice so much when it's bottom-to-top, for some reason. :-) BTW, tilly didn't like this at all. He wanted a special operator so it would read left to right. /Me, I like it. But I'm a bit backwards sometimes... :-)
--- demerphq
my friends call me, usually because I'm late....
| [reply] [d/l] |
You are perfectly right in remarking that the implementation of a language has non-CFL aspects, but I never claimed otherwise.
As you mention yourself, the grammar which is represented in the yacc/Bison file is a CFL, and that is exactly what I claimed.
Besides grammar there is indeed semantics (or implementation as I called it), and that is a different matter indeed. Only I was not talking about that, so please don't suggest I'm claiming things I'm not.
Besides specifying the grammar, one has to make it "do" something using some programming language, and hence one is specifying the semantics in some RE (recursively enumerable) formalism.
Incidentally, it would be weird if one could get recursive enumerability with a pure CFL, wouldn't it? So no, I'm not claiming that.
I'm familiar with the Dragon, having implemented a number of parsers for commercial applications.
Regards, -gjb-
| [reply] |
I'm sorry to say this, but you're at least partially mistaken. (And yes, I have been reading the Dragon Book recently... quite a lot, in fact.)
Your mistake is based on your definition of the term context. Based on the way you've used the term in all the posts I've seen so far, you're using that word to mean what language developers call state, not the Chomskian sense, which differentiates:
- Type 3 Grammars aka: Regular Expressions
- Type 2 Grammars aka: Context-Free Grammars
- Type 1 Grammars aka: N-Character Lookahead Grammars
- and Type 0 Grammars aka: Recursively Enumerable Languages
A Regular Expression has state in the sense that the lexer expects to see a given set of characters in a given order. The key fact about state is that the current state-value tells the lexer everything it needs to know about the sequence of input it's seen so far.
For example, the regular expression a+b+ has three states:
- State 0: no input has been read at all.
- State 1: the input so far has been an unbroken sequence of 'a's (a+)
- State 2: the input started with a sequence of 'a's, but is now an unbroken sequence of 'b's (a+b+)
From state 0, if the lexer sees an 'a', it moves to state 1. If the lexer sees another 'a' in state 1, it stays in state 1. If the lexer sees a 'b' in state 1, it advances to state 2. if the lexer sees another 'b' in state 2, it stays in state 2.
Clearly you have to feed the lexer the right input in the right order; this is stateful behavior, but it is not contextual behavior in the Chomskian sense.
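The three-state walk described above can be written out directly (a small Python sketch of my own, tracking exactly the states named in the post):

```python
def matches_a_plus_b_plus(s: str) -> bool:
    """Simulate the three-state DFA for the regular expression a+b+."""
    state = 0                       # state 0: no input has been read at all
    for ch in s:
        if state == 0 and ch == 'a':
            state = 1               # state 1: an unbroken run of 'a's
        elif state == 1 and ch == 'a':
            state = 1               # more 'a's: stay in state 1
        elif state == 1 and ch == 'b':
            state = 2               # state 2: now in a run of 'b's
        elif state == 2 and ch == 'b':
            state = 2               # more 'b's: stay in state 2
        else:
            return False            # any other transition rejects
    return state == 2               # accept only after a+ then b+
```

Note that the current value of `state` really is everything the recognizer remembers about the input so far, which is the "stateful but not contextual" point being made.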
A Context-Free Grammar is defined in terms of productions, each of which amounts to its own state machine. The trick here is that the productions can refer to each other recursively, which makes it impossible to identify a well-formed production strictly based on immediate state, like regular expressions can.
For example, the Context-Free Grammar that represents nested parentheses looks like so:
group -> (group)
| e
where the 'e' stands for 'epsilon', which denotes the empty string.
We can't match that grammar with a regular expression, because just knowing that we've seen an unbroken sequence of left-parens doesn't tell us how many right-parens we need to see to balance the production. We can recognize this grammar by pushing state 1 (saw a left-paren) on a stack every time we come to the 'group' part, though. Then we start back at state 0 (no input so far), until we come to a null string between parens. Then we start popping our way out through the stack, advancing to state 2 every time we see a right paren.
This is still stateful behavior, though. The parser still carries all the information it needs in the current state. We just have to add a few more rules to carry state information across the pushes and pops, i.e.: every push takes the system to state 0 of the upcoming production, and every pop takes the system to the next state in the outer production.
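One way to make the push/pop reading of that production concrete is a tiny recursive-descent recognizer (my own sketch; the recursion stack plays the role of the explicit pushdown stack described above):

```python
def parse_group(s, i=0):
    """Try to match  group -> '(' group ')' | epsilon  starting at
    position i; return the position after the match, or None on failure."""
    if i < len(s) and s[i] == '(':
        j = parse_group(s, i + 1)          # "push": recurse into the inner group
        if j is not None and j < len(s) and s[j] == ')':
            return j + 1                   # "pop": consume the matching ')'
        return None                        # unmatched '('
    return i                               # epsilon: match the empty string

def is_group(s):
    """Accept only if the whole input is one fully nested group."""
    return parse_group(s) == len(s)
```

As written the grammar generates only fully nested strings like `((()))`, not concatenations like `()()`, and the recognizer mirrors that exactly.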
In an N-Character Lookahead Grammar, the productions are ambiguous. You end up with some combination of symbols that can match more than one production. However, that ambiguity is always guaranteed to resolve itself within N characters of the ambiguous sequence.
For example, the grammar:
foo -> A+
bar -> B+
baz -> foo bar
quux -> foo B C
contains three terminal symbols, A, B, and C, and two ambiguous productions: 'baz' and 'quux'. When the parser sees a series of 'A's followed by one B, it doesn't know whether it's seeing a complete 'baz', or a partial 'quux'.
The parser can solve the ambiguity by looking at the next input symbol, though. If the next symbol is a B, the parser can be certain that it's looking at a 'baz'. If the next symbol is a C, the parser can be certain that it's looking at a 'quux'. Therefore, the above grammar requires 1 character of lookahead for the parser to completely determine the validity (and identity) of an input string.
For Chomsky's meaning of the word 'context', the above grammar is contextual within a one-character linear bound.
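A one-token-lookahead recognizer for that little grammar might look like this (a Python sketch of my own, following the post's A/B/C productions):

```python
def classify(tokens):
    """Return 'baz' (foo bar = A+ B+), 'quux' (foo B C), or None,
    using one token of lookahead after the first B."""
    i = 0
    # foo -> A+ : both productions start with at least one A
    if i >= len(tokens) or tokens[i] != 'A':
        return None
    while i < len(tokens) and tokens[i] == 'A':
        i += 1
    # both productions now require a B
    if i >= len(tokens) or tokens[i] != 'B':
        return None
    i += 1
    # the ambiguous point: one token of lookahead resolves it
    if i < len(tokens) and tokens[i] == 'C':
        return 'quux' if i + 1 == len(tokens) else None
    # otherwise we must be in bar -> B+, i.e. a 'baz'
    while i < len(tokens) and tokens[i] == 'B':
        i += 1
    return 'baz' if i == len(tokens) else None
```

After the first B the parser peeks at exactly one more token: a C commits it to 'quux', anything else (including end of input) commits it to 'baz'.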
In a Recursively Enumerable Grammar almost all bets are off. The most general description for an REG is "the set of all strings on which a given Turing machine will halt."
Please note that this definition does not contain the 'well-formed algorithm' condition, 'and halts in a reasonable (finite) amount of time.' The halting problem gives a Type-0 grammar, because you will get a list of all strings on which the machine halts eventually... you just have to let some of them run not-quite-forever.
-----
In other matters, the requirement that a symbol be defined before it can be used is a semantic rule, not a syntactic rule. The parser won't catch that kind of error; the compiler will catch it during syntax-directed analysis of the parse tree.
To flip that idea on its head, the fact that you failed to define a symbol before using it doesn't make the statement which uses that symbol syntactically incorrect. It's semantically wrong, yes, but the statement itself can still be well-formed.
Grammar and semantics have only a nodding acquaintance in most programming languages.
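Python shows the same split very directly (an illustration of my own, not anything Perl-specific): the parser happily accepts a statement that uses an undefined name, and only execution notices the semantic problem.

```python
import ast

src = "result = undefined_symbol + 1"

# Syntactically well-formed: the parser builds a tree without complaint.
tree = ast.parse(src)

# Semantically wrong: only execution notices the undefined name.
try:
    exec(compile(tree, "<example>", "exec"), {})
    semantic_error = None
except NameError as err:
    semantic_error = err

print(type(tree).__name__)   # Module -- parsing succeeded
print(semantic_error)        # the NameError raised at run time
```

The statement is perfectly well-formed as far as the grammar is concerned; the complaint about `undefined_symbol` belongs to a later, semantic phase.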
-----
To get back to the original question, no, Perl is not Context-Free. It has a set of operator-precedence rules that amount to N-character lookahead, thus making it a Type-1 grammar.
One example of those rules would be the expression: $x + $y ** 2. The parser has to look ahead to see the exponentiation operator before it can decide whether to interpret the expression as ($x + $y)**2 or as $x + ($y ** 2). That makes it context-sensitive, and thus not a CFG.
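One way to see how that expression actually gets grouped is to ask a parser for its tree. Here is a Python illustration of my own (using Python's `ast` module and Python's precedence rules, which happen to agree with Perl's for `+` and `**`):

```python
import ast

# Parse the expression and inspect how the operators bound.
expr = ast.parse("x + y ** 2", mode="eval").body

print(type(expr.op).__name__)        # Add -- '+' sits at the top of the tree,
print(type(expr.right.op).__name__)  # Pow -- so 'y ** 2' bound tighter
```

The exponentiation ends up as the right-hand child of the addition, i.e. the expression was grouped as `x + (y ** 2)`.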
| [reply] [d/l] [select] |
Thank you very much for the detailed correction and explanation. You are very correct that I was mixing up state and context. After this node happened, I had the opportunity to review the Red Dragon in detail and realized much of what you posted here, and what I had gotten wrong in my earlier posts. (Apologies to all involved.) Unfortunately, I haven't had the time to be online, so I couldn't redress the situation. :(
Incidentally, if being wrong elicits corrections this good then I don't mind being wrong occasionally.
++ Excellent reply.
--- demerphq
my friends call me, usually because I'm late....
| [reply] |
Thanks for all the replies. I really do appreciate it. Theory is wonderful, but I'm discovering that not having any application hurts both my motivation and understanding. ;-)
Unfortunately, I've discovered that most of my fellow students have not figured out even what the course is about!
- dystrophy -
| [reply] |