in reply to Re^8: Block-structured language parsing using a Perl module?
in thread Block-structured language parsing using a Perl module?

BrowserUk:

I'll have to disagree with you there. I really don't see any need for the lexer to do all the machinations you're worrying with. Traditionally, all that stuff would be in the parser. For example, the lexer for my current Marpa experiments is this:

#----------------------------------------------------------------------------
# LEXER section.  Grab tokens and feed to the recognizer
#
my %keywords = map { $_=>undef }
    # Keywords for a toy language
    qw( do else end goto if last loop next print program then while until ),
    # Artificial tokens
    qw( number string )
    ;

# Operators
$keywords{$$_[1]}='OP_' . uc $$_[0] for (
    [lparen=>'('],  [rparen=>')'],  [mult=>'*'],   [div=>'/'],
    [add=>'+'],     [subtr=>'-'],   [EOS=>';'],    [comment=>'#'],
    [EQ=>'='],      [NEQ=>'<>'],    [LT=>'<'],     [LTE=>'<='],
    [GT=>'>'],      [GTE=>'>='],    [EQADD=>'+='], [EQSUB=>'-='],
    [EQMUL=>'*='],  [EQDIV=>'/='],  [COMMA=>','],
);

my $FName = shift or die "Missing filename";
open my $FH, '<', $FName;

my $token;
my $cnt=0;
my $curpos=0;
my $fl_die=0;

OUTER: while (<$FH>) {
    s/\s+$//;
    printf "\n% 3u: %s\n", $., $_;
    pos($_)=0;
    while (!$fl_die) {
        /\G\s*/gc;
        $curpos = pos($_);
        last if $curpos>=length($_);
        ++$cnt;
        # last OUTER if $cnt>40;
        if    (/\G([-+\/*]=?|=)/gc) { $token=tk_xform('OP', $1) }
        elsif (/\G([;:,])/gc)       { $token=tk_xform('OP', $1) }
        elsif (/\G(<[=>]?|>=?)/gc)  { $token=tk_xform('OP', $1) }
        elsif (/\G(#.*)/gc)         { $token=['COMMENT', $1] }
        elsif (/\G(".*?")/gc)       { $token=['string',$1] }
        elsif (/\G(\d+)/gc)         { $token=['number', $1] }
        elsif (/\G(\w[_\w]*)/gc)    { $token=tk_xform('name', $1) }
        else {
            $token=['ERROR','UNEXPECTED INPUT', substr($_,pos($_))];
            ++$fl_die
        }
        print("ABEND (token #:$cnt\n") && last OUTER if $fl_die;

        next unless defined $token;

        if ($fl_trace) {
            print "     " . (" " x $curpos) . "^";
            no warnings;
            if    ($$token[0] eq 'COMMENT') { print "comment (ignored)" }
            elsif (!defined $$token[1])     { print $$token[0] }
            else                            { print "$$token[0]=$$token[1]" }
            print "\n";
        }
        next if $$token[0] eq 'COMMENT';

        # Feed the token into the parser
        if (@$token < 2) { push @$token, $$token[0]; }
        $P->read(@$token);
        #print "  progress: ", join(", ", map { "(".join(",",@$_).")" } @{$P->progress}), "\n";
        #print "  expected: ", join(", ", @{$P->terminals_expected}), "\n";
        $token=undef;
    }
}

As you can see, it's pretty straightforward, and it doesn't worry about tokens other than the current one. It's all of 70 lines, much of which is pretty-printing and diagnostic stuff. It doesn't care a whit about parenthesis rules and such; it just chops the text into convenient tokens and passes them along to the parser.

All the ambiguities you're wrestling with belong in the parser. The grammar is where we enforce token ordering, assign meaning to statements, and so on. So far my grammar looks like this (I'm still whacking away at the parser, so it's a work in progress):

my $TG = Marpa::R2::Grammar->new({
    start=>'FILE',
    actions=>'ToyLang',
    default_action=>'swallow',
    unproductive_ok=>[qw( FILE )],
    rules=>[
        # A file contains a PROGRAM and zero or more SUBROUTINES.  The
        # subroutine definitions may precede and/or follow the program.
        [ FILE=>[qw( PROGRAM name stmt_list )], 'swallow' ],   #PROGRAM FILE2)], ],
        [ FILE=>[qw( COMMENT FILE )], ],                       #stmt_list PROGRAM FILE2)], ],
#       [ FILE=>[qw( PROGRAM name stmt_list PROGRAM FILE2)], ],
#       [ FILE=>[qw( SUB name stmt_list sub FILE)], ],
#       [ FILE2=>[qw( SUB name stmt_list sub FILE2)], ],

        # A statement list consists of zero or more statements followed
        # by END.  We don't care whether or not there's an end of
        # statement marker before END.
        [ stmt_list=>[qw( END )], 'discard' ],
        [ stmt_list=>[qw( stmt stmt_list_2 )], 'first_on' ],
        [ stmt_list_2=>[qw( END )], 'discard' ],
        [ stmt_list_2=>[qw( OP_EOS END )] ],
        [ stmt_list_2=>[qw( OP_EOS stmt stmt_list_2 )], 'second_on' ],

#       [ stmt=>[qw( IF expr if_body )], ],
        [ stmt=>[qw( PRINT expr print_body )], ],
        [ stmt=>[qw( WHILE expr DO do_body )], ],
#       [ stmt=>[qw( DO do_body WHILE expr )], ],
        [ stmt=>[qw( name assop expr )], 'binary_op' ],

        [ do_body=>[qw( LOOP )], ],
        [ do_body=>[qw( stmt do_body_2 )], 'first_on' ],
        [ do_body_2=>[qw( LOOP )], ],
        [ do_body_2=>[qw( OP_EOS LOOP )], 'second_arg' ],
        [ do_body_2=>[qw( OP_EOS stmt do_body_2 )], 'second_arg' ],

        [ print_body=>[qw( OP_EOS )], ],
        [ print_body=>[qw( OP_COMMA expr print_body )], 'second_on' ],

        [ expr=>[qw( term )], 'first_arg' ],
        [ expr=>[qw( expr logop expr )], 'binary_op' ],
        [ term=>[qw( term addop term )], 'binary_op' ],
        [ term=>[qw( factor )], 'first_arg'],
        [ factor=>[qw( factor mulop factor )], 'binary_op'],
        [ factor=>[qw( name )], 'first_arg'],
        [ factor=>[qw( number )], 'first_arg'],
        [ factor=>[qw( string )], 'first_arg'],

        [ addop=>[qw( OP_ADD )], 'first_arg'],
        [ addop=>[qw( OP_SUB )], 'first_arg'],
        [ assop=>[qw( OP_EQ )], 'first_arg'],
        [ assop=>[qw( OP_EQADD )], 'first_arg'],
        [ logop=>[qw( OP_NEQ )], 'first_arg'],
        [ mulop=>[qw( OP_MUL )], ],
        [ mulop=>[qw( OP_DIV )], ],
    ],
});

It's a pretty simple grammar, but it's coming along nicely. I'm guessing that when the grammar is done, it'll be about twice this size.

Writing character-by-character grammar rules to recognize numbers, keywords, strings, comments, etc. would be a pain in BNF. I don't really look forward to writing a zillion BNF productions to specify what tokens look like character by character. But that sort of thing is trivial for regexes, so I split the lexer out into a simple bit of regex code to create the tokens, and the grammar is relatively straightforward too.

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Replies are listed 'Best First'.
Re^10: Block-structured language parsing using a Perl module?
by BrowserUk (Patriarch) on Aug 18, 2012 at 10:16 UTC
    I really don't see any need for the lexer to do all the machinations you're worrying with.

    I don't think you read what I wrote ... or I wrote it badly. I'm not worrying about any machinations in the lexer.

    I don't want (and assert, shouldn't have) to write my own lexer.

    it's pretty straightforward,

    For this language, maybe so; but imagine how much more straightforward it would be if you didn't have to write it at all.

    By definition, the grammar contains all the terminal symbols, and how those terminal symbols can be combined.

    It could produce your %keywords hash for you from the grammar, and in the process ensure that the grammar and the lexer's hash of tokens remain in synchronisation.

    But more than that, it also knows at each stage which token(s) are legal next in the language, going forward from the point it has currently reached, so it could inspect the next part of the data and very quickly determine whether what is there makes sense in context.

    I assert, it not only could, it should.
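
    Just to illustrate what I mean, a rough sketch in plain Perl over the rules=>[...] array you already have (nothing here is a Marpa API, and the rule list is abbreviated): any symbol that never appears as a left-hand side is a terminal, so the lexer's token table could be generated rather than maintained by hand.

    my @rules = (
        [ stmt => [ qw( PRINT expr print_body ) ] ],
        [ expr => [ qw( term ) ] ],
        # ... the rest of your rule set ...
    );

    # Anything used on a right-hand side but never defined as a
    # left-hand side must be a terminal the lexer has to produce.
    my %is_lhs = map { $_->[0] => 1 } @rules;
    my %terminal;
    for my $rule ( @rules ) {
        $terminal{$_} = 1 for grep { !$is_lhs{$_} } @{ $rule->[1] };
    }

    # With the full rule set, %terminal is the complete list of token
    # names the lexer must produce -- derived from, and therefore always
    # in sync with, the grammar.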

    and it doesn't worry about tokens other than the current one.

    One example does not a (counter-)argument make.

    More to the point: You extract the next token and pass it to the parser, and the parser rejects it.

    What do you report? What can you report? About all you can say, given your lexer's lack of context, is:

    [source.file:123:31] Numeric literal '123' not expected at this time.

    Not so helpful.

    Whereas the parser could report something like:

    [source.file:123:31] parsing 'while', expecting '('; got '123'

    Which would you prefer?
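
    To sketch it with the variables from your own lexer (assuming, as your commented-out trace line suggests, that $P->terminals_expected is available, and that read() throws on a rejected token -- hence the eval):

    my $ok = eval { $P->read( @$token ); 1 };
    if ( !$ok ) {
        # Ask the recognizer what would have been legal here, and report
        # it with the file/line/column the lexer already tracks.
        my $expected = join ' or ', @{ $P->terminals_expected };
        die sprintf "[%s:%d:%d] got %s '%s'; expecting %s\n",
            $FName, $., $curpos, $$token[0], $$token[1] // '', $expected;
    }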

    Writing character-by-character grammar rules to recognize numbers, keywords, strings, comments, etc. would be a pain in BNF. I don't really look forward to writing a zillion BNF productions to specify what tokens look like character by character. But that sort of thing is trivial for regexes, so I split the lexer out into a simple bit of regex code to create the tokens, and the grammar is relatively straightforward too.

    But why would you? Why not supply your identifier syntax, literal syntax, etc. to the parser as regexes?
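
    Something along these lines is all I'm asking for (a hypothetical sketch, not an existing Marpa interface; %lex_spec and $src are made-up names): the grammar author supplies a regex per free-form terminal, the fixed entries could be generated from the grammar's own terminal list, and a generic loop only tries the tokens that are legal next.

    my %lex_spec = (
        name   => qr/\G([A-Za-z_]\w*)/,
        number => qr/\G(\d+)/,
        string => qr/\G("(?:[^"\\]|\\.)*")/,
        OP_ADD => qr/\G(\+)/,
        PRINT  => qr/\G(print)\b/,
        # ... the fixed entries, one per terminal in the grammar ...
    );

    pos( $src ) = 0;
    TOKEN: while ( 1 ) {
        $src =~ /\G\s+/gc;                          # skip whitespace
        last TOKEN if pos( $src ) >= length $src;
        for my $term ( @{ $P->terminals_expected } ) {
            my $re = $lex_spec{ $term } or next;    # only try tokens legal here
            next unless $src =~ /$re/gc;
            $P->read( $term, $1 );                  # feed the matched token
            next TOKEN;
        }
        die "No legal token matches at column " . pos( $src ) . "\n";
    }

    The generic loop never needs touching when the language changes; only the grammar and the handful of regexes do.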

    I don't anticipate changing anyone's mind immediately. I'm expressing my reasoning for rejecting the module, but that won't make it disappear from CPAN, or stop anyone who wants to use it from doing so.

    The job I'm taking on that requires a real parser is sufficiently long-term and complex that it is worth my trying to avoid the duplication of effort, and the parallel resource maintenance, that I see using Marpa would require.

    Even if that means writing my own parser that generates a lexer as part of the grammar-compile step.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      BrowserUk:

      I've been meaning to follow up on this post for some time, but have been strapped for 'round tuits'. I don't think you wrote it badly; more like I just disagreed. The followup I was going to post was a set of reasons why it's a good idea to split the lexer and parser into separate pieces. However, with the new additions to Marpa, I'm rather hard pressed to think of a good reason. ;^)

      I don't know if you've been following Marpa::R2, but they've been going in the direction of putting tokenizer rules in the grammar, and it looks rather good. If I were to start over on my project, I think I'd do it that way. (In fact, I've been migrating toward the scanless grammar; I just haven't moved my tokenization in there yet. I'm currently working on getting some of the semantic actions worked out.) (As you can tell from the time between posts, this is just a side project, and I've not been working terribly hard on it of late.)
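
      For anyone following along, here's a stripped-down sketch of what that looks like (not my toy-language grammar, just an illustration, and it wants a reasonably recent Marpa::R2): the '~' rules define the tokens right alongside the '::=' structural rules, so there's no separate lexer to keep synchronized.

      use strict; use warnings;
      use Marpa::R2;
      use Data::Dumper;

      # Token ("~") rules and structural ("::=") rules share one source string
      my $dsl = q{
          :default ::= action => ::array
          lexeme default = latm => 1

          Stmt       ::= name '=' Expr
          Expr       ::= Expr '+' Term | Term
          Term       ::= number | name

          name       ~ [A-Za-z_] name_rest
          name_rest  ~ [\w]*
          number     ~ [\d]+

          :discard   ~ whitespace
          whitespace ~ [\s]+
      };

      my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
      my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
      $recce->read( \'total = total + 42' );
      print Dumper ${ $recce->value };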

      So if you're still working on your project where you need a good parser engine--or are going to start another--I'd encourage you to play with Marpa for a couple of hours to see if it works well for you.

      ...roboticus

      When your only tool is a hammer, all problems look like your thumb.

        I don't know if you've been following Marpa::R2, but they've been going in the direction of putting tokenizer rules in the grammar, and it looks rather good.

        No, I haven't. From my brief interactions with the author I had no expectation that Marpa would ever change in any way that would cause me to reconsider it.

        From a cursory scan of the R2 docs, it does seem that somewhere amongst the nearly two dozen modules that make it up there might be something that starts to look like it might do the job. But it is really hard to tell, given that the two examples are:

        • an expression parser that uses about the same number of source lines as I did for my 6-function/3-precedence-level expression parser -- on top of those two dozen modules -- to provide a 2-function/1-precedence-level expression parser.
        • The other deals with one of those completely pointless, meaningless "languages" that do absolutely nothing, of which parser theorists are so enamored.

        If I ever find a working example of a block-structured language done using Marpa, I might look again; but I won't hold my breath.

        Unfortunately, the documentation doesn't seem to have improved. It still spends an inordinate amount of time telling me how clever the parser is, and almost none showing me how to use it to do something realistic and useful.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.