beppu has asked for the wisdom of the Perl Monks concerning the following question:

I've been using Parse::RecDescent for the past few months, and I've been hitting some of its limits. It becomes way too slow when parsing large (64KB+) scalars. Don't even think about handing it a 1MB scalar. You will wait for hours for it to finish. That being said, I still think Damian Conway is the man, but I need to find a faster solution.

I've decided on Parse::Yapp which seems to be modeled after yacc. There is nothing like lex that goes with it, though -- you're supposed to provide your own lexer. Unfortunately for me, I don't have much in the way of formal knowledge when it comes to parsing. Using Parse::RecDescent was my first experience with actually writing a grammar and systematically parsing text. ...and Parse::RecDescent kinda meshes the lexing and parsing parts together.

this is where I get confused

I don't know where to draw the line between lexer and parser. My understanding is that a lexer is supposed to go through data and make tokens out of it to feed to the parser. The parser has higher level knowledge about these tokens in the form of a grammar and tries to make sense of this stream of tokens. That seems simple enough at first, but when I try to think about how I would write a lexer, it seems as if the lexer needs to know a little about the grammar.

Let me give a contrived example.

my $x = "my $x = \"my $x\"";

In this case, qmy $x occurs three times in two different contexts. Once as a perlop and variable, and the other two times as a quoted string. If any of you out there have any experience with writing lexers, help me out here.

What would be the stream of tokens returned by the above line of code assuming Perl semantics? I have never written a lexer so it'd be nice if I could get an example with some rationale. What I'm trying to find is what the right balance of responsibilities is between a lexer and a parser. I'd be especially interested in what goes on inside the quoted string. Is it all one big token, or is it split into a bunch of little tokens for the parser to assemble together into something coherent.

If you've read this far, thanks for your patience.

Replies are listed 'Best First'.
Re: The Relation Between Lexers and Parsers
by rlk (Pilgrim) on Jan 20, 2001 at 03:25 UTC
    Coincidentally enough, I've been working on a lexer/parser combo (in C++) for one of my classes, so I guess I'll take a stab at answering this.

    The lexer has to be smart enough to recognize quoted strings as individual elements, consequently it would return something along the lines of

    ('my', '$x', '=', 'my $x = \"my $x\""', ';')
    Some lexers might even be smart enough to classify tokens, and return something like this:
    ([ 'MY' => 'my'], ['VAR' => '$x'], ['OPERATOR', '='], ['STRING' => 'my + $x = \"my $x\""'], [TERMINATOR => ';']).
    To clarify, the lexer needs to understand the nature of the tokens that it generates, at a minimum knowing things such as which characters can be used in an identifier, and what the quoting rules are. (Quick, what does perl -e 'for qw(foo bar) { print $_ }' do?) The parser's knowledge relates to what order tokens can legally come in, and what different combinations of tokens mean.

    If you're interested in reading more about specifically parsing perl, you should read merlyn's On Parsing Perl.

    --
    Ryan Koppenhaver, Aspiring Perl Hacker
    "I ask for so little. Just fear me, love me, do as I say and I will be your slave."

      Thanks. This was just what I was looking for. FYI, my problem doesn't involve parsing perl (thank $deity). It's a silly format that looks like XML but isn't. It was made up by some silly mormons.

      Embedix::ECD

        If you need to parse something similar to XML maybe you can use a modified version of Robert Cameron's REX::XML, which parses XML using regular expressions, a very nice piece of work BTW

Re: The Relation Between Lexers and Parsers
by cephas (Pilgrim) on Jan 20, 2001 at 10:36 UTC
    Here's a quick non OO lexer that is very similar to the lexer Damian conway creates in his book Objected Oriented Perl (in the book, he blesses a subroutine that is almost exactly like this, and then adds some utility methods to go along with it.) This will break my $x = "my $x = \"my $x\""; into its components (keyword, variable, quoted string, and semicolon)
    Note: all you have to do is define your tokens in terms of regular expressions.

    #!/usr/bin/perl -w use strict; my $tokens = { 'KEYWORD' => 'my', 'QSTRING' => '"(?:[^"]|\\")*"', 'VAR' => '\$\w+', 'SEMI' => ';', }; open(FH,"$ARGV[0]") || die("Error opening $ARGV[0]: $!\n"); undef $/; my $text = <FH>; close(FH); my $lexer = &create_lexer($tokens); { my @token = &$lexer($text); last unless(defined($token[0])); print("Got: $token[0] => $token[1]\n") if(defined($$tokens{$token[0]} +)); redo; } sub create_lexer { my($tokens) = shift; my($sub,$code,$token,$key); foreach $key (keys(%$tokens)) { $code .= '$_[0] =~ s/\A\s*?(' . $$tokens{$key} . ')// '; $code .= "&& return('$key'," . '$1);' . "\n"; } $code .= '$_[0] =~ s/\A\s*?(\S)// && return("",$1);'; $code .= "return(undef,undef);\n"; $sub = eval("sub { $code } ") || die($@); return($sub); }


    cephas

    BTW: Did I mention that Damian Conway wrote a great book <bold>Object Oriented Perl</bold> (And just in case there was any doubt, I got my idea for this lexer from him, and not the other way around. :) )
Re: The Relation Between Lexers and Parsers
by MeowChow (Vicar) on Jan 20, 2001 at 10:03 UTC
    This is a great question. (Well, in my mind, any question that I've struggled with is a great question :)

    FWIW, here is a working lexer I wrote for use with Parse::Yapp, for a simple OO class-specification/templating language... if you have any questions about it, let me know:

    sub Lexer { my $parser = shift; defined $parser->YYData->{INPUT} or return('',undef); if (not defined $parser->YYData->{LINE}) { $parser->YYData->{LINE} = 0; $parser->YYData->{SLINE} = 0; $parser->YYData->{ADDLINES} = 1; } $parser->YYData->{INPUT} =~ s/^[ \t\r]*//; # remove whitespace $parser->YYData->{INPUT} =~ s/^#[^\n]*//; # remove comments for ($parser->YYData->{INPUT}) { s/^([\n{}:])// and do { return($1,$1); # character tokens }; s/^(\w+)\b// and do { if (exists $modifiers{$1}) { return('MODIFIER', $1); } elsif (exists $keywords{$1}) { return('KEYWORD',$1); } else { return('IDENT', $1); } }; s/^('(.*?)')// and do { return('STRLIT',$2); }; s/^("(.*?)")// and do { return('STRLIT',$2); }; s/^(<{(.*?)}>)//s and do { my $heredoc = $2; $parser->YYData->{ADDLINES} += () = $heredoc =~ m/\n/g; return('HEREDOC',$heredoc); # woohoo i love heredoc's! }; } }
Re: The Relation Between Lexers and Parsers
by cephas (Pilgrim) on Jan 20, 2001 at 03:46 UTC
    You should probably take a look at Parse::Lex....

    Also writing a small lexer in perl can be fairly trivial depending on your requirements (do you want to actually remove the items from the input stream or do you want to leave them, support regular expressions or not,etc...) Damian Conway actually gives an example of building an object oriented lexer in his book Object Oriented Perl.

    cephas
Re: The Relation Between Lexers and Parsers
by t'mo (Pilgrim) on Jan 20, 2001 at 07:55 UTC

    There may be other references, but as far as I know, the reference on this sort of thing is this:

    <cite>Compilers, Principles, Techniques, and Tools; Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman.</cite>

    (Be prepared for lots of reading, though. And it would probably help to pick up some knowledge of the theory of computation. :-)

Re: The Relation Between Lexers and Parsers
by beppu (Hermit) on Jan 21, 2001 at 21:17 UTC
    I'd just like to thank everyone who replied. I'll be looking over your suggestions over the next few days. Oddly enough, I have Damian Conway's OO-Perl book, but I never bothered to really read it -- I just skimmed through it and thought to myself, "damn, he's smart".

    To rlk, mirod, cephas, t'mo, and MeowChow -- thanks for your help.