in reply to How would you parse this?
After reading the replies, I think that I may have a feel for what you are hoping to do. My initial response to that desire was pretty much "Use any of the standard techniques that they taught in 'Compiler Writing' class". So the rest of what I'm writing is based on the supposition that you have never taken such a class. If you have, then I'm pretty sure that I've seriously misunderstood what you are asking for and you should probably just ignore this reply. Though, I end up advising against most of those standard techniques below.
My guess is that the approach that would most easily suit you is writing a recursive-descent parser.
I wouldn't suggest that you use a framework (such as Parse::RecDescent or marpa). There are technical reasons why such frameworks can have disadvantages (that I find many gloss over in trying to discourage the re-inventing of wheels by those just wanting to move their vehicle and not focused on and perhaps not well suited for actual designing of excellent wheels). Some of the reasons have to do with personality traits that I suspect that you share with me.
But the biggest reason is that it just really isn't terribly hard to write a decent rec-descent parser from scratch, especially in Perl. I'm pretty sure that your ample general programming skill is well above that required to succeed at this. There is a chance that there will be certain aspects of the task that will seem foreign, but I don't expect many or significant difficulties greater than that for you.
I'm not going to try to teach you (that would surely end badly) or even other readers how to write such a parser. I don't have any handy links to particularly good resources or tutorials that cover this topic (and I suspect such might not be a great route forward for this case anyway). I'm not going to try to write such a parser in this node.
But I have recently been looking at JSON::Tiny because it actually comes pretty close to how I would have implemented a JSON parser. Now, JSON is likely trivial compared to what you are hoping to parse. But JSON::Tiny is a decent example of how you build a rec-descent parser. So go browse http://cpansearch.perl.org/src/DAVIDO/JSON-Tiny-0.35/lib/JSON/Tiny.pm (the parts that have to do with parsing, of course).
You'll see the basic idea of having a function whose job is to consume the text that represents some structure of your grammar. And that function calls other functions that consume sub-structures. JSON::Tiny's _decode_value(), _decode_array(), _decode_object(), _decode_string(), and decode() are just such subs. And they are simple and working code that parses something that is easy to understand and so I hope they make it easy to "get" the concept of rec-descent parsing.
So you'll write _parse_program() which will need to call things like _parse_include() (my guess at what the first few lines in your example represent) and _parse_statement(). For most languages, the structure of each such routine ends up closely matching a BNF definition in a grammar describing your language. So you may end up writing something like BNF as you work out the definition of this language and/or the implementation of your parser.
The way that writing working code can also serve to create/refine the definition/design of your grammar is part of why I think this approach may work well for you.
When I try to reverse-engineer what are the key markers that indicate that those first few lines are supposed to be the equivalent of "#include" or use lines/statements, I think I'm close to guessing wildly. In C, you know such a thing because the lexer gave the following tokens in exactly this order: start-of-line, '#', 'include'. In Perl, you know such because you were expecting the start of a statement when you found the token "use". These are typical of how such items of syntax are defined and parsed in an LALR(1) language or similar language. "LALR(1)" just means that you can tell what type of broad syntaxy construct you have started parsing by strictly moving forward in the text and only looking at the next 2 tokens (Looking Ahead, Left-to-Right, 1 token past the next). You never have to "back up" more than one token. That's why a C parser needs to know which words represent data types vs. other things and why Perl has prefixing keywords like 'sub'.
So you may not be dealing with a language that fits in the LALR(1) world. This is another reason why an existing tutorial or existing framework isn't what I'm recommending. I won't discourage you from "straying" beyond LALR(1). In writing the parser, you'll see where the language is "too ambiguous" and adjust it.
One of the challenges in writing a parser (including rec-descent), is how to implement "consume the text". The classic solution is to write a lexer that returns tokens and give yourself a way to look at the next 2 tokens (without consuming them) or a way to "unconsume" at least the last 2 tokens one token that you consumed. But you don't have to do it that way (and I don't think you should start with that for this case).
The easiest solution is what JSON::Tiny does: read the whole document into a buffer and run regexes over it always using /\G.../gc in a scalar context (so pos tracks your progress in "consuming"). You may want to, fairly early in the development, encapsulate the applying of /.../gc to the buffer if you ever expect to parse documents that don't easily fit in RAM.
I also recommend completely avoiding capturing parens as they kill the performance of this approach, as can be seen by using Parse::RecDescent. Nevermind! (see replies)
At long last, I'll give you some custom concrete code to demonstrate one place JSON::Tiny gets this wrong (IMHO) and how I'd do it better. JSON::Tiny includes:
sub _decode_string { my $pos = pos; # Extract string with escaped characters m!\G((?:(?:[^\x00-\x1f\\"]|\\(?:["\\/bfnrt]|u[0-9a-fA-F]{4})){0,3276 +6})*)!gc; # segfault under 5.8.x in t/20-mojo-json.t #83 my $str = $1;
Disadvantages of that code:
A big improvement, IMHO, is more like:
sub _decode_string { my $pos = pos(); m{\G(?:[^\\"]+|\\.)*}gc; my $str = substr( $_, $pos, pos()-$pos ); Fail("Unclosed string",$pos) if ! m{"}gc;
Then go through parsing, validating, replacing, and complaining about escapes after that.
If you decide to deal with "won't fit in memory" documents, then parsing string literals might look more like:
sub parse_value { if( peek( qr{['"]} ) ) { parse_string(); ... sub parse_string { my $start = get_pos(); # now a seek position, not a string offset my $str = ''; if( eat("'") ) { # Inside '...', '' is a literal ', all else is just literal: while( ! peek("'(?!')",1) ) { # ^ keep leading whitespace if( eat("''",1) ) { $str .= "'"; } else { swallow("[^']+",1,\$str,1); # fatal if no match # Can match to $Buf end ^ } } swallow("'",1); } else { swallow('"'); while( peek('[^"]',1) ) { while( peek('\\',1) ) { $str .= parse_escape(); } eat( qr{[^\\"]*}, 1, \$str, 1 ); } swallow('"',1,'',0,'Unclosed string',$start); } return $str; }
Here's a stab at most of the "consume the document stream" functions used above (nothing tested, in case that wasn't as obvious as I think it would be):
my $FH; my $Buf; # Consume matching text, if found next: sub eat { my( $re, # Regex or string to match against next part o +f doc. $keep_ws, # Don't skip over whitespace first? $sv, # Append matched text to referenced scalar (if + any). $to_end, # Allow match to hit end of $Buf? $peek, # Keep pos() unchanged? ) = @_; $Buf =~ /\G\s+/gc if ! $keep_ws; my $start = pos( $Buf ); $re = _compile( $re ); # Prepends \G also do { return 0 if $Buf !~ /$re/gc; } while( ! $to_end && _hit_end(\$start) ); $$sv .= substr( $Buf, $start, pos($Buf)-$start ) if $sv; pos($Buf) = $start if $peek; return 1; } # Tell me if next text matches (without consuming, except maybe whites +pace): sub peek { my( $re, # Regex or string to match against next part o +f doc. $keep_ws, # Don't skip over whitespace first. $sv, # Append matched text to referenced scalar (if + any). $to_end, # Allow match to hit end of $Buf. ) = @_; return eat( $re, $keep_ws, $sv, $to_end, 1 ); } # Consume matching text! If not found next, then die: sub swallow { my( $re, # Regex or string to match against next part o +f doc. $keep_ws, # Don't skip over whitespace first. $sv, # Append matched text to referenced scalar (if + any). $to_end, # Allow match to hit end of $Buf. $err, # Error name if no match. @args, # Extra details for above error. ) = @_; return 1 if eat( $re, $keep_ws, $sv, $to_end ); throw( $err ||= 'Internal bug in parser', @args ); } sub _compile { my( $re ) = @_; our %Compiled; return $Compiled{$re} ||= qr/\G$re/; } # Keep sliding window of document in $Buf: sub _hit_end { my( $sv_start ) = @_; my $pos = pos( $Buf ); return 0 if $pos < length($Buf)-1024; # Darn! Match got too close to end of $Buf. if( 2048 < $$sv_start ) { # Copy to left as little of $Buf as reasonable: my $skip = $$sv_start - 1024; $$sv_start -= $skip; # This way may just keep consuming memory: # substr( $Buf, 0, $skip, '' ); $Buf = substr( $Buf, $skip ); } sysread( $FH, $Buf, 1024*1024, length($Buf) ); # Tell caller to re-do match; have it compare at same part of $Buf + again: pos($Buf) = $$sv_start; return 1; }
I sincerely hope some of that was helpful.
- tye
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re^2: How would you parse this? (bug)
by tye (Sage) on Oct 26, 2013 at 16:18 UTC | |
by BrowserUk (Patriarch) on Oct 26, 2013 at 17:18 UTC | |
|
Re^2: How would you parse this? (placeholder for a more considered reply.))
by BrowserUk (Patriarch) on Oct 26, 2013 at 10:19 UTC | |
|
Re^2: How would you parse this? (rec-descent)
by smls (Friar) on Oct 26, 2013 at 21:26 UTC | |
by tye (Sage) on Oct 27, 2013 at 02:31 UTC |