tj_thompson has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,

My ongoing Marpa adventures bring me back with some basic questions. I have traced it back to this test case:

use strict;
use warnings;
use Marpa::R2;
use Data::Dumper;

# Example 1
{
    my $g_str = <<'END_GRAMMAR';
:start ::= data
data ::= '12' '3'
END_GRAMMAR
    my $g_obj = Marpa::R2::Scanless::G->new({ source => \$g_str });
    my $p_obj = Marpa::R2::Scanless::R->new({
        grammar         => $g_obj,
        trace_values    => 1,
        trace_terminals => 1,
    });
    my $s = '12';
    $p_obj->read( \$s );
    print "EXAMPLE 1 VALUE:" . Dumper( $p_obj->value ) . "\n";
}

# Example 2
{
    my $g_str = <<'END_GRAMMAR';
:start ::= data
data ::= many_a 'A'
many_a ~ [A]+
END_GRAMMAR
    my $g_obj = Marpa::R2::Scanless::G->new({ source => \$g_str });
    my $p_obj = Marpa::R2::Scanless::R->new({
        grammar         => $g_obj,
        trace_values    => 1,
        trace_terminals => 1,
    });
    my $s = 'AA';
    $p_obj->read( \$s );
    print "EXAMPLE 2 VALUE:" . Dumper( $p_obj->value ) . "\n";
}

In example 1, given that the input string is only '12', but the grammar specifies two tokens ('12' and '3'), I would expect the $p_obj->read call to fail to parse. However, no failure occurs.

I do see the first lexeme being accepted in the trace output. But I do not see the second '3' lexeme being accepted or rejected. The $p_obj->value result is undefined, which would make sense if the read did not complete.

If a '3' is appended to the input string, the $p_obj->value call then shows the expected return value.

In example 2, the first lexeme that will accept multiple 'A' characters matches both, leaving none for the second lexeme requiring a single 'A' character. The end result here seems to be the same. The $p_obj->read call does not fail, but the value ends up as undef.

This seems to imply a greedy match on the first lexeme that will not allow the second lexeme to match. However, I would again expect a grammar failure during the read call.

So on to my questions. 1) Why does the $p_obj->read call not fail in these two cases?

As always, many thanks for the time and insight :)

EDIT: added in the read calls I left out from my copy and paste as Amon pointed out below.

EDIT2: As this question is very specific to Marpa grammar and behavior, and less about Perl, I've posted a variant of it, based on the feedback below, to the Google Marpa group here: https://groups.google.com/forum/#!topic/marpa-parser/fZzhxdBDbGk. I sincerely apologize if this is considered bad posting etiquette, but at this point I believe that is probably the more appropriate forum for the question. I will monitor both and make sure that what I learn there is posted here, in hopes that it will help others.

EDIT3: See the above link. It turns out that assuming the input string matched the grammar just because the read method did not throw an error is incorrect.

EDIT4: As for example 2 above, Jeffrey was able to offer some insight again. The lexer (rules defined with '~') is greedy by nature. Structural rules (defined with '::=') are much better at handling ambiguity. His solution was to move part of the logic into a structural rule. See the post here: https://groups.google.com/forum/#!topic/marpa-parser/6jgQj-MOLGM
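A minimal sketch of that restructuring (this is my reading of the linked post, not code copied from it; the lexeme name a_char is illustrative): the repetition moves from a '~' lexical rule into a '::=' structural rule, so the parser, rather than the greedy lexer, decides where many_a ends.

```perl
use strict;
use warnings;
use Marpa::R2;

# Repetition moved out of the lexer: many_a is now a structural ('::=')
# sequence of single-'A' lexemes, so the parser decides how many it takes,
# leaving one 'A' for the second part of the rule.
my $g_str = <<'END_GRAMMAR';
:start ::= data
data   ::= many_a 'A'
many_a ::= a_char+
a_char   ~ 'A'
END_GRAMMAR

my $grammar = Marpa::R2::Scanless::G->new( { source => \$g_str } );
my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
my $input   = 'AA';
$recce->read( \$input );

# With no actions defined, a successful parse yields a reference to undef,
# which is still a defined value.
print defined $recce->value ? "parse succeeded\n" : "no parse\n";
```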

Replies are listed 'Best First'.
Re: Marpa -- Partial grammar match does not result in failure
by amon (Scribe) on May 08, 2014 at 01:45 UTC

    First, let us realise that your examples don't actually perform a ->read(\$s), but we'll just assume it's somewhere in there.

    Secondly, let's be clear about the behaviour of read() and value(): read() returns the position in the input stream where scanning ended. It will throw an error if the input couldn't be parsed e.g. because no expected lexeme matched the input at the current location. It will not throw an error if the input simply has been exhausted. That is, it only throws an error if there was a problem with respect to the low-level grammar which represents the lexer.
    The value() method will return either undef if there was no successful parse, or a reference to the result if the parse was successful. As you do not currently specify any semantics, the result would be undef, so we'd end up with a reference to undef in the case of a successful parse. That is, this indicates the success of the high-level grammar which does the actual parsing.

    Now let's look at the first example. Here, the input is accepted by the low-level grammar, as the token '12' can be matched. Therefore, read() succeeds. However, the high-level grammar also requires the '3' token, which the input doesn't contain. As the input has ended, the high-level grammar will fail, and value() will be undef, indicating a parse failure.

    Now the second example. Here, the low-level grammar provides the lexemes many_a ~ [A]+ and the anonymous 'A'. Due to longest-token matching, the input 'AA' will be completely consumed by the many_a lexeme. This token is accepted by the high-level grammar. Now the input is exhausted, so the token 'A' will not be found.

    For efficiency, Marpa never backtracks the way Perl regexes do. In particular, Marpa's low-level grammar is equivalent to regular expressions (the computer-science kind, not Perl regexes), which can be recognized in linear time. I.e., it is efficient as hell, but not very expressive. If you need more features, you must use the high-level grammar and resolve the ambiguity there, or switch to manual lexing (via pause events … but you'll come to that later).

    If you want a successful parse to have a more interesting value(), just add a :default ::= action => [name,values] to your grammar, which will create an array ref for each rule.

      First, a personal thanks for taking the time to respond, amon. It's clear you have a strong understanding of Marpa, and your responses to this question and my previous ones have been exceedingly useful. I have been reading through the documentation, but I am finding it difficult to find what I need, as there is much to absorb.

      You are correct that the 'read' call was in my code, it was missed in my copy/paste. I have edited the main post to reflect this change.

      I had assumed that the behavior here was just something that I did not understand, not that Marpa was incorrect or faulty. Your response confirms this. So a couple of followup questions.

      1) Given this information, it would seem to me that you should be able to differentiate between the parser exhausting the input as opposed to successfully parsing the entire grammar. Is there some way to determine this?

      After a bit of thought, I'm assuming that determining what constitutes 'parsing the entire grammar' would be a difficult problem in and of itself considering the way a parse tree is created. Exhausting the input string is probably a reasonable indicator of a grammar having been parsed successfully.

      2) How would you resolve the problem in example 2 assuming this was part of a more complex grammar? Considering you are locked into a greedy match behavior on a per token basis, this seems difficult to resolve.

        It appears that it is possible to differentiate a failed match vs. an exhausted input string. If the recognizer specifically rejects the string as unable to match, it will throw an error. If the input string is exhausted, but has conformed to the grammar requirements up till that point, the recognizer instead pauses the parse. This allows for the opportunity to add to the input string and resume parsing.

        However, if the full grammar has not been matched and the recognizer is paused, the 'value' method will return undef. An undef return from 'value' means there was no parse result, i.e. a failure.

        This is pretty much exactly what Amon was telling me above but I didn't quite put things together until I understood that you could pause the parsing, add to your input, and resume parsing.

        The short story: don't rely on an error from the recognizer to determine whether the grammar was matched. An error clearly indicates a failed match, but no error can mean either 'input string was exhausted, but does not yet match fully' or 'grammar was matched successfully'. To differentiate these, ensure the value returned from the 'value' method is defined. Thanks to the Marpa author Jeffrey for the insight on this point: https://groups.google.com/forum/#!topic/marpa-parser/fZzhxdBDbGk
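        To make the distinction concrete, here is a minimal sketch (a stripped-down variant of example 1 from the original post; the grammar and method calls come from the posts above, the rest is illustrative):

```perl
use strict;
use warnings;
use Marpa::R2;

my $g_str = <<'END_GRAMMAR';
:start ::= data
data ::= '12' '3'
END_GRAMMAR

my $grammar = Marpa::R2::Scanless::G->new( { source => \$g_str } );
my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );

my $input = '12';          # matches only the first of the two tokens
$recce->read( \$input );   # no error: the lexer itself had no problem

# The definedness of value() is what distinguishes the two no-error cases.
if ( defined $recce->value ) {
    print "full grammar matched\n";
}
else {
    print "input exhausted without a complete parse\n";
}
```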

      After some further testing, I still do not see the value being updated in the case of the partial (but apparently successful) parse. I have added the :default rule you suggest above to two cases. Example 1 has been modified so the input matches all tokens in the grammar ('12' and '3'), and it does return the expected 'name, value' structure. Example 2, however, retains the original input that only matches the first token ('12'). If the parse were successful, I would expect at least the matched tokens to appear in the value data, but they do not. I have taken the time to read through the read and value documentation you linked above, but it does not seem to explain the results I see here (or at least I'm not making sense of it).

use strict;
use warnings;
use Marpa::R2;
use Data::Dumper;

# Example 1
{
    my $g_str = <<'END_GRAMMAR';
:default ::= action => [name, value]
:start ::= data
data ::= '12' '3'
END_GRAMMAR
    my $g_obj = Marpa::R2::Scanless::G->new({ source => \$g_str });
    my $p_obj = Marpa::R2::Scanless::R->new({
        grammar         => $g_obj,
        trace_values    => 1,
        trace_terminals => 1,
    });
    my $s = '123';
    $p_obj->read( \$s );
    print "EXAMPLE 1 VALUE:" . Dumper( $p_obj->value ) . "\n";
}

# Example 2
{
    my $g_str = <<'END_GRAMMAR';
:default ::= action => [name, value]
:start ::= data
data ::= '12' '3'
END_GRAMMAR
    my $g_obj = Marpa::R2::Scanless::G->new({ source => \$g_str });
    my $p_obj = Marpa::R2::Scanless::R->new({
        grammar         => $g_obj,
        trace_values    => 1,
        trace_terminals => 1,
    });
    my $s = '12';
    $p_obj->read( \$s );
    print "EXAMPLE 2 VALUE:" . Dumper( $p_obj->value ) . "\n";
}

      Output:

Setting trace_terminals option
Setting trace_values option
Lexer "L0" accepted lexeme L1c1-2: '12'; value="12"
Lexer "L0" accepted lexeme L1c3: '3'; value="3"
EXAMPLE 1 VALUE:$VAR1 = \[
          'data',
          '12',
          '3'
        ];
Setting trace_terminals option
Setting trace_values option
Lexer "L0" accepted lexeme L1c1-2: '12'; value="12"
EXAMPLE 2 VALUE:$VAR1 = undef;

      In example 2, given there is now a :default rule for returning match data from a structural rule, and that the parse was successful, why is there no output from value?

      My guess is that since the 'data' rule requires both '12' and '3' to match, 'data' was not successfully matched. That means there is no return value for it, resulting in an empty value data structure. But this sounds an awful lot like a grammar parse failure that should cause the ->read call to fail.

      EDIT: As it turns out, I am making the faulty assumption that because the read method did not throw an error, the parse was successful. This is not necessarily true. See my post above Re^3: Marpa -- Partial grammar match does not result in failure

Re: Marpa -- Partial grammar match does not result in failure
by BrowserUk (Patriarch) on May 08, 2014 at 00:41 UTC
    1) Why does the $p_obj->read call not fail in these two cases?

    That's the archetypal problem with libraries this complex: they have so many failure modes for any given input that it is impossible to imagine them all, much less test them all, or even a statistically significant subset of them.

    Upshot: despite the usually -- and in this case, definitely -- effusive and over-elaborate claims, they are black holes to the user. I.e., their users have no hope of understanding, much less correcting, any failure mode that the author hasn't considered.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      While you are generally correct that complex software is complex, and thus invariably contains bugs, in this case the behaviours of read() and value() are well-defined and consistent with the observed behaviour – it's all there in the docs.

      Understanding them is often difficult, and examples are sparse, but it's by no means an impenetrable black hole of knowledge. Please don't spread ignorant FUD and don't assume the worst by default, sometimes there actually is a good explanation.

Re: Marpa -- Partial grammar match does not result in failure
by sundialsvc4 (Abbot) on May 08, 2014 at 12:38 UTC

    I am not so familiar with the Marpa parser-suite ... yet! ... but from my experience with other tools I find this production to be questionable:

    data ::= many_a 'A'
    many_a ~ [A]+

    This appears to me to be a “head-recursive rule,” which would (might? .. here and elsewhere; see bullet-points at bottom) cause a parse to drop down a rabbit-hole.   In other words, it consists of a production that will either (as the case may be) recurse endlessly, or match endlessly, followed by a production ('A') that now cannot be matched because what is supposed to be the terminal-token has already been consumed.   If the rule calls for “a list of 'A's, followed by 'A',” then the first part will consume all of the 'A's, leaving no 'A' to match the final part of the rule, and so the entire rule will fail.

    Contrast this to an equivalent “tail-recursive” rule, which I will now express in a nomenclature that is more-familiar to me:

    <a_list> ::= 'A' <a_list>*
    This rule (in BNF) says that an <a_list> consists of the letter 'A', optionally followed by zero-or-more <a_list>s.   The lexer will be smart enough to know what this really means and how to handle it efficiently.   The key difference is that, as soon as that first letter 'A' is seen and matched, the rule is accepted, no matter how many letter 'A's the input contains.

    Like I said, I am not too familiar with the Marpa tool yet, although I will make it a point to do so.   But it also seems to me that there might be a conceptual issue in your plans with regard to the respective role of the lexer and the parser in such a system.   The lexer’s only purpose is to assemble a string of characters into a stream of tokens, which is then fed to the parser, which therefore never has to deal with characters.   Some tools blur the distinction between these two concerns, but the two concerns are still there.

    The “tail-recursive rule” issue will be present at both conceptual levels, lexer and parser.   In some classic systems, like lex/yacc, an error in this area would cause, for example, the famous shift/reduce conflict message.

    Now, having said all that, I see the following notes on the MARPA website (http://jeffreykegler.github.io/Marpa-web-site/):

    • Marpa parses anything you can write in BNF, including ambiguous and even infinitely ambiguous grammars.   (!!)
    • Marpa easily and efficiently handles both left- and right-recursion.
    • If a grammar is in one of the classes in practical use today, Marpa parses it in O(n) (linear) time.   Marpa never goes exponential.   Worst case, even for wildly ambiguous grammars, is O(n³) (cubic) time.
    • Marpa's run-time error detection is revolutionary.   Marpa has complete situational awareness.   It knows at all times which rules it is attempting to apply, how far it has progressed in them, and exactly which tokens it can accept.   And Marpa can communicate its situational awareness back to the application.
    • Marpa allows the application to efficiently retry rejected input. This, combined with its situational awareness, allows “Ruby Slippers” parsing:   When an application has a token rejected, it can ask the parse engine which tokens would be acceptable. It can then create a virtual token that allows the parse to continue – the application can make the parse engine's "wishes" come true.  The Ruby Slippers are easy to use and surprisingly wide in their application.

    All in all, therefore, a remarkably-different parser, but also one that would seem to impose a lot of requirements to go along with its abilities.   (The last two bullet-points, especially.)   Verr-r-r-ry interesting . . .   (And I find myself idly pondering ... do I want my parser to be “incredibly smart,” or would I prefer it to be fast-but-stupid?)   :-/