Cannot get Marpa::R2 to prioritise one rule over another

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a Marpa::R2 parser that is attempting to differentiate between IP addresses and hostnames without a difference in leading keywords. The actual grammar I am using is too complicated to replicate here, but a minimal reproducible example of the same problem is below:

#!/usr/bin/env perl
use warnings;
use strict;

use Data::Dumper;
use Term::ANSIColor qw(:constants);
use Marpa::R2;

my $rules = <<'END_OF_GRAMMAR';
    lexeme default  = latm => 1
    :default        ::= action => [name,values]
    :start          ::= <entry>

    <entry>         ::= <op> (SP) <hostaddr4>
    <op>            ::= 'add' | 'remove'
    
    <ipv4>          ::= NUMBER ('.') NUMBER ('.') NUMBER ('.') NUMBER
    <hostname>      ::= NAME
    
    <hostaddr4>     ::= <ipv4> | <hostname>
    
    SP              ~ [\s]+    
    NAME            ~ [\S]+
    NUMBER          ~ [\d]+
END_OF_GRAMMAR

my $input = <<'END_OF_INPUT';
add 192.0.2.1
add www.example.org
remove 192.0.2.2
END_OF_INPUT

my $grammar = Marpa::R2::Scanless::G->new({source => \$rules});

for (split /^/m, $input) {
    chomp;
    if (length $_) {
        print "\n\n$_\n";
        
        my $recce = Marpa::R2::Scanless::R->new({
            grammar => $grammar, 
            ranking_method => 'rule'
        });
        
        eval { $recce->read(\$_ ) };
        print ($@ ? (RED . "$@\n") : GREEN);
        print $recce->show_progress(), "\n";
        print Dumper($recce->value), "\n\n", RESET;
    }
}

From what I can tell, Marpa always picks the <hostname> form of the grammar, even on lines that look more like IPs. I assume this is because the character class [\S]+ also includes the characters which make up an IP address.

So far, in my grammar definition, I've tried:

<hostaddr4>     ::= <ipv4> | <hostname>

<hostaddr4>     ::= <ipv4> || <hostname>

<hostaddr4>     ::= <hostname> | <ipv4>

<hostaddr4>     ::= <hostname> || <ipv4>

<hostaddr4>     ::= <ipv4>      rank => 2
                  | <hostname>  rank => 1


<hostaddr4>     ::= <ipv4>      rank => 1
                  | <hostname>  rank => 2


<hostaddr4>     ::= <ipv4>      rank => 1
<hostaddr4>     ::= <hostname>  rank => 2

<hostaddr4>     ::= <hostname>  rank => 1
<hostaddr4>     ::= <ipv4>      rank => 2

...and none seem to make a difference. They all yield the ['hostname', '192.0.2.1'] array.

The only thing that does it is removing the <hostname> alternate from <hostaddr4> (which does not match the grammar of the data I am parsing), and then the representation changes to ['ipv4', '192', '0', '2', '1']

Can anyone advise the correct approach in this (seemingly) simple case?

Re: Cannot get Marpa::R2 to prioritise one rule over another by Discipulus (Canon) on Jan 21, 2021 at 07:57 UTC
Hello J.

UPDATE: I probably missed your point. Now I see that the IP is never reached: it is always treated as if it were a hostname. I'll think a bit more about it. The problem seems to be fixed if you put `NAME ~ [\D]+`, but you will fail again with a hostname like `42.perl.org`. Perhaps a stricter rule definition is needed to tell an IP from a hostname.

Original reply:

My help can be very limited because I still do not understand `Marpa::R2` and I'm just taking my first baby steps. I don't understand the `rank` or the `show_progress` part (at the moment), so I reduced the example to something I know (removing the colors).

Isn't the `hostname` coming from your `:default ::= action => [name,values]`? This is what is proposed in the synopsis, but I find it a bit misleading. If you see "A dice roller system with Marpa::R2" and its prequel "First steps with Marpa::R2 and BNF", you will see that an anonymous hash is used and populated during the parsing phase. Maybe you can use a pattern like this (then you can check the validity of an IP or of a hostname in a distinct part of the code).

I ended up with the following code, which seems to produce the expected result:

Read more... (2 kB)

L*

There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: Cannot get Marpa::R2 to prioritise one rule over another by Discipulus (Canon) on Jan 21, 2021 at 12:32 UTC
Hello again, maybe a second attempt is better than the first one. I had to specify what a hostname is in an ugly way, but it seems viable. I'm going mad trying to understand why the dot `.` is passed in for IPs and not for hostnames! (Because ip ends with an action?)

#!/usr/bin/env perl
use warnings;
use strict;

use Data::Dump;
use Marpa::R2;

my $rules = <<'END_OF_GRAMMAR';
    lexeme default = latm => 1
    :default    ::= action => ::first

    entry       ::= op hostaddr4                    action => dump_entry
    op          ::= 'add'                           action => add_op
                  | 'remove'                        action => add_op
    hostaddr4   ::= hostname | ipv4
    hostname    ::= DOMAIN EXT                      action => add_hostname
                  | DOMAIN DOMAIN EXT               action => add_hostname
                  | DOMAIN DOMAIN DOMAIN EXT        action => add_hostname
    DOMAIN      ::= NAME '.'
    NAME        ~ [\d\w]+
    EXT         ~ 'org' | 'net'
    ipv4        ::= NUMBER '.' NUMBER '.' NUMBER '.' NUMBER   action => add_ip
    NUMBER      ~ [\d]+
    :discard    ~ SP
    SP          ~ [\s]+
END_OF_GRAMMAR

my $input = <<'END_OF_INPUT';
add example.org
add www.perl.org
add 42.perl.net
add 192.0.2.1
remove 192.0.2.2
END_OF_INPUT

my $grammar = Marpa::R2::Scanless::G->new({source => \$rules});

for (split /^/m, $input) {
    chomp;
    if (length $_) {
        print "\nPARSING: $_\n";
        my $recce = Marpa::R2::Scanless::R->new({
            grammar => $grammar,
        });
        my $value_ref = $grammar->parse( \$_, 'main');
    }
}

sub dump_entry{
    print "dump_entry received: ";
    dd shift @_;
}
sub add_op{
    my $self = shift @_;
    print "add_op received: ";
    dd @_;
    $$self{operator} = join '',@_;
    return $self;
}
sub add_ip{
    my $self = shift @_;
    print "add_ip received: ";
    dd @_;
    $$self{type} = 'IP';
    $$self{value} = join '',@_;
    return $self;
}
sub add_hostname{
    my $self = shift @_;
    print "add_hostname received: ";
    dd @_;
    $$self{type} = 'hostname';
    $$self{value} = join '.',@_;
    return $self;
}

__DATA__

PARSING: add example.org
add_op received: "add"
add_hostname received: ("example", "org")
dump_entry received: { operator => "add", type => "hostname", value => "example.org" }

PARSING: add www.perl.org
add_op received: "add"
add_hostname received: ("www", "perl", "org")
dump_entry received: { operator => "add", type => "hostname", value => "www.perl.org" }

PARSING: add 42.perl.net
add_op received: "add"
add_hostname received: (42, "perl", "net")
dump_entry received: { operator => "add", type => "hostname", value => "42.perl.net" }

PARSING: add 192.0.2.1
add_op received: "add"
add_ip received: (192, ".", 0, ".", 2, ".", 1)
dump_entry received: { operator => "add", type => "IP", value => "192.0.2.1" }

PARSING: remove 192.0.2.2
add_op received: "remove"
add_ip received: (192, ".", 0, ".", 2, ".", 2)
dump_entry received: { operator => "remove", type => "IP", value => "192.0.2.2" }
Re^2: Cannot get Marpa::R2 to prioritise one rule over another by choroba (Cardinal) on Jan 21, 2021 at 17:16 UTC
The dot is ignored because the default action (::first) is used for the DOMAIN rule.

Mixing the lexer and grammar rules is not a good idea; they're very different. Using consistent capitalization for the non-terminals also helps: I usually use a different rule for the grammar ones and the lexer ones.

I usually build the grammar from the top to the bottom, i.e. from the starting symbol to the L0 rules. I start with the default action of `[name,values]` and replace it with individual actions from the bottom to the top. The result might be something like

Read more... (3 kB)

`map{substr$_->[0],$_->[1]||0,1}[\||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^ARGV,3]`
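To make the point about `::first` concrete, here is a minimal plain-Perl sketch. It does not call Marpa at all; `first_child` and `join_children` are hypothetical names that only imitate, respectively, what the built-in `::first` action does (keep the value of the first child) and what a custom action such as `add_ip` in the post above does (receive every child). Both skip the leading per-parse argument, as the subs in the post above do.

```perl
use strict;
use warnings;

# Imitates Marpa's built-in ::first: the rule's value is its first child.
sub first_child {
    my (undef, @children) = @_;   # skip the per-parse argument
    return $children[0];
}

# Imitates a custom action that sees every child, like add_ip above.
sub join_children {
    my (undef, @children) = @_;
    return join '', @children;
}

my $per_parse = {};

# DOMAIN ::= NAME '.' has two children; ::first keeps only the NAME.
print first_child($per_parse, 'www', '.'), "\n";

# ipv4 has seven children and its own action, so the dots are passed in.
print join_children($per_parse, 192, '.', 0, '.', 2, '.', 1), "\n";
```

Under these assumptions the hostname parts arrive at `add_hostname` already stripped of their dots, while `add_ip` receives the dots directly from its own rule.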
Re^3: Cannot get Marpa::R2 to prioritise one rule over another by Discipulus (Canon) on Jan 22, 2021 at 09:02 UTC
Hello choroba, would you be so kind as to explain this in more detail:

> Mixing the lexer and grammar rules is not a good idea, they're very different.

I'm reading the Marpa-R2 vocabulary and I am not able to strictly define them. Where does my code mix them?
Re^4: Cannot get Marpa::R2 to prioritise one rule over another by choroba (Cardinal) on Jan 22, 2021 at 09:14 UTC
Re^3: Cannot get Marpa::R2 to prioritise one rule over another by Anonymous Monk on Jan 21, 2021 at 21:07 UTC
Thanks for demonstrating how to recompose the dotted components of hostnames and IPs, using a custom action. I had been wondering how best to go about that, and you have given me a starting point.

One question, regarding your `concat` subroutine, if I may: Is it possible to generalise it to return the `[rulename,concatted-string]` pair, so it conforms to the tokens emitted by the default action `[name,values]`, or would I have to have a separate subroutine for each rule (and return the rulename literally)?

I had originally thought there might be context in the first argument, which you `shift` over, but that appears to be an empty hashref in all cases I've seen.
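One possible way to avoid hand-writing a subroutine per rule is a closure factory, sketched below in plain Perl. This is only an illustration under stated assumptions: `make_concat`, the `My::Actions` package and the `concat_ipv4` / `concat_hostname` names are all hypothetical, the rule name is still supplied literally (once, at installation time), and the `concat` subroutine from the reply above is not shown in this thread, so this is not a reconstruction of it.

```perl
use strict;
use warnings;

# Hypothetical factory: returns an action that joins its children with
# $separator and tags the result with $rule_name, mimicking the
# [name, values] shape produced by the default action.
sub make_concat {
    my ($rule_name, $separator) = @_;
    return sub {
        my (undef, @children) = @_;   # skip the per-parse argument
        return [ $rule_name, join($separator, @children) ];
    };
}

# Install one generated action per rule into an assumed semantics package.
{
    no strict 'refs';
    *{"My::Actions::concat_$_"} = make_concat($_, '.') for qw(ipv4 hostname);
}

my $per_parse = {};
my $node = My::Actions::concat_ipv4($per_parse, 192, 0, 2, 1);
print "$node->[0] => $node->[1]\n";
```

In a grammar, the generated subs would then be referenced as `action => concat_ipv4` and so on, with the recognizer told to resolve action names in that package via its `semantics_package` argument.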
Re^4: Cannot get Marpa::R2 to prioritise one rule over another by choroba (Cardinal) on Jan 21, 2021 at 21:14 UTC
Re^5: Cannot get Marpa::R2 to prioritise one rule over another by Anonymous Monk on Jan 21, 2021 at 21:56 UTC
Some notes below your chosen depth have not been shown here
Re^2: Cannot get Marpa::R2 to prioritise one rule over another by Anonymous Monk on Jan 21, 2021 at 20:55 UTC
Thanks for this attempt, but I'm not sure that defining `hostname` as a fixed number of `DOMAIN` components, or defining a limited set of `EXT` suffixes, is the right way to go. Hostnames can be arbitrarily long, at least in terms of subdomains, and the list of top-level domains is growing by the day.

I'm probably going to settle for just capturing `NAME` and laying off the semantics of IPv4, (later) IPv6, and neither of those to a custom action. Given the complexity of the problem (esp. IPv6), that is likely the best way forward.

J.
Re: Cannot get Marpa::R2 to prioritise one rule over another by duelafn (Parson) on Jan 21, 2021 at 16:54 UTC
~~I don't have a fix for your actual problem, but~~ The reason it refuses to select the ipv4 is because of longest-token matching. `NAME` matches a longer token than `NUMBER`, therefore it always wins.

Update: Change NAME to not accept a dot (and update the hostname rule) and then it will work as you originally had:

my $rules = <<'END_OF_GRAMMAR';
    lexeme default  = latm => 1
    :default        ::= action => [name,values]
    :start          ::= <entry>

    <entry>         ::= <op> (SP) <hostaddr4>
    <op>            ::= 'add' | 'remove'

    <ipv4>          ::= NUMBER ('.') NUMBER ('.') NUMBER ('.') NUMBER
    <hostname>      ::= NAME+ separator => DOT

    <hostaddr4>     ::= <ipv4> | <hostname>

    DOT             ~ '.'
    SP              ~ [\s]+
    NAME            ~ [^\s.]+
    NUMBER          ~ [\d]+
END_OF_GRAMMAR

Good Day,
Dean
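The longest-token behaviour described above can be illustrated without Marpa. The sketch below is a plain-Perl simulation, not Marpa's lexer: `longest_at` is a hypothetical helper that tries each candidate pattern at one input offset and reports which names produce the longest match. With the original `[\S]+` only NAME survives, so the parser never sees a NUMBER there and the grammar has no alternative left to rank. With the dot excluded from NAME, both lexemes tie.

```perl
use strict;
use warnings;

# Plain-Perl stand-in for longest-match lexing; this is not Marpa's lexer.
# Returns the longest match length at $pos and the names that achieve it.
sub longest_at {
    my ($lexemes, $input, $pos) = @_;
    my %len;
    for my $name (sort keys %$lexemes) {
        my $re = $lexemes->{$name};
        pos($input) = $pos;
        $len{$name} = length $1 if $input =~ /\G($re)/gc;
    }
    return unless %len;
    my ($max) = sort { $b <=> $a } values %len;
    return ($max, grep { $len{$_} == $max } sort keys %len);
}

my %original = (NAME => qr/\S+/,     NUMBER => qr/\d+/);
my %no_dots  = (NAME => qr/[^\s.]+/, NUMBER => qr/\d+/);

for my $case ([original => \%original], [no_dots => \%no_dots]) {
    my ($label, $lexemes) = @$case;
    # Offset 4 is the first character after "add ".
    my ($len, @winners) = longest_at($lexemes, 'add 192.0.2.1', 4);
    print "$label: $len chars, matched by @winners\n";
}
```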
Re^2: Cannot get Marpa::R2 to prioritise one rule over another by Anonymous Monk on Jan 21, 2021 at 20:46 UTC
Thanks. I won't say the reasoning makes absolute sense to me yet, but I do confirm that your approach does allow differentiation between IPv4 and not-IPv4 at parse-time -- albeit one that I have to recompose back to a single string.

Given that I'm also going to have to handle IPv6 at some point, and the complexities involved in that, I'm probably going to flatten the `<hostaddr4>` rule to simply `<hostaddr> ::= NAME` and lay off to a custom action to determine IPv4, IPv6 or hostname in my larger grammar. It would have been nice to formalise support for the three types in the grammar definition, but I'm not going to lose sleep over it.

That said, if you (or anyone) can shed light on why the order of rules, or use of || vs | makes no difference, I would be keen to understand. Props to the package author for taking the time to document thoroughly, but it's not an easy read for someone for whom this isn't going to be a full-time gig!
Re: Cannot get Marpa::R2 to prioritise one rule over another by Anonymous Monk on Jan 21, 2021 at 10:07 UTC
Regexp::Common::net
Regexp::Common::URI::RFC1035
Re^2: Cannot get Marpa::R2 to prioritise one rule over another by Anonymous Monk on Jan 21, 2021 at 20:21 UTC
While these regex libraries are useful, I don't think I can fold them directly into a Marpa DSL definition. From what I can tell, I can only use individual character classes, and would have to compose a number of lexeme tokens to match the relative complexity of the IPv4 (and especially IPv6) expressions.

In my more-specific application, I am indeed intending to validate the IPs externally -- probably with Net::IP. To that end, I may well alter the DSL just to accept NAME, and lay off the exact semantics to a subroutine with an `if (is_ip4(...)) { ... } elsif (is_ip6(...)) { ... } else { assume_hostname(...) }` flavour.

Thanks for looking in...

J.
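A minimal sketch of that post-parse classification might look like the following. It is only an illustration: `classify_hostaddr` is a hypothetical name, and the core `Socket` module's `inet_pton` is used here as a stand-in for the Net::IP validation mentioned above (it assumes a platform where `AF_INET6` is available). Anything that is neither a valid IPv4 nor IPv6 literal is simply assumed to be a hostname, with no further checking.

```perl
use strict;
use warnings;
use Socket qw(inet_pton AF_INET AF_INET6);

# Hypothetical helper: label a NAME token after the parse.
# inet_pton returns undef when the text is not a valid address literal.
sub classify_hostaddr {
    my ($token) = @_;
    return [ ipv4     => $token ] if defined inet_pton(AF_INET,  $token);
    return [ ipv6     => $token ] if defined inet_pton(AF_INET6, $token);
    return [ hostname => $token ];   # assumption: everything else is a hostname
}

for my $token (qw(192.0.2.1 2001:db8::1 www.example.org 42.perl.org)) {
    my ($type, $value) = @{ classify_hostaddr($token) };
    print "$type: $value\n";
}
```

Returning a `[type, value]` pair keeps the result in the same shape as the `[name,values]` nodes the rest of the parse tree uses.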

