in reply to HOP::Lexer not doing what I expected

OK guys, it's getting to look worse all the time. I found a much simpler example of something that I think is going terribly wrong, and I'd like you to chew it over.
    use HOP::Lexer 'string_lexer';

    my $text  = 'xselectx';
    my $lexer = string_lexer(
        $text,
        [ KEYWORD => qr/select/i ],
        [ WORD    => qr/\w+/ ],
    );
(n.b. string_lexer is just a routine in the module that wraps the input string in an iterator, and then calls make_lexer, so we don't have to do it by hand. The code we have to write just becomes a bit simpler.)

Tell me that the result it parses into is what you think makes sense. Because it doesn't make any sense to me at all:

['WORD','x'], ['KEYWORD','select'], ['WORD','x']

This is just so messed up.

Update: I just read cmarcelo's reply... You want me to swap the rules? OK...

    use HOP::Lexer 'string_lexer';

    my $text  = 'select xselectx';
    my $lexer = string_lexer(
        $text,
        [ WORD    => qr/\w+/ ],
        [ KEYWORD => qr/select/i ],
    );
Outcome:
['WORD','select'], ' ', ['WORD','xselectx']
No good.

Re^2: HOP::Lexer not doing what I expected
by cmarcelo (Scribe) on Nov 11, 2006 at 22:11 UTC
    Sorry, but I don't get what the problem is.
    [KEYWORD => qr/select/i], [WORD => qr/\w+/ ],

    What exactly were you expecting as the result of the rules above for the string xselectx? Are you expecting it to deal with word boundaries, such as matching KEYWORD only when it's separated by spaces or something, so that it doesn't match inside xselectx?

    And according to my explanation, this is the right order: WORD matches whatever KEYWORD matches, but KEYWORD is more specific, so it goes first.

      Word boundaries? Hmm... interesting take. It's not something that's been mentioned in the docs, or in the perl.com article.

      Where it really does go wrong, in my opinion, is that it doesn't make any attempt to find a leftmost match. That's what all lexers are supposed to do. So you can rightfully argue that it must find "select" in the string "selectx", but it makes no sense to skip the first "x" in "xselectx". No other lexer or parser in the world would do that, not by design.
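To make "leftmost match" concrete, here is a minimal sketch in plain Perl of a lexer that works that way. It is not how HOP::Lexer is implemented; `leftmost_lex` is a hypothetical helper, and it assumes no rule can match the empty string. At each position it tries the rules in order, anchored there with `\G`, so nothing to the left of a match is ever skipped.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HOP::Lexer. Rules are [LABEL => qr/.../]
# pairs and must not match the empty string, or the loop would never advance.
sub leftmost_lex {
    my ($text, @rules) = @_;
    my @tokens;
    pos($text) = 0;
    POSITION: while (pos($text) < length $text) {
        for my $rule (@rules) {
            my ($label, $re) = @$rule;
            # \G anchors the match at the current position; /c keeps that
            # position when the match fails.
            if ($text =~ /\G($re)/gc) {
                push @tokens, [$label, $1];
                next POSITION;
            }
        }
        # No rule matches here: pass one character through unlexed.
        $text =~ /\G(.)/gcs;
        push @tokens, $1;
    }
    return @tokens;
}

my @rules = ([KEYWORD => qr/select/i], [WORD => qr/\w+/]);
for my $text ('xselectx', 'selectx') {
    my @shown = map { ref $_ ? '[' . join(' ', @$_) . ']' : "'$_'" }
                leftmost_lex($text, @rules);
    print "$text: @shown\n";
}
```

With these two rules, 'xselectx' comes out as a single WORD, while 'selectx' is lexed as KEYWORD 'select' followed by WORD 'x'.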

        The question of word boundaries doesn't show up because the example the author uses doesn't need it. So all works fine (at least in the article). But in your example it makes a difference.

        I know very little about lexers, but I agree that using split causes unexpected behavior (not matching the leftmost rule). Still, it has proven useful in the article's example, where rules are created only for what matters (ignoring the = symbol, for example). I don't know how hard or easy it would be to do that with leftmost rule matching. The use of split here is convenient.
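The effect of split can be reproduced without the module. The following is a rough sketch, assuming (as described above) that each rule is applied with a capturing split to whatever text the earlier rules left unmatched. `split_lex` is a hypothetical stand-in, not HOP::Lexer's actual code, and it assumes the rule patterns contain no capture groups of their own.

```perl
use strict;
use warnings;

# Hypothetical stand-in, not HOP::Lexer's code. Plain strings in @items are
# still unlexed text; array refs are finished tokens.
sub split_lex {
    my ($text, @rules) = @_;
    my @items = ($text);
    for my $rule (@rules) {
        my ($label, $re) = @$rule;
        my @next;
        for my $item (@items) {
            if (ref $item) { push @next, $item; next }   # already a token
            # The capture group makes split return the matches too; with a
            # single group they land at the odd indexes.
            my @parts = split /($re)/, $item;
            for my $i (0 .. $#parts) {
                next if $parts[$i] eq '';
                push @next, $i % 2 ? [$label, $parts[$i]] : $parts[$i];
            }
        }
        @items = @next;
    }
    return @items;
}

sub show {
    return join ', ', map { ref $_ ? "['$_->[0]','$_->[1]']" : "'$_'" } @_;
}

# KEYWORD first: 'select' is cut out of the middle, then WORD takes the rest.
print show(split_lex('xselectx', [KEYWORD => qr/select/i], [WORD => qr/\w+/])), "\n";

# WORD first: every run of word characters is consumed before KEYWORD runs.
print show(split_lex('select xselectx', [WORD => qr/\w+/], [KEYWORD => qr/select/i])), "\n";
```

Under those assumptions the sketch gives the same two results reported at the top of the thread: the first rule cuts up the whole input before the second rule sees anything, so rule order, not position in the text, decides which token wins.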

        And note, I didn't say that x must be skipped (considered garbage), at least under the rules I mentioned: it's matched by WORD, and then KEYWORD matches select. HOP::Lexer knows nothing about boundaries, nor does it give special meaning to \s; you must tell it if you want to match select only in " select " or " select, " but not in "selectx".
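One way to "tell it" is to put the boundaries into the pattern itself. This is only a sketch of the regex side, in plain Perl; the assumption is that the KEYWORD rule would then be written as `[KEYWORD => qr/\bselect\b/i]`, which has not been tested against HOP::Lexer here.

```perl
use strict;
use warnings;

# \b matches between a word character and a non-word character (or the
# string edge), so 'select' glued to other letters is rejected.
my $keyword = qr/\bselect\b/i;

for my $text (' select ', ' select, ', 'selectx', 'xselectx') {
    printf "%-12s %s\n", "'$text'",
        $text =~ $keyword ? 'keyword' : 'no keyword';
}
```

The first two strings match and the last two do not, which is the distinction described above.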