in reply to Re: Regex Parsing Style
in thread Regex Parsing Style

Thank you, ikegami.

I think the ordering of alternatives is off. The third and fourth alternatives can never be reached after the first and second alternatives. Or am I missing something?

Re^3: Regex Parsing Style
by aquarium (Curate) on Nov 26, 2010 at 03:12 UTC
    that's a heuristic you need to work out. if the alternatives are not exclusive, then the leftmost alternative that can match will always win. if it were me, i'd run each regex separately and collect a flag for every one that matches. then, at the end of this parsing, you can decide exactly what you want to happen based on combinations of flags, possibly in a switch construct.
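A minimal sketch of the "run each regex separately and collect flags" idea. The class names, the `@CLASSES` table and the `classes_at` helper are hypothetical, and the patterns are only loosely modelled on the token classes discussed in this thread:

```perl
use strict;
use warnings;

# Hypothetical token classes, each anchored with \G at the current position.
my @CLASSES = (
    [ literal => qr/\G [^\\]          /x ],
    [ ucn     => qr/\G \\u[0-9a-f]{4} /x ],
    [ escaped => qr/\G \\["^\\]       /x ],
    [ control => qr/\G \\[tnfr]       /x ],
);

# Return the name of every class that matches $text at offset $pos,
# instead of stopping at the first one that matches.
sub classes_at {
    my ($text, $pos) = @_;
    my @hits;
    for my $class (@CLASSES) {
        my ($name, $re) = @$class;
        pos($text) = $pos;                      # retry from the same offset
        push @hits, $name if $text =~ /$re/gc;
    }
    return @hits;
}

for my $sample ('a', '\\u0041', '\\t', '\\#') {
    my @hits = classes_at($sample, 0);
    printf "%-8s => %s\n", $sample, @hits ? join(',', @hits) : 'no match';
}
```

More than one name in the result would mean the classes overlap at that position, and an empty result would mean no class handles the input, so the decision logic can treat both cases explicitly.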
    the hardest line to type correctly is: stty erase ^H
      that's a heuristic you need to work out.

      Actually, there was a bug, which ikegami quickly fixed. In the original version of his lexing code, the first two alternative patterns matched every possible valid, non-empty string, making the remaining two alternative patterns unreachable.
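To illustrate how an earlier alternative can make a later one unreachable, here is a small sketch. These patterns are hypothetical and are not ikegami's original code; `first_token_class` is an illustrative helper:

```perl
use strict;
use warnings;

# Hypothetical: the first alternative accepts any character, including a
# backslash, so the escape-sequence alternative can never win.
my $shadowed = qr/\G (?: (?<literal> . ) | (?<escape> \\[tnfr] ) )/x;

# Excluding the backslash from the literal class makes the escape reachable.
my $fixed = qr/\G (?: (?<literal> [^\\] ) | (?<escape> \\[tnfr] ) )/x;

# Report which named alternative matched at the start of $text.
sub first_token_class {
    my ($re, $text) = @_;
    return unless $text =~ $re;
    return defined $+{literal} ? 'literal' : 'escape';
}

print first_token_class($shadowed, '\\t'), "\n";   # prints "literal"
print first_token_class($fixed,    '\\t'), "\n";   # prints "escape"
```

Perl tries alternatives left to right and takes the first one that lets the match succeed, so the order only stops mattering once the alternatives are truly exclusive.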

      if the alternatives are not exclusive, then the leftmost alternative that can match will always win.

      I explained the options are mutually exclusive in my original post. It's important that each alternative pattern matches one and only one class of token.

        i believe you that the options are mutually exclusive, but i never program on that assumption. it's difficult to guarantee that non-trivial regexes will indeed match exclusively on all input data. hence i would either pre-run all the regexes (or do something similar) to eliminate non-exclusivity, OR allow all regexes to match against the input and then make normal logic decisions, plus sane decisions on possible anomalies. that's the kind of defensive programming i'd do if time allows. it's always a balancing act in handling program input, but i think a little scepticism in programming itself is a good thing. hence my advice, even though it's not in line with the spec. take it or leave it as you please.
Re^3: Regex Parsing Style
by ikegami (Patriarch) on Nov 26, 2010 at 00:39 UTC
    Bug. Fixed. Thanks.

      This is still a pared-down version of my actual script, but it more accurately represents what I'm really doing: counting characters.

      use strict;
      use warnings;

      my %CONTROL_CODE = (
          '\t' => 0x09,
          '\n' => 0x0a,
          '\f' => 0x0c,
          '\r' => 0x0d,
      );

      my %character_count_by;

      while (<>) {
          chomp;
          pos = 0;

          TOKEN:
          while (1) {
              # Literal character
              if (m/\G ([^\\]) /gcx) {
                  $character_count_by{ord $1}++;
                  next TOKEN;
              }

              # Universal Character Name
              if (m/\G \\u([0-9a-f]{4}) /gcx) {
                  $character_count_by{hex $1}++;
                  next TOKEN;
              }

              # Literal character escape sequence
              if (m/\G \\(["^\\]) /gcx) {
                  $character_count_by{ord $1}++;
                  next TOKEN;
              }

              # Control code escape sequence
              if (m/\G (\\[tnfr]) /gcx) {
                  $character_count_by{$CONTROL_CODE{$1}}++;
                  next TOKEN;
              }

              # End of string
              if (m/\G \z /gcx) {
                  last TOKEN;
              }

              # Invalid character
              die "Invalid character on line $. of file $ARGV\n";
          }
      }

      for my $code (sort { $a <=> $b } keys %character_count_by) {
          printf "U+%04x\t%d\n", $code, $character_count_by{$code};
      }

      UPDATE: Changed \Z to \z and updated error message of event that can never happen.

        /\Z/ should be /\z/ (my fault), and you shouldn't have kept the "chr".

        /\\(["^\\])/ looks buggy. Are you sure those are the only symbols that can be escaped? If it's not buggy, you'll need to adjust the error message since no case will handle '\#', for example.
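One possible way to handle the unhandled-escape case is a catch-all branch that reports the offending sequence by name. This is only a sketch: the `classify_escape` helper, its return values and the message wording are assumptions, not part of the script above:

```perl
use strict;
use warnings;

# Classify the escape sequence at the start of $text (hypothetical helper).
# Dies with a specific message when the escape is not one of the known kinds.
sub classify_escape {
    my ($text) = @_;
    return 'literal' if $text =~ m/\A \\ ["^\\] /x;
    return 'control' if $text =~ m/\A \\ [tnfr] /x;
    return 'ucn'     if $text =~ m/\A \\ u [0-9a-f]{4} /x;
    die "Unknown escape sequence \\$1\n" if $text =~ m/\A \\ (.) /xs;
    die "Not an escape sequence\n";
}

for my $sample ('\\"', '\\n', '\\#') {
    my $class  = eval { classify_escape($sample) };
    my $result = defined $class ? $class : "error: $@";
    chomp $result;
    print "$sample => $result\n";
}
```

With a branch like this, input such as '\#' produces a message about the escape sequence itself rather than a generic "Invalid character" error.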