in reply to Regex Parsing Style

my $out;
for ($in) {
   pos = 0;
   for (;;) {
      if (/\G ([^\\]+)            /xsgc) { $out .= $1; }
      if (/\G \\u([0-9a-fA-F]{4}) /xsgc) { $out .= chr(hex($1));      next; }
      if (/\G \\([tnfr])          /xsgc) { $out .= $CONTROL_CODE{$1}; next; }
      if (/\G \\(.)               /xsgc) { $out .= $1;                next; }
      if (/\G \z                  /xsgc) {                            last; }
      die;  # Ends with unescaped "\".
   }
}

printf("U+%04x\n", ord($_)) for $out =~ /(.)/sg;

Update: Fixed arrangement of conditions.
Update: Changed /\Z/ to /\z/.

Replies are listed 'Best First'.
Re^2: Regex Parsing Style
by Jim (Curate) on Nov 26, 2010 at 00:35 UTC

    Thank you, ikegami.

    I think the ordering of alternatives is off. The third and fourth alternatives can never be reached after the first and second alternatives. Or am I missing something?

      that's a heuristic you need to work out. if the alternatives are not exclusive, then the leftest match will always match first. if it was me, i'd run each regex separately and additively collect flags for any matches. then at the end of this parsing you can decide exactly what you want to happen based on combinations of flags..possibly in a switch construct.
      the hardest line to type correctly is: stty erase ^H
        that's a heuristic you need to work out.

        Actually, there was a bug, which ikegami quickly fixed. In the original version of his lexing code, the first two alternative patterns matched every possible valid, non-empty string, making the remaining two alternative patterns unreachable.

        if the alternatives are not exclusive, then the leftest match will always match first.

        I explained the options are mutually exclusive in my original post. It's important that each alternative pattern matches one and only one class of token.

      Bug. Fixed. Thanks.

        This is still a paired-down version of my actual script, but it more accurately represents what I'm really doing: counting characters.

        use strict; use warnings; my %CONTROL_CODE = ( '\t' => 0x09, '\n' => 0x0a, '\f' => 0x0c, '\r' => 0x0d, ); my %character_count_by; while (<>) { chomp; pos = 0; TOKEN: while (1) { # Literal character if (m/\G ([^\\]) /gcx) { $character_count_by{ord $1}++; next TOKEN; } # Universal Character Name if (m/\G \\u([0-9a-f]{4}) /gcx) { $character_count_by{hex $1}++; next TOKEN; } # Literal character escape sequence if (m/\G \\(["^\\]) /gcx) { $character_count_by{ord $1}++; next TOKEN; } # Control code escape sequence if (m/\G (\\[tnfr]) /gcx) { $character_count_by{$CONTROL_CODE{$1}}++; next TOKEN; } # End of string if (m/\G \z /gcx) { last TOKEN; } # Invalid character die "Invalid character on line $. of file $ARGV\n"; } } for my $code (sort { $a <=> $b } keys %character_count_by) { printf "U+%04x\t%d\n", $code, $character_count_by{$code}; }

        UPDATE: Changed \Z to \z and updated error message of event that can never happen.