Jim has asked for the wisdom of the Perl Monks concerning the following question:

On this day, I'm thankful for PerlMonks. My brethren in the monastery have been very helpful to me this year and I'm grateful for their generosity, kindness and alacrity.

I have a Perl style question about parsing using a regular expression pattern. I'm matching mutually exclusive alternatives and then testing which of the four alternatives matched using the defined function and nested conditional statements.

    use strict;
    use warnings;
    use English qw( -no_match_vars );

    my $TOKEN_PATTERN = qr{
        ([^\\])             # 1  Literal character (g)
        | \\u([0-9a-f]{4})  # 2  Universal Character Name (\u263a)
        | \\(["^\\])        # 3  Literal character escape sequence (\")
        | \\([tnfr])        # 4  Control code escape sequence (\n)
    }x;

    my %CONTROL_CODE = (
        t => 0x09,
        n => 0x0a,
        f => 0x0c,
        r => 0x0d,
    );

    while (my $line = <>) {
        chomp $line;

        while ($line =~ m/$TOKEN_PATTERN/g) {
            my $token = $LAST_PAREN_MATCH;

            # Decode tokens...
            my $code = defined $1 ? ord $token
                     : defined $2 ? hex $token
                     : defined $3 ? ord $token
                     : defined $4 ? $CONTROL_CODE{$token}
                     :              undef
                     ;

            printf "U+%04x\n", $code if defined $code;
        }
    }

Is there a better way to do this? What I'm doing works, but it feels clunky. Any suggestions for improvement?

Happy Thanksgiving!

Re: Regex Parsing Style
by ikegami (Patriarch) on Nov 25, 2010 at 23:30 UTC
    my $out;
    for ($in) {
       pos = 0;
       for (;;) {
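          # No "next" after the literal run: fall through to the escape tests.
          if (/\G ([^\\]+)            /xsgc) { $out .= $1; }
          if (/\G \\u([0-9a-fA-F]{4}) /xsgc) { $out .= chr(hex($1));           next; }
          # %CONTROL_CODE holds code points, so convert to a character.
          if (/\G \\([tnfr])          /xsgc) { $out .= chr($CONTROL_CODE{$1}); next; }
          if (/\G \\(.)               /xsgc) { $out .= $1;                     next; }
          if (/\G \z                  /xsgc) {                                 last; }
          die;  # Ends with unescaped "\".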
       }
    }
    
    printf("U+%04x\n", ord($_)) for $out =~ /(.)/sg;
    

    Update: Fixed arrangement of conditions.
    Update: Changed /\Z/ to /\z/.
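
    For anyone who wants to run the loop above, here is a minimal self-contained sketch. It assumes %CONTROL_CODE holds code points, as in the question, so the lookup is wrapped in chr(). The name decode_escapes and the sample input are hypothetical and do not come from the thread.

    ```perl
    use strict;
    use warnings;

    # Assumption: escape letter => code point, as in the original question.
    my %CONTROL_CODE = ( t => 0x09, n => 0x0a, f => 0x0c, r => 0x0d );

    # decode_escapes() is a hypothetical wrapper around the \G loop.
    sub decode_escapes {
        my ($in) = @_;
        my $out = '';
        for ($in) {
            pos($_) = 0;
            for (;;) {
                # /c keeps pos() where it is when a match fails,
                # so each test resumes at the same offset.
                if (/\G ([^\\]+)            /xsgc) { $out .= $1; }
                if (/\G \\u([0-9a-fA-F]{4}) /xsgc) { $out .= chr(hex($1));           next; }
                if (/\G \\([tnfr])          /xsgc) { $out .= chr($CONTROL_CODE{$1}); next; }
                if (/\G \\(.)               /xsgc) { $out .= $1;                     next; }
                if (/\G \z                  /xsgc) {                                 last; }
                die "Input ends with an unescaped backslash\n";
            }
        }
        return $out;
    }

    # The sample input is the 11 characters  a\n\u263a\"
    my $decoded = decode_escapes('a\\n\\u263a\\"');
    printf "U+%04x\n", ord($_) for split //, $decoded;
    ```

    With that sample input, the sketch should print U+0061, U+000a, U+263a and U+0022, one per line.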

      Thank you, ikegami.

      I think the ordering of alternatives is off. The third and fourth alternatives can never be reached after the first and second alternatives. Or am I missing something?

        That's a heuristic you need to work out. If the alternatives are not mutually exclusive, the leftmost alternative that matches always wins. If it were me, I'd run each regex separately and accumulate a flag for every one that matches. Then, at the end of the parse, you can decide exactly what should happen based on the combination of flags, possibly in a switch construct.
        the hardest line to type correctly is: stty erase ^H
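
        A rough sketch of both points in the post above: that the leftmost alternative wins, and the "collect flags, then decide" idea. The flag names and the helpers escape_flags and describe are hypothetical, and the patterns are only loosely modelled on the ones in this thread.

        ```perl
        use strict;
        use warnings;

        # Hypothetical flag names, one bit per pattern.
        use constant {
            UCN        => 1,   # \uXXXX
            CONTROL    => 2,   # \t \n \f \r
            ANY_ESCAPE => 4,   # backslash followed by any character
        };

        # Run each pattern on its own and OR in a flag for every one that matches.
        sub escape_flags {
            my ($token) = @_;
            my $flags = 0;
            $flags |= UCN        if $token =~ /\A \\u[0-9a-fA-F]{4} /x;
            $flags |= CONTROL    if $token =~ /\A \\[tnfr]          /x;
            $flags |= ANY_ESCAPE if $token =~ /\A \\.               /xs;
            return $flags;
        }

        # Decide once, from the combination of flags, most specific first.
        sub describe {
            my ($token) = @_;
            my $flags = escape_flags($token);
            return ($flags & UCN)        ? 'universal character name'
                 : ($flags & CONTROL)    ? 'control code escape'
                 : ($flags & ANY_ESCAPE) ? 'literal character escape'
                 :                         'not an escape';
        }

        # Leftmost alternative wins: 'foo' matches first, so 'foobar' is never tried.
        my ($picked) = 'foobar' =~ /\A(foo|foobar)/;
        print "alternation picked: $picked\n";

        print "$_ => ", describe($_), "\n" for '\\n', '\\u263a', '\\"', 'g';
        ```

        Here the overlapping patterns can all match the same token (a control escape sets both CONTROL and ANY_ESCAPE), and the priority is decided in one place rather than by the order of the alternatives.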
        Bug. Fixed. Thanks.