Re: C style strings

For parsing the entire C language, you should probably read up on the different parsing algorithms and tools available, for instance LR(1) table parsers or recursive descent parsers. (The latter are much easier to understand and write, but can get fairly verbose and slow for a language as complex as C.)
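To give a feel for what "recursive descent" means, here is a minimal sketch for a toy arithmetic grammar (not C). The grammar and the function names parse_expr, parse_term and parse_factor are made up for illustration. Each grammar rule becomes one subroutine, and the subroutines share the match position of the input string through \G and /gc.

```perl
use v5.36;

# Toy grammar (hypothetical, for illustration only):
#   expr   := term   (('+' | '-') term)*
#   term   := factor (('*' | '/') factor)*
#   factor := NUMBER | '(' expr ')'

# Every sub takes a reference to the input string, so pos() is shared.
sub parse_expr ($src) {
    my $value = parse_term($src);
    while ($$src =~ /\G\s*([-+])/gc) {
        my $op  = $1;                      # save before recursing
        my $rhs = parse_term($src);
        $value = $op eq '+' ? $value + $rhs : $value - $rhs;
    }
    return $value;
}

sub parse_term ($src) {
    my $value = parse_factor($src);
    while ($$src =~ /\G\s*([*\/])/gc) {
        my $op  = $1;
        my $rhs = parse_factor($src);
        $value = $op eq '*' ? $value * $rhs : $value / $rhs;
    }
    return $value;
}

sub parse_factor ($src) {
    return 0 + $1 if $$src =~ /\G\s*(\d+)/gc;
    if ($$src =~ /\G\s*\(/gc) {
        my $value = parse_expr($src);
        $$src =~ /\G\s*\)/gc or die "expected ')'";
        return $value;
    }
    die "unexpected input at offset " . (pos($$src) // 0);
}

my $input  = '2 + 3 * (4 - 1)';
my $result = parse_expr(\$input);
say $result;    # prints 11
```

The /c flag matters here: without it, a failed match would reset pos() and the next rule would start over from the beginning of the string.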

But as for the best way to do it in Perl: you should get more familiar with the regex engine. I'm not going to prove or disprove this, but it would not surprise me at all if it were possible to parse the entire C language with a single regex. The main reasons to use regexes are speed and (as counter-intuitive as this may sound) code clarity. The regex will end up being awkward and ugly and hard to understand, but not nearly as hard to understand as the sort of code you're writing above would be once expanded out to the entire C language. As for speed, procedural Perl that iterates character by character and inflates each character into its own string is going to be extremely slow, whereas the regex engine is fairly fast, even compared to C.
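If you want to check the speed claim on your own machine, a sketch like the following can do it. The two function names and the test string are made up for illustration; Benchmark is a core module. I'm not quoting any timings, since they depend on your perl build and your input.

```perl
use v5.36;
use Benchmark qw(cmpthese);

# Hypothetical test data: "abc" followed by a backslash and "n", repeated.
my $input = 'abc\\n' x 10_000;

# Walk the string one character at a time, copying each into its own scalar.
sub count_escapes_by_char ($s) {
    my $count = 0;
    for (my $i = 0; $i < length $s; $i++) {
        my $c = substr $s, $i, 1;
        if ($c eq '\\') { $count++; $i++ }   # skip the escaped character
    }
    return $count;
}

# Let the regex engine find every backslash plus the character after it.
sub count_escapes_by_regex ($s) {
    my $count = () = $s =~ /\\./gs;
    return $count;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    by_char  => sub { count_escapes_by_char($input) },
    by_regex => sub { count_escapes_by_regex($input) },
});
```

Both functions count the same thing, so any difference in the table comes from how the scanning is done rather than from what is computed.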

So, here is an example that parses C strings (according to your code; I didn't read the spec) using nothing but a regular expression:

Note that the /x flag on a regex lets you add whitespace and comments.

use v5.36;
my %named_char= (
  a => "\a", b => "\b", t => "\t", n => "\n", v => chr(11), f => "\f", r => "\r"
);
my $line= <STDIN>;
if ($line =~ /
  ^"(           # starts with doublequote
    (?>               # no backtracking
      [^"\\]+             # allow any character other than \\ or "
    | \\[^xuU0-7]         # escaped character other than \x \u \U \0-7
    | \\[0-7]+            # octal escape
    | \\x[0-9a-fA-F]{1,2} # hex escape
    | \\u[0-9a-fA-F]{1,4} # unicode escape
    | \\U[0-9a-fA-F]{1,8} # long unicode escape
    )*                # repeat as needed
  )"         # stop at next doublequote
/x) {
  my $literal= $1;
  my $op;
  $literal =~ s/\\(
     [^xuU0-7]          (?{ $op=sub{ $named_char{$1} || $1 } })
   | [0-7]+             (?{ $op=sub{ chr oct $1 } })
   | x[0-9a-fA-F]{1,2}  (?{ $op=sub{ chr hex substr $1,1 } })
   | u[0-9a-fA-F]{1,4}  (?{ $op=sub{ chr hex substr $1,1 } })
   | U[0-9a-fA-F]{1,8}  (?{ $op=sub{ chr hex substr $1,1 } })
   ) / &$op /xge;
  use DDP; p($literal);
}

That last bit uses a special feature of the Perl regex engine (see perldoc perlre). The syntax (?{ ... }) runs a bit of Perl at the moment the regex engine successfully matches up to it. In this case it just changes which subroutine is stored in $op, and that subroutine doesn't execute until the pattern finishes matching. When it does, the /e switch causes the replacement to be evaluated as code, which runs the subroutine stored in $op, and the string it returns replaces the escape sequence. The /g switch replaces all occurrences in the string, and then your literal is ready to use. I print it out with Data::Printer (shorthand DDP).
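If the (?{ ... }) plus /e combination is hard to see inside the bigger example, here is the same idea cut down to two alternatives. This is only a sketch; the input string and the two handlers are made up for illustration. One difference from the code above: the captured text is passed to the handler as an argument instead of being read from $1 inside the closure, which makes it a little more obvious where the value comes from.

```perl
use v5.36;

my $op;                                   # handler chosen by the matching branch
my $text = 'width=0x1F height=42';

$text =~ s/\b(
     0x[0-9a-fA-F]+  (?{ $op = sub { hex $_[0] } })   # hex literal: convert to decimal
   | [0-9]+          (?{ $op = sub { $_[0] * 2 } })   # plain number: double it
   )\b/ $op->($1) /xge;

say $text;    # prints: width=31 height=84
```

Each branch only records what should happen; the actual work runs in the replacement, once per match, after the whole pattern has succeeded.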
