To parse the entire C language, you should probably read up on the parsing algorithms and tools available, for instance LR(1) table-driven parsers or recursive descent parsers. (The latter are much easier to understand and write, but can get fairly verbose and slow for a language as complex as C.)
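To give a feel for what "recursive descent" means, here is a minimal sketch for a toy arithmetic grammar, nowhere near C. The grammar and the sub names (parse_expr, parse_term, parse_factor, evaluate) are my own choices for illustration; the idea is just that each grammar rule becomes one subroutine that calls the others.

```perl
use strict;
use warnings;

# Toy grammar (illustrative only, far smaller than C):
#   expr   := term   (('+' | '-') term)*
#   term   := factor (('*' | '/') factor)*
#   factor := NUMBER | '(' expr ')'
# Each rule is one subroutine; pos($$src) tracks the read position,
# and \G with /gc matches at that position without resetting it on failure.

sub parse_expr {
    my ($src) = @_;
    my $value = parse_term($src);
    while ($$src =~ /\G\s*([-+])/gc) {
        my $op  = $1;                      # copy before recursing clobbers $1
        my $rhs = parse_term($src);
        $value = $op eq '+' ? $value + $rhs : $value - $rhs;
    }
    return $value;
}

sub parse_term {
    my ($src) = @_;
    my $value = parse_factor($src);
    while ($$src =~ m{\G\s*([*/])}gc) {
        my $op  = $1;
        my $rhs = parse_factor($src);
        $value = $op eq '*' ? $value * $rhs : $value / $rhs;
    }
    return $value;
}

sub parse_factor {
    my ($src) = @_;
    return $1 if $$src =~ /\G\s*([0-9]+)/gc;
    if ($$src =~ /\G\s*\(/gc) {
        my $value = parse_expr($src);
        $$src =~ /\G\s*\)/gc
            or die "expected ')' at offset " . (pos($$src) // 0) . "\n";
        return $value;
    }
    die "unexpected input at offset " . (pos($$src) // 0) . "\n";
}

# Parse a whole string and insist that nothing is left over.
sub evaluate {
    my ($text) = @_;
    my $value = parse_expr(\$text);
    $text =~ /\G\s*\z/gc
        or die "trailing input at offset " . (pos($text) // 0) . "\n";
    return $value;
}

print evaluate("2 + 3 * (4 - 1)"), "\n";   # prints 11
```

You can see how this gets verbose: C has many more precedence levels and statement forms, and each one needs its own sub.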

But as for the best way to do it in Perl: you should get more familiar with the regex engine. I'm not going to prove or disprove this, but it would not surprise me at all if it were possible to parse the entire C language with a single regex. The main reasons to use regexes are speed and (as counter-intuitive as this may sound) code clarity. The regex will end up awkward, ugly and hard to understand, but not nearly as hard to understand as the sort of code you're writing above once it is expanded to cover the entire C language. As for speed, procedural Perl that iterates character by character and copies each character into its own string is going to be extremely slow, whereas the regex engine is fairly fast, even compared to C.
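If you want to check the speed claim on your own machine, a rough sketch like the following would do. The two helpers (end_by_char, end_by_regex) and the --bench switch are hypothetical names I made up for this comparison; both helpers find the offset just past the closing quote of a string literal, one with a character loop and one with a single regex.

```perl
use strict;
use warnings;

# Both helpers return the offset just past the closing quote of a string
# literal that starts at offset 0, or -1 if the literal is unterminated.

sub end_by_char {
    my ($text) = @_;
    return -1 unless substr($text, 0, 1) eq '"';
    my $i = 1;
    while ($i < length $text) {
        my $ch = substr($text, $i, 1);      # one new scalar per character
        if    ($ch eq '\\') { $i += 2 }     # skip the escaped character
        elsif ($ch eq '"')  { return $i + 1 }
        else                { $i++ }
    }
    return -1;
}

sub end_by_regex {
    my ($text) = @_;
    # $+[0] is the offset where the whole match ended
    return $text =~ /\A"(?>[^"\\]+|\\.)*"/s ? $+[0] : -1;
}

# Optional timing run: perl thisfile.pl --bench
if (@ARGV && $ARGV[0] eq '--bench') {
    require Benchmark;
    my $sample = '"' . ('abc\\n' x 2000) . '"';
    Benchmark::cmpthese(-1, {
        by_char  => sub { end_by_char($sample) },
        by_regex => sub { end_by_regex($sample) },
    });
}
```

Run it with --bench to get a comparison table from the core Benchmark module; I'm not quoting numbers because they depend on your perl and your hardware.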

So, here is an example that parses C strings (according to your code; I didn't read the spec) using nothing but regular expressions. Note that the /x flag on a regex lets you add whitespace and comments:

use v5.36;
use DDP;    # Data::Printer, used to dump the decoded string

# Single-letter escapes that stand for a control character
my %named_char= (
  a => "\a", b => "\b", t => "\t", n => "\n", v => chr(11), f => "\f", r => "\r"
);
my $line= <STDIN>;
if ($line =~ /
  ^"(           # starts with doublequote
    (?>               # no backtracking
      [^"\\]+             # allow any character other than \\ or "
    | \\[^xuU0-7]         # escaped character other than \x \u \U \0-7
    | \\[0-7]{1,3}        # octal escape, up to three digits
    | \\x[0-9a-fA-F]{1,2} # hex escape
    | \\u[0-9a-fA-F]{1,4} # unicode escape
    | \\U[0-9a-fA-F]{1,8} # long unicode escape
    )*                # repeat as needed
  )"         # stop at next doublequote
/x) {
  my $literal= $1;
  my $op;   # decoder chosen by whichever alternative matches
  $literal =~ s/\\(
     [^xuU0-7]          (?{ $op=sub{ $named_char{$1} || $1 } })
   | [0-7]{1,3}         (?{ $op=sub{ chr oct $1 } })
   | x[0-9a-fA-F]{1,2}  (?{ $op=sub{ chr hex substr($1, 1) } })
   | u[0-9a-fA-F]{1,4}  (?{ $op=sub{ chr hex substr($1, 1) } })
   | U[0-9a-fA-F]{1,8}  (?{ $op=sub{ chr hex substr($1, 1) } })
   ) / &$op /xge;
  p($literal);
}

That last bit uses special features of the Perl regex engine (see perldoc perlre). The syntax (?{ ... }) runs a bit of Perl at the moment the regex engine successfully matches up to that point. In this case it just changes which subroutine is stored in $op; the subroutine itself doesn't execute until the pattern finishes matching. When it does, the /e switch causes the replacement &$op to be evaluated as code, which runs the subroutine stored in $op and returns the string that replaces the escape sequence. The /g switch replaces all occurrences in the string, and then your literal is ready to use. I print it out with Data::Printer (shorthand DDP).
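If the combination of (?{ ... }) and /e is hard to follow in the full example, here is the same trick in isolation. The expand sub and its {placeholder} syntax are made up for this sketch and have nothing to do with C strings; the point is only that each alternative stores a different handler in $op, and the replacement calls whichever one was stored last.

```perl
use strict;
use warnings;

# Hypothetical helper: numbers in braces are doubled, lowercase words in
# braces are upper-cased. The handler is chosen while the pattern matches.
sub expand {
    my ($text) = @_;
    my $op;    # set by whichever alternative matched
    $text =~ s/ \{ (
          [0-9]+   (?{ $op = sub { $_[0] * 2 } })   # handler for numbers
        | [a-z]+   (?{ $op = sub { uc $_[0] } })    # handler for words
        ) \}
      / $op->($1) /xge;
    return $text;
}

print expand("{21} and {abc}"), "\n";   # prints "42 and ABC"
```

One thing to keep in mind: the code block runs whenever the engine reaches it, even on a path that later fails and backtracks. That is harmless here because every alternative overwrites $op and it is only read after a successful match. Proper closures over lexicals like $op inside (?{ ... }) need perl 5.18 or later.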

In reply to Re: C style strings by NERDVANA
in thread C style strings by harangzsolt33
