harangzsolt33 has asked for the wisdom of the Perl Monks concerning the following question:

I have written the following code, which interprets a C-style string. Is there a better way to do this? I have run into so many obstacles; it's incredible how complicated such a simple task can be. Anyway, I want to know whether \u and \U should emit the number in little-endian or big-endian order. What does the specification say? I couldn't find it on Wikipedia or anywhere else.

#!/usr/bin/perl -w

use strict;
use warnings;

my $STRING = '\r\n\t\tHello\tWorld\uFF00\\0\\n\\n\\!!\\"--->\\xf3\\"abc\\x4\\2\\8\\2';

print "\n\n$STRING\n\n";
$STRING = DecodeCString($STRING);
HexDump($STRING);
exit;

##################################################
# String | v2022.12.31
# This function interprets a C style string and
# returns its value. A C string may contain escape
# sequences such as \r \n \t \0 \xFF \" and so on.
# This function decodes these escape sequences and
# returns the resulting string.
#
# If an incomplete escape sequence is found, it
# will be ignored. For example, in "\x=Z" the
# "\x" should be followed by a hexadecimal number.
# In this case, "\x" will be ignored and "=Z"
# will be written to the output. No error
# messages will be displayed at all.
#
# Usage: STRING = DecodeCString(STRING)
#
sub DecodeCString
{
  defined $_[0] or return '';

  # If the input string contains no backslash at all,
  # then we just return it as we found it:
  if (index($_[0], "\\") < 0) { return $_[0]; }

  my $OUTPUT = '';
  my $L = length($_[0]);
  my ($N, $LEN, $OCT, $START, $EXPECT) = (0) x 5;
  my $LAST = $L - 1;

  for (my $i = 0; $i < $L; $i++)
  {
    if (++$N == 1) # Find next backslash when ++N == 1
    {
      $i = index($_[0], "\\", $i); # Find next backslash
      if ($i < 0) { $OUTPUT .= substr($_[0], $START); last; }
      if ($i > $START)
      {
        $OUTPUT .= substr($_[0], $START, $i - $START);
        $START = $i;
      }
      next;
    }

    if ($N == 2) # Read first character following a backslash:
    {
      $EXPECT = 0;      # Maximum number of digits we're expecting
      $LEN = 0;         # How many digits we got so far
      $OCT = 1;         # 0=Hexadecimal number, 1=Octal number
      $START = $i + 1;  # digits begin with the next character

      my $C = substr($_[0], $i, 1);
      my $P = index('01234567abtnvfrexuU', $C); # Escape codes
      if ($P < 0)      { $N = 0; $OUTPUT .= $C; }             # Write Literal
      elsif ($P < 8)   { $EXPECT = 3; $LEN = 1; $START--; }   # 0-7 octal
      elsif ($P < 15)  { $N = 0; $OUTPUT .= chr($P - 1); }    # abtnvfr
      elsif ($P == 15) { $N = 0; $OUTPUT .= "\x1B"; }         # e
      elsif ($P == 16) { $EXPECT = 2; $OCT = 0; }             # x
      elsif ($P == 17) { $EXPECT = 4; $OCT = 0; }             # u
      elsif ($P == 18) { $EXPECT = 8; $OCT = 0; }             # U

      # The following next statement must be conditional, because
      # if the string ends with an incomplete octal number such as
      # "\0" we must trickle down and write it to output instead
      # of trying to reach for the next non-existent digit:
      $OCT && $i == $LAST or next;
    }

    # Subsequent characters following a backslash are processed here:
    if ($EXPECT)
    {
      my $C = substr($_[0], $i, 1);
      my $P = index('0123456789ABCDEFabcdef', $C); # Check digits
      if ($P >= 0) { $LEN++ } # Count it if it's a valid digit.

      # If we encounter an 'x' or digit '8' while reading an octal
      # number, it signals the end of the number. Or if we're
      # reading a hexadecimal number such as \x3 but then it's
      # immediately followed by the letter 'z' then we know that
      # '3' is the only digit we got.

      # $UNEXPECTED will be true if we got an unexpected character:
      my $UNEXPECTED = ($P < 0 || ($OCT && $P > 7));

      # $END_OF_SEQ will be true if we either encountered an
      # unexpected character OR we have read all of expected digits
      # OR we have reached the end of the input string:
      my $END_OF_SEQ = $UNEXPECTED || $LEN == $EXPECT || $i == $LAST;

      if ($END_OF_SEQ)
      {
        # If "\x" is immediately followed by something other than
        # hexadecimal digits, then we abandon ship and ignore it.
        # So, here we check $LEN to see if we got any valid digits
        # so far. If not, then we don't have to write anything.

        #print "\n<$LEN> ", substr($_[0], $START, $LEN);
        #print " $C = substr($_[0], $START, $LEN);

        if ($LEN)
        {
          $C = substr($_[0], $START, $LEN);
          $C = ($OCT) ? oct($C) : hex($C);

          # "\xFF" produces one byte, but "\u1234" will produce a
          # 2-byte output in big-endian order, and "\U12345678"
          # will produce a 4-byte output.
          $OUTPUT .= pack(substr('CCnnN', ($EXPECT >> 1), 1), $C);
        }
        $START = $i;
        $EXPECT = $OCT = $LEN = $N = 0;
        if ($UNEXPECTED) { $i--; } else { $START++; }
      }
    }
  }
  return $OUTPUT;
}

##################################################
# String | v2022.11.14
# This function prints the contents of a string
# in hexadecimal format and plain text along with
# the address. A second argument may be provided
# to limit the number of bytes to be printed.
#
# Usage: HexDump(STRING, [LIMIT])
#
sub HexDump
{
  defined $_[0] or return 0;
  my $LIMIT = defined $_[1] ? $_[1] : length($_[0]);
  if ($LIMIT > length($_[0])) { $LIMIT = length($_[0]); }
  $| = 1;
  my $PTR = 0;
  my $ROWS = int(($LIMIT + 15) / 16);
  while ($ROWS--)
  {
    my $LINE = sprintf("\n %0.8X:", $PTR) . (' ' x 69);
    my ($CP, $NP, $CC) = (63, 13, 16);
    while ($CC--)
    {
      my $c = vec($_[0], $PTR++, 8);
      substr($LINE, $NP, 2) = sprintf('%0.2X', $c);
      vec($LINE, $CP++, 8) = ($c < 32 || $c > 126) ? 46 : $c;
      $NP += 3;
      if ($CC == 7) { vec($LINE, 36, 8) = 45; }
      if ($PTR >= $LIMIT) { $ROWS = 0; last; }
    }
    print $LINE;
  }
  print "\n";
  return 1;
}
##################################################



The above perl script produces the following:




\r\n\t\tHello\tWorld\uFF00\0\n\n\!!\"--->\xf3\"abc\x4\2\8\2


 00000000:  0D 0A 09 09 48 65 6C 6C-6F 09 57 6F 72 6C 64 FF   ....Hello.World.
 00000010:  00 00 0A 0A 21 21 22 2D-2D 2D 3E F3 22 61 62 63   ....!!"--->."abc
 00000020:  04 02 38 02                                       ..8.

Replies are listed 'Best First'.
Re: C style strings
by jwkrahn (Abbot) on Dec 31, 2022 at 18:47 UTC

    Back when I was using C a lot (before Unicode) strings were just arrays of char which are 8 bit unsigned integers.

    So a 12 character string like "Hello world\n" is the 13 element array 72 101 108 108 111 32 119 111 114 108 100 10 0
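To make the "array of char" view concrete, here is a minimal Perl sketch that lists the byte values a C compiler would store for that literal, assuming a single-byte ASCII execution character set. The helper name c_bytes is made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: the byte values of a plain C string literal,
# including the terminating NUL that C appends implicitly.
sub c_bytes {
    my ($str) = @_;
    return unpack 'C*', $str . "\0";
}

my @bytes = c_bytes("Hello world\n");
print scalar(@bytes), " elements: @bytes\n";
```

The explicit "\0" is appended by hand because Perl strings carry their own length and have no terminator.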

    The Linux ascii man page lists the C escape sequences:

        007   7     07    BEL '\a' (bell)
        010   8     08    BS  '\b' (backspace)
        011   9     09    HT  '\t' (horizontal tab)
        012   10    0A    LF  '\n' (new line)
        013   11    0B    VT  '\v' (vertical tab)
        014   12    0C    FF  '\f' (form feed)
        015   13    0D    CR  '\r' (carriage ret)
    

    Which are the same in Perl except for \v which Perl doesn't use.

    Naked blocks are fun! -- Randal L. Schwartz, Perl hacker

      G'day jwkrahn,

      The following is just a point of clarification; it's not intended as a correction.

      "Which are the same in Perl except for \v which Perl doesn't use."

      Actually, Perl does use \v; you can't use it to insert a VT, but you can use it to detect one. See "perlrebackslash: All the sequences and escapes":

      ... \v Match any vertical whitespace character. ...

      This was introduced in Perl v5.10: "perl5100delta: Vertical and horizontal whitespace, and linebreak".

      I don't know for certain, but I imagine this is unavailable in the "TinyPerl 5.8" being used by the OP.

      $ perl -we 'print "abc\vdef\n"'
      Unrecognized escape \v passed through at -e line 1.
      abcvdef
      $ perl -we 'print "abc\x{0b}def\n"'
      abc def

      — Ken
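The distinction between inserting and detecting a VT can be sketched as follows. This assumes Perl 5.10 or later (so probably not TinyPerl 5.8), and count_vertical_ws is a name invented for the example.

```perl
use 5.010;    # \v inside a regex needs Perl 5.10 or later
use strict;
use warnings;

# Insert a vertical tab by code point, since "\v" is not a string escape.
my $with_vt = "abc\x{0b}def\n";

# Hypothetical helper: count vertical whitespace characters via regex \v.
sub count_vertical_ws {
    my ($str) = @_;
    my $count = () = $str =~ /\v/g;
    return $count;
}

say count_vertical_ws($with_vt);
```

Note that \v in a regex matches any vertical whitespace, so both the VT and the trailing newline are counted here.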

Re: C style strings
by LanX (Saint) on Dec 31, 2022 at 10:59 UTC
    I'm not sure how you define C-style string, do you have a proper specification?

    Anyway, if your main worry is that Perl is doing interpolation, you could try to escape the $ and @ in the string

    Though I have to admit I had problems with your full string so I shortened it...

    This is only a stub; one would probably also have to deal with already escaped $ and @.

    use v5.12;
    use warnings;
    use Data::Dump qw/pp dd/;

    my $scl = "XXX";
    my @arr = 1..3;

    # my $raw = <<'___';
    # \r\n\t\tHello\tWorld\uFF00\\0\\n\\n\\!!\\"--->\\xf3\\"abc\\x4\\2\\8\\2
    # ___

    my $raw = <<'___';
    \r\n\t\tHello\tWorld
    $scl@arr
    ___

    chomp $raw;
    dd $raw;

    $raw =~ s/(\$|\@)/\\$1/g;

    my $str = eval qq("$raw");
    dd $str;

    # probably better done with unpack
    printf "%02X ",ord($_) for split //,$str;

    "\\r\\n\\t\\tHello\\tWorld\n\$scl\@arr"
    "\r\n\t\tHello\tWorld\n\$scl\@arr"
    0D 0A 09 09 48 65 6C 6C 6F 09 57 6F 72 6C 64 0A 24 73 63 6C 40 61 72 72

    Cheers Rolf
    (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
    Wikisyntax for the Monastery
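The "already escaped $ and @" case mentioned above could be handled roughly like this. It is only a sketch: expand_via_eval is a hypothetical name, and the approach still hands the text to eval, so it is unsafe for untrusted input (an unescaped double quote in the input would end the string and let the rest run as code).

```perl
use strict;
use warnings;

# Hypothetical helper: protect Perl sigils, then let Perl expand the escapes.
sub expand_via_eval {
    my ($raw) = @_;

    # Escape $ and @ only when preceded by an even number of backslashes
    # (zero included); an odd count means the sigil is already escaped.
    $raw =~ s/(?<!\\)((?:\\\\)*)([\$\@])/$1\\$2/g;

    my $str = eval qq("$raw");
    die "eval failed: $@" if $@;
    return $str;
}

my $decoded = expand_via_eval('\tHello\tWorld $scl@arr');
printf "%02X ", ord($_) for split //, $decoded;
print "\n";
```

The lookbehind plus the pair-matching group is what distinguishes `\$` (leave alone) from `\\$` (an escaped backslash followed by a bare sigil).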

      I should have mentioned that I am trying to parse C source code and extract only the values. But the eval solution looks really clever! My only concern is: what exactly are the differences between the C and Perl specifications? Just the $ and the @? That's all? I was trying to go by the Wikipedia description of C escape sequences and how they work.
        > Just the $ and the @ ? That's all?

        I don't know, but it's unlikely.

        I suppose Perl has augmented its "escapism" over the versions.

        More importantly, what are the specifications for C and do they also depend on the version?

        For instance, I had problems with \u, and once Unicode inconsistencies come into play the problems will multiply.

        Cheers Rolf
        (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
        Wikisyntax for the Monastery
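A few differences can be checked directly. In Perl's double-quoted strings, \u and \U are case modifiers rather than Unicode escapes, \x without braces takes at most two hex digits (C's \x consumes every hex digit that follows), and \v is not recognised. The sketch below, with the made-up helper perl_expand, shows what Perl makes of such sequences; it is not a complete list of differences.

```perl
use strict;
use warnings;

# Hypothetical helper: expand a sequence as a Perl double-quoted string would.
sub perl_expand {
    my ($raw) = @_;
    no warnings;    # silence "Unrecognized escape \v passed through"
    return eval qq("$raw");
}

for my $raw ('\uff00', '\Uabc', '\x414', '\v') {
    my $hex = join ' ', map { sprintf '%02X', ord } split //, perl_expand($raw);
    printf "%-8s => %s\n", $raw, $hex;
}
```

So passing C escapes through eval needs more than protecting $ and @: at least \u, \U, long \x sequences and \v would have to be rewritten first.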

      «…do you have a proper specification?»

      Since I had none at hand, I took the one from here:

      "In C, the string data type is modeled on the idea of a formal string (Hopcroft 1979):

      Let Σ be a non-empty finite set of characters, called the alphabet.
      A string over Σ is any finite sequence of characters from Σ.
      For example, if Σ = {0, 1}, then 01011 is a string over Σ."

      Seacord, R. C. (2020). Effective C: An Introduction to Professional C Programming. No Starch Press.

      Regards, Karl

      «The Crux of the Biscuit is the Apostrophe»

        Well I was referring to Escape sequences in C, i.e. specifically in "C-style strings" °

        NB: the WP article doesn't list many references, other pages talk about variations depending on the compiler.

        Another point seems to be the distinction between various C versions like C99 which introduced \uXXXX ...

        Cheers Rolf
        (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
        Wikisyntax for the Monastery

        °) not the mathematical definition of a byte vector.
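On the original endianness question: a \uXXXX or \UXXXXXXXX sequence in C99 and later is a universal character name, i.e. it names a code point. The bytes that end up in the program depend on the literal type and the execution character set, so byte order is not fixed by the escape itself. A small sketch with the core Encode module shows one code point producing different byte sequences; the helper name ucn_bytes is an assumption of this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: bytes of one code point (given as hex digits)
# in a chosen encoding, formatted as space-separated hex pairs.
sub ucn_bytes {
    my ($hex_digits, $encoding) = @_;
    my $char = chr hex $hex_digits;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', encode($encoding, $char);
}

for my $encoding (qw(UTF-8 UTF-16BE UTF-16LE)) {
    printf "%-8s %s\n", $encoding, ucn_bytes('20AC', $encoding);
}
```

A decoder that wants to mirror a particular compiler therefore has to pick the target encoding explicitly rather than choosing between big-endian and little-endian packing.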

Re: C style strings
by Anonymous Monk on Dec 31, 2022 at 09:12 UTC
Re: C style strings
by karlgoethebier (Abbot) on Dec 31, 2022 at 16:38 UTC

    I don't know if Convert::Binary::C is what you are looking for. But maybe it's at least somewhat inspiring - who knows.

    «The Crux of the Biscuit is the Apostrophe»

Re: C style strings
by kcott (Archbishop) on Jan 01, 2023 at 02:29 UTC

    G'day harangzsolt33,

    You mentioned that you'd used a Wikipedia description, but gave no details. The following is based on "Wikipedia: Escape sequences in C: Table of escape sequences" and "Wikipedia: Digraphs and trigraphs: C".

    In many cases, no conversion is required: Perl & C use the same escapes (\n, \t, and so on). For the remainder, the following code should, I believe, cover all cases.

    #!/usr/bin/env perl

    use strict;
    use warnings;

    my $c_str = q{BEL \a BS \b ESC \e FF \f NL \n CR \r TAB \t};
    $c_str .= q{ VT \v BSLASH \\\\ APOS \' QUOT \" HASH \43};
    $c_str .= q{ DOLLAR \x24 AT \u0040 TILDE \U0000007e};
    $c_str .= q{ TGs \?= \?/ \?' \?( \?) \?! \?< \?> \?-};

    print "\$c_str[$c_str]\n";

    my $p_str = c2p($c_str);

    print "\$p_str[$p_str]\n";

    {
        my %trigraph;
        BEGIN {
            no warnings 'qw';
            %trigraph = qw{= # / \ ' ^ ( [ ) ] ! | < { > } - ~};
        }

        sub c2p {
            my ($str) = @_;

            $str =~ s/\\U([0-9A-Fa-f]{8})/\\x{$1}/g;
            $str =~ s/\\u([0-9A-Fa-f]{4})/\\x{$1}/g;
            $str =~ s/\\x([0-9A-Fa-f]+)/\\x{$1}/g;
            $str =~ s/\\([0-7]{1,3})/0$1/g;
            $str =~ s/\\\?([=\/'\(\)!<>-])/$trigraph{$1}/g;
            $str =~ s/\\e/\\c[/g;
            $str =~ s/\\v/\\x{0b}/g;

            return $str;
        }
    }

    As we've discussed previously, you're using "TinyPerl 5.8". I've kept the code as simple as possible to accommodate that; but I've no way of testing it.

    Output:

    $c_str[BEL \a BS \b ESC \e FF \f NL \n CR \r TAB \t VT \v BSLASH \\ APOS \' QUOT \" HASH \43 DOLLAR \x24 AT \u0040 TILDE \U0000007e TGs \?= \?/ \?' \?( \?) \?! \?< \?> \?-]
    $p_str[BEL \a BS \b ESC \c[ FF \f NL \n CR \r TAB \t VT \x{0b} BSLASH \\ APOS \' QUOT \" HASH 043 DOLLAR \x{24} AT \x{0040} TILDE \x{0000007e} TGs # \ ^ [ ] | { } ~]

    I haven't used trigraphs beyond reading about them long ago. The output I've produced (TGs ...) seems reasonable; however, I wasn't sure how you wanted to handle these. Do feel free to get alternative advice in this area.

    What I've provided is a direct conversion; e.g. "VT \v" to "VT \x{0b}". If you instead wanted "VT <actual vertical tab character>", just remove one backslash: i.e. change "s/\\v/\\x{0b}/g" to "s/\\v/\x{0b}/g". I'm pretty sure that will work for most of the substitutions. For octal elements, change "s/\\([0-7]{1,3})/0$1/g" to "s/\\([0-7]{1,3})/chr oct $1/eg". You'll also need to add substitutions like "s/\\n/\n/g". Given I don't even know if that's what you want, I didn't spend much time looking into that aspect.

    — Ken
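One caveat when chaining separate substitutions to produce actual characters: an escaped backslash followed by a letter (\\n in the C source) can be misread by a later s/\\n/\n/g. A single-pass substitution avoids that, because each backslash is consumed together with whatever follows it. The following is a sketch under that assumption; c_unescape and %simple are names invented here, trigraphs are left out, and code points above 255 come back as Perl characters rather than encoded bytes.

```perl
use strict;
use warnings;

# Single-character escapes and the characters they stand for.
my %simple = (
    a => "\a", b => "\b", e => "\e", f => "\f",
    n => "\n", r => "\r", t => "\t", v => "\x0B",
);

# Hypothetical helper: decode C escapes in one left-to-right pass.
sub c_unescape {
    my ($str) = @_;
    $str =~ s{\\(?:([0-7]{1,3})|x([0-9A-Fa-f]+)|u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8})|(.))}{
          defined $1 ? chr oct $1          # octal, up to three digits
        : defined $2 ? chr hex $2          # hex, all following hex digits
        : defined $3 ? chr hex $3          # \uXXXX code point
        : defined $4 ? chr hex $4          # \UXXXXXXXX code point
        : exists $simple{$5} ? $simple{$5} # named escape
        : $5                               # anything else: keep the character
    }gse;
    return $str;
}

print join(' ', map { sprintf '%02X', ord } split //, c_unescape('\tHi\x21\101\\\\n')), "\n";
```

Unknown escapes such as \" or \' fall through to the last branch and yield the character itself, which matches how the original DecodeCString treats them.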

Re: C style strings
by NERDVANA (Priest) on Jan 02, 2023 at 04:51 UTC
    For parsing the entire C language, you should probably read up on the different parsing algorithms and tools available, for instance LR(1) table parsers or recursive descent parsers. (The latter are much easier to understand and write, but can get fairly verbose and slow for a language as complex as C.)

    But as for the best way to do it in Perl: you should get more familiar with the regex engine. I'm not going to prove or disprove this, but it would not surprise me at all if it were possible to parse the entire C language with a single regex. The main reasons to use regexes are speed and (as counter-intuitive as this may sound) code clarity. The regex will end up being awkward and ugly and hard to understand, but not nearly as hard to understand as the sort of code you're writing above when expanded out to the entire C language. As for speed, procedural Perl that iterates character by character and inflates each one into a string is going to be extremely slow, whereas the regex engine is fairly fast, even compared to C.

    So, here is an example that parses C strings (according to your code; I didn't read the spec) using nothing but a regular expression:

    Note that the /x flag on a regex lets you add whitespace and comments.

    use v5.36;

    my %named_char= (
       a => "\a", b => "\b", t => "\t", n => "\n", v => chr(11), f => "\f",
       r => "\r", e => "\e"
    );

    my $line= <STDIN>;
    if ($line =~ /
       ^"(                          # starts with doublequote
          (?>                       # no backtracking
             [^"\\]+                # allow any character other than \\ or "
             | \\[^xuU0-7]          # escaped character other than \x \u \U \0-7
             | \\[0-7]+             # octal escape
             | \\x[0-9a-fA-F]{1,2}  # hex escape
             | \\u[0-9a-fA-F]{1,4}  # unicode escape
             | \\U[0-9a-fA-F]{1,8}  #
          )*                        # repeat as needed
       )"                           # stop at next doublequote
       /x
    ) {
       my $literal= $1;
       my $op;
       $literal =~ s/\\(
          [^xuU0-7]           (?{ $op=sub{ $named_char{$1} || $1 } })
          | [0-7]+            (?{ $op=sub{ chr oct $1 } })
          | x[0-9a-fA-F]{1,2} (?{ $op=sub{ chr hex substr $1,1 } })
          | u[0-9a-fA-F]{1,4} (?{ $op=sub{ chr hex substr $1,1 } })
          | U[0-9a-fA-F]{1,8} (?{ $op=sub{ chr hex substr $1,1 } })
          ) / &$op /xge;
       use DDP;
       p($literal);
    }

    That last bit uses special features of the Perl regex engine (see perldoc perlre). The syntax (?{ ... }) runs a bit of Perl at the moment the regex engine successfully matches up to it. In this case it just changes which subroutine is stored in $op, which doesn't execute until the pattern finishes matching. When it does, the regex /e switch causes it to evaluate &$op, which runs the subroutine stored in $op and returns a string that replaces the escape sequence. The /g switch replaces all occurrences in the string, and then your literal is ready to use. I print it out with Data::Printer (shorthand DDP).
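The interplay of (?{ ... }) and /e may be easier to see on a toy input. The sketch below assumes Perl 5.18 or later (where lexical variables inside regex code blocks behave as ordinary closures) and uses a made-up function, tag_numbers, that records which alternative matched and then uses that record in the replacement.

```perl
use 5.018;    # assumes a Perl where (?{ }) blocks close over lexicals properly
use strict;
use warnings;

# Hypothetical helper: tag one- and two-digit numbers differently.
sub tag_numbers {
    my ($text) = @_;
    my $kind;
    $text =~ s{
        (?: (\d\d) (?{ $kind = 'two' })   # runs only if two digits matched
          | (\d)   (?{ $kind = 'one' })   # runs only if a single digit matched
        )
    }{ "<$kind:" . (defined $1 ? $1 : $2) . ">" }xge;
    return $text;
}

say tag_numbers('a1b22');
```

The code block only records a decision while matching; the actual replacement text is computed afterwards by the /e expression, which is the same division of labour as $op and &$op above.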