rodent has asked for the wisdom of the Perl Monks concerning the following question:

Anyone remember the old Unix xstr command? Basically this could extract string literals from a piece of C source code, and put them in a separate header file, replacing the string literals with references to a string array in the header file.

I've been digging high and low for a Perl module capable of parsing and modifying C source code "on the fly" and generating a new .c file for use during compilation. I use Perl quite regularly in my build process, and I thought it would easily be up to the task...

Essentially I'm trying to reproduce the behaviour of the old unix xstr command, but have found VERY little in terms of useful code.

At the moment, I'm using a while(<FILE>) and a silly regex to try to extract and replace the strings, but I'm not getting all syntactical cases of string literals in the C code.

Has anyone attempted this before, or does anyone have a port of xstr to Perl, or know of a decent regex or Perl module that is up to this task? I've dug through CPAN but without much success...

Replies are listed 'Best First'.
Re: Regex to extract/modify C source string literals? (cheap)
by tye (Sage) on May 29, 2003 at 16:48 UTC

    To successfully deal with strings in C code, you'll also need to deal with a few non-string constructs. In this case, the list is rather small since C strings are always delimited by double quotes ("), so you only need to deal with valid C constructs that might contain such a character.

    Characters: '"'
    Comments: /* " */
    C++ comments: // "

    So you can write a cheap little parser to pull out quoted strings:

    my $parser= qr{
        \G  # Don't skip anything
        (?:
            [^'"/]+                 # Stuff we don't care about
        |   '(?:[^\\']+|\\.)'       # '"', '\'', '\\', 'param'
        |   /\* .*? \*/             # A C comment
        |   //[^\n]+                # A C++ comment
        |   /                       # /, not a comment, division
        |   "((?:[^\\"]+|\\.)*)"    # A quoted string ($1)
        |   (.)                     # An error ($2)
        )
    }xs;
    my $code= do { local($/); <CCODE> };
    my @strings;
    while(  $code =~ m/$parser/g  ) {
        if(  defined $1  ) {
            push @strings, $1;
        } elsif(  defined $2  ) {
            my $char= $2;
            my $pos= pos($code)-5;
            $pos= 0   if  $pos < 0;
            my $context= substr( $code, $pos, 10 );
            warn "Ignoring unexpected character ($char) in ($context)";
        }
    }
    Then you can extend that to replace strings as well.

    Update: Enlil was kind enough to point out that '[^']+' won't match '\''. I replaced that part. Note that I support the rather strange:

    #define ctrl(char) ( 'char' & 31 )
    which I can't recall whether ANSI C officially allowed or disallowed. (:

    And here is a hint at how to extend it to support replacing strings:

    #!/usr/bin/perl -p0777 -i.org
    my $parser;
    BEGIN {
        $parser= qr{
            \G  # Don't skip anything
            (
                [^'"/]+                 # Stuff we don't care about
            |   '(?:[^\\']+|\\.)'       # '"', '\'', '\\', 'param'
            |   /\* .*? \*/             # A C comment
            |   //[^\n]+                # A C++ comment
            |   /                       # /, not a comment, division
            |   "((?:[^\\"]+|\\.)*)"    # A quoted string ($2)
            |   (.)                     # An error ($3)
            )                           # Entire match ($1)
        }xs;
    }
    s{$parser}{
        if(  defined $3  ) {
            my $char= $3;
            # $+[0] is the offset just past the current match
            my $pos= $+[0]-5;
            $pos= 0   if  $pos < 0;
            my $context= substr( $_, $pos, 10 );
            warn "Ignoring unexpected character ($char) in ($context)";
        }
        if(  defined $2  ) {
            my $string= $2;     # contents only, without the quotes
            #... manipulate $string ...
            $string;
        } else {
            $1;
        }
    }ge;
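As a rough illustration of where the "manipulate $string" step could lead for the original xstr-style question, here is a minimal sketch. It is not tye's code: the helper name extract_strings, the array name xstr, and the sample C input are all assumptions made up for the example, and the tokenizer is slightly relaxed so that empty // comments and multi-character escapes in character constants pass through quietly.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tokenizer in the same spirit as the parser above.  Capture 1 is the
# whole token, capture 2 is the body of a string literal.
my $token = qr{
    \G
    (
        [^'"/]+                 # anything that cannot start a special token
      | '(?:[^\\']+|\\.)*'      # character constant
      | /\* .*? \*/             # C comment
      | //[^\n]*                # C++ comment
      | /                       # a lone slash, i.e. division
      | "((?:[^\\"]+|\\.)*)"    # string literal, body in capture 2
      | .                       # anything else passes through unchanged
    )
}xs;

# Hypothetical helper: returns the rewritten source plus an array ref of
# the literal bodies.  The array name "xstr" is an assumption.
sub extract_strings {
    my ($code) = @_;
    my ( @strings, %index );
    $code =~ s{$token}{
        if ( defined $2 ) {
            my $body = $2;
            # Reuse the slot when the same literal shows up again.
            $index{$body} = push( @strings, $body ) - 1
                unless exists $index{$body};
            "xstr[$index{$body}]";
        }
        else {
            $1;
        }
    }ge;
    return ( $code, \@strings );
}

# Made-up sample input covering the special cases listed earlier.
my $c = <<'END_C';
/* "not a string" */
char q = '"';
puts("hello\n"); // "nor this"
puts("hello\n"); puts("a \"b\"");
END_C

my ( $new, $strings ) = extract_strings($c);
print $new;
print "char *xstr[] = {\n", map( qq{    "$_",\n}, @$strings ), "};\n";
```

The literal bodies are kept exactly as they appear in the source (escapes included), so they can be written back into the generated array unchanged. Whether duplicate literals should share a slot is a design choice; it is done here only to keep the table small.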

                    - tye


Re: Regex to extract/modify C source string literals?
by educated_foo (Vicar) on May 29, 2003 at 14:31 UTC
    Regexp::Common has a regex for quoted strings ($RE{quoted}) which you should be able to tweak to do this. Note that you'll want to put your filehandle in slurp mode, since the C may have multi-line string literals (gcc < 3.3 allows this). Also, you'll need to handle implicit concatenation of adjacent literals.
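The two points above (slurp mode and adjacent literals) can be sketched with core Perl alone. This is only an illustration under simplifying assumptions: the helper name merge_adjacent_literals and the sample input are invented, the in-memory filehandle stands in for a real source file, and the pass ignores comments and character constants, so it would need to be combined with a proper tokenizer before use on real code.

```perl
use strict;
use warnings;

# One C string literal, quotes included.
my $literal = qr/"(?:[^\\"]+|\\.)*"/s;

# Hypothetical helper: merge runs of adjacent literals the way the C
# compiler would, so "foo" "bar" becomes "foobar".
# Caveat: naive joining can change meaning when a literal ends in a hex
# or octal escape and the next one starts with a digit, e.g. "\x1" "2".
sub merge_adjacent_literals {
    my ($code) = @_;
    $code =~ s{($literal(?:\s*$literal)+)}{
        my $run    = $1;
        my @bodies = $run =~ /"((?:[^\\"]+|\\.)*)"/gs;
        '"' . join( '', @bodies ) . '"';
    }ge;
    return $code;
}

# Made-up input; an in-memory filehandle stands in for the real file.
my $src = <<'END_C';
const char *msg = "first line, "
                  "second line";
END_C
open my $fh, '<', \$src or die "cannot open in-memory file: $!";

# Slurp mode: with $/ undefined, one read returns the whole file, so a
# run of literals that spans several lines is seen in one piece.
my $code = do { local $/; <$fh> };
close $fh;

print merge_adjacent_literals($code);
```

Reading line by line would present the two halves of msg to the regex separately, which is one reason a while(<FILE>) loop misses cases like this.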

    /s