C parsing questions

Nkuvu has asked for the wisdom of the Perl Monks concerning the following question:

I have a working script that is parsing some C source files for variable declarations and default values. The majority of the variable types are simple, such as:

int one_variable = 32;
int another_variable = 35;
float some_float;

Note that any variable that doesn't have an explicit initializer is set to 0/false in another area of the code. The tricky part is the number of structs that have to be caught.

In the code, they have structs such as

typedef struct { 
    float one;
    float two;
    int   three;
    bool  potato;
} struct_name;

I need to create verification lines for each member of the struct. Our call to the (templated) verify function is pretty simple, with an identifying string, actual and expected values. So given the above struct I'd have four calls to verify, looking like:

verify("foobar.one = 0.0", foobar.one, (float)0.0);
verify("foobar.two = 0.0", foobar.two, (float)0.0);
verify("foobar.three = 0", foobar.three, (int)0);
verify("foobar.potato = false", foobar.potato, (bool)false);

(assuming the definition of foobar is listed in the source as struct_name foobar;)

This is all in C, obviously, and here is where the Perl comes into play. I have defined a hash of arrays of arrays for the structs, and a simple hash for the typical data types. So I have:

my %default_values = ( 'int' => '0',
                       'float' => '0.0',
                       'bool' => 'false' );
# Oooog:
my %structs = (
               'struct_name' =>
               [
                [ 'float', 'one', '0.0' ],
                [ 'float', 'two', '0.0' ],
                [ 'int', 'three', '0' ],
                [ 'bool', 'potato', 'false' ]
               ]
              );

# more stuff here, then when I am checking the variable type:

if (exists $default_values{$type}) {
    print $output "        verify(\"$short_var = $default_values{$type}\",\n";
    print $output "               $variable"."[i],\n";
    print $output "               ($type)$default_values{$type});\n";
}
elsif (exists $structs{$type}) {
    my $array_ref = $structs{$type};
    foreach my $object_ref (@{$array_ref}) {
        print $output "    verify(\"$short_var.@{$object_ref}[1] = @{$object_ref}[2]\",\n";
        print $output "           $variable.@{$object_ref}[1],\n";
        print $output "           (@{$object_ref}[0])@{$object_ref}[2]);\n";
    }
}

Some notes: I'm using strict and warnings. This is working fine; variables not explicitly mentioned in this note (like $short_var and $variable) have been defined elsewhere. All data has been sanitized to protect the innocent.

So I'm trying to think of a way to clean this up so that it's more maintainable, and so that someone looking at it who doesn't know Perl might have some possibility of seeing what's going on. This may include more comments, reorganization of the code, or (what I suspect) a different way to store information about each struct. I was thinking briefly about having a single hash with all of the data types that we use defined, whether they're C primitives, structs, or something else. But with primitive data types getting a different format of the verify call (in addition to the varying number of verify calls) it seemed easier to just have two hashes. The icing on the cake is that some of the declarations are arrays, so I need to make sure to take that into account.
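One possible way to keep the two hashes while removing the duplicated print statements is to route both branches through a single formatting helper. The following is only a sketch under assumptions: format_verify and verify_lines are hypothetical names, the tables are the sample data from above, each call is emitted on one line rather than three, and array declarations (the [i] case) are not handled.

```perl
use strict;
use warnings;

# Sample tables, copied from the snippets above.
my %default_values = ( int => '0', float => '0.0', bool => 'false' );
my %structs = (
    struct_name => [
        [ 'float', 'one',    '0.0'   ],
        [ 'float', 'two',    '0.0'   ],
        [ 'int',   'three',  '0'     ],
        [ 'bool',  'potato', 'false' ],
    ],
);

# Hypothetical helper: builds one verify() call as a string.
sub format_verify {
    my ($label, $expr, $type, $expected) = @_;
    return qq{verify("$label = $expected", $expr, ($type)$expected);};
}

# Hypothetical helper: returns every verify() line for one declaration.
sub verify_lines {
    my ($short_var, $variable, $type) = @_;

    # Primitive type: a single call.
    if (exists $default_values{$type}) {
        return format_verify($short_var, $variable, $type,
                             $default_values{$type});
    }

    # Struct type: one call per member, in declaration order.
    if (exists $structs{$type}) {
        return map {
            my ($member_type, $member_name, $expected) = @$_;
            format_verify("$short_var.$member_name",
                          "$variable.$member_name",
                          $member_type, $expected);
        } @{ $structs{$type} };
    }

    return;    # unknown type: emit nothing
}

print "$_\n" for verify_lines('foobar', 'foobar', 'struct_name');
```

With the sample data this prints the four verify calls shown earlier for foobar. The output format then lives in one place, so a reader who doesn't know Perl only has to understand format_verify to see what ends up in the test file.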

As I said, I have something working. But in case these values change, I want to have this script available to other people, and therefore I need to make it a lot clearer how it works. The brief snippets shown here are a distilled sample of the "I meant this as a throwaway script" mess.

One idea that's about half-baked in my head is to have the %structs hash be a hash of hashes, with struct_name as the primary key (for lack of a better term), each member of said struct next, followed by its data type. The problem there is that I would like the order of the members to be the same as in the header file, and a hash doesn't keep its keys in order. But any suggestions on how to simplify the data types or clean up the code would be vastly appreciated.
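A middle ground between the two layouts might be a hash of arrays of hashes: the outer array keeps the header-file order, while each member becomes a small hash with named fields instead of positional ones. This is a minimal sketch; the key names type, name and default are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical layout: array order mirrors the header file,
# and each member is a hash with named fields.
my %structs = (
    struct_name => [
        { type => 'float', name => 'one',    default => '0.0'   },
        { type => 'float', name => 'two',    default => '0.0'   },
        { type => 'int',   name => 'three',  default => '0'     },
        { type => 'bool',  name => 'potato', default => 'false' },
    ],
);

# Members come back in declaration order, and the field names
# document themselves at the point of use.
for my $member (@{ $structs{struct_name} }) {
    print "$member->{type} $member->{name} defaults to $member->{default}\n";
}
```

Compared with @{$object_ref}[1], an expression like $member->{name} is probably easier to follow for someone who doesn't know Perl.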

One final thing to mention is that I've defined the default values for ints and floats as strings ('0' and '0.0', respectively). I did this because I want to make sure that in the verify call it has 0 or 0.0 to match the data type. This isn't a huge deal, since I'm casting the expected value anyway and 0 is the same as 0.0; I just did it for clarity in the output test file. Comments on this are also welcome -- I'm not sure if it's a huge boon to have potential calls like verify("foo = 0.0", foo, (float)0.0) versus something like verify("foo = 0", foo, (float)0).
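For what it's worth, there is a practical side to the string defaults: when Perl interpolates a numeric 0.0 into a string it prints it as 0, so the 0.0 spelling only survives into the generated file if the default is stored as text. A short demonstration (the variable names are just for illustration):

```perl
use strict;
use warnings;

my $numeric = 0.0;      # stored as a number
my $string  = '0.0';    # stored as text

print "numeric: $numeric\n";    # prints "numeric: 0"
print "string:  $string\n";     # prints "string:  0.0"
```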


Re: C parsing questions by jmcnamara (Monsignor) on Nov 28, 2005 at 22:05 UTC
I'm not sure if it addresses your exact needs, since it deals with parsing declarations rather than values; however, you may find the following module useful: Convert::Binary::C. -- John.
Re^2: C parsing questions by Nkuvu (Priest) on Nov 29, 2005 at 00:30 UTC
The module does look very useful, but unfortunately I'm not able to install modules for my scripts to use. Well, more of a guideline than a requirement. The scripts I'm writing will be used on a variety of different machines, which may or may not have anything other than core modules installed (it's almost always limited to just core modules, using ActiveState Perl). So if a module is pure Perl, I can copy the relevant sections into my script (making sure to note where the code came from, of course) and make it stand-alone. But anything that requires a compiler to install is a pretty good guarantee that my script won't be usable by anyone else.
Re^3: C parsing questions by davidrw (Prior) on Nov 29, 2005 at 02:28 UTC
What about packaging the dependencies with PAR?
Re: C parsing questions by GrandFather (Saint) on Nov 28, 2005 at 22:06 UTC
Your current HoA data structure looks appropriate. However, for the use of others you might like to put the structure definitions in a __DATA__ section and build the internal representation at run time. That way new structures can be pretty much just copied and pasted onto the end of the script, with the default supplied or not as required.

__DATA__
typedef struct {
    float one;     /* 0.0 */
    float two;     /* 0.0 */
    int   three;   /* 0 */
    bool  potato;  /* false */
} struct_name;

DWIM is Perl's answer to Gödel
Re^2: C parsing questions by Nkuvu (Priest) on Nov 29, 2005 at 00:22 UTC
I really like this approach, thanks for suggesting it. I've had to go in and modify some Python scripts to match changes in the source code, and it's usually quite a pain to find out where things are being set/called/parsed/whatever. Anything I can do to make it easier will definitely be appreciated by non-Perl programmers.

Of course since it's Monday (that's my excuse and I'm sticking to it) I still feel like I'm writing very messy code. Any suggestions on the added sub would be appreciated.

Also note that the header files are auto-generated by a tool we're using, so the format of the struct definitions is always the same. And my subroutine takes this into account -- it fails horribly with comments in the code, but works just fine for the "live" code.

use strict;
use warnings;
use Data::Dumper;

my %default_values = (
    'float' => 0,
    'int'   => 3,    # Unique value for visibility during testing
    'bool'  => 'false'
);

my %structs;

parse_struct_definitions();

print Dumper(%structs);

sub parse_struct_definitions {
    # Reads the typedef struct lines in the __DATA__ section to populate the
    # %structs hash. Created for simple updates to the defined structures
    # (simply copy and paste from the header files into the DATA section
    # below)
    local $/ = 'typedef struct {';

    while (my $line = <DATA>) {
        chomp $line;
        next if $line !~ /\w/;

        # For me to parse out the data more easily:
        $line =~ s/\n/ /g;
        $line =~ s/\s+/ /g;

        # Break the line into members and the struct name
        my ($member_string, $name) = $line =~ /([^\}]+)\s\}\s(.+)/;
        my @members = split ';', $member_string;
        $name =~ tr/; //d;

        foreach my $member (@members) {
            next if $member !~ /\w/;
            my ($type, $member_name) = split " ", $member;
            push @{$structs{$name}},
                 [ $type, $member_name, $default_values{$type} ];
        }
    }
} # end of parse_struct_definitions

__DATA__
typedef struct {
    float one;
    float two;
    int   three;
    bool  potato;
} struct_name;

typedef struct {
    float one;     /* 0.0 */
    float two;     /* 0.0 */
    int   three;   /* 0 */
    bool  potato;  /* false */
} struct_name_with_comments;
Re^3: C parsing questions by GrandFather (Saint) on Nov 29, 2005 at 00:38 UTC
If you change your data to:

typedef struct {
    float one;
    float two;
    int   three;
    bool  potato;
} struct_name;

typedef struct {
    float one;     /* 1.1 */
    float two;     /* 2.2 */
    int   three;   /* 3 */
    bool  potato;  /* true */
} struct_name_with_comments;

You get the following (wrapped to compress):

$VAR1 = 'struct_name_with_comments';
$VAR2 = [ ['float', 'one', 0], ['/*', '1.1', undef], ['/*', '2.2', undef], ['/*', '3', undef], ['/*', 'true', undef] ];
$VAR3 = 'struct_name';
$VAR4 = [ ['float', 'one', 0], ['float', 'two', 0], ['int', 'three', 3], ['bool', 'potato', 'false'] ];

There appear to be bugs :).

Given that you run both `$line =~ s/\n/ /g;` and `$line =~ s/\s+/ /g;`, you could just use `$line =~ s/[\n\s]+/ /g;`.

The use of `$/` is nice. Good to see someone remembering it's there.
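One way the comment handling could be made to work is to match each member with a regex that treats a trailing /* ... */ as optional, and to use the comment text as the default when it is present. This is only a sketch under assumptions: parse_structs is a hypothetical helper that reads from a string rather than from DATA, reading the comment as the default is one interpretation of the "default supplied or not" idea, and multi-word types such as unsigned int are not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: parses "typedef struct { ... } name;" blocks from a
# string and returns a hash of struct name => [ [type, member, default], ... ].
sub parse_structs {
    my ($source, $defaults) = @_;
    my %structs;

    while ($source =~ /typedef\s+struct\s*\{(.*?)\}\s*(\w+)\s*;/gs) {
        my ($body, $name) = ($1, $2);

        # Each member: type, name, semicolon, then an optional /* value */.
        while ($body =~ m{(\w+)\s+(\w+)\s*;\s*(?:/\*\s*(.*?)\s*\*/)?}gs) {
            my ($type, $member, $comment) = ($1, $2, $3);

            # Assumption: a comment, when present, overrides the type default.
            my $default = defined $comment ? $comment : $defaults->{$type};
            push @{ $structs{$name} }, [ $type, $member, $default ];
        }
    }
    return %structs;
}

my $source = <<'END';
typedef struct {
    float one;    /* 1.1 */
    float two;
    int   three;  /* 3 */
    bool  potato;
} struct_name_with_comments;
END

my %parsed = parse_structs($source,
                           { float => '0.0', int => '0', bool => 'false' });
print join(' ', @$_), "\n" for @{ $parsed{struct_name_with_comments} };
```

Members without a comment fall back to the per-type default, so struct definitions pasted straight from the auto-generated headers should still parse.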
Re^4: C parsing questions by Nkuvu (Priest) on Nov 29, 2005 at 01:42 UTC