huaihai has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I wonder if there is already a solution to my task out there. To help instrument my C code, I want to run a Perl script that automatically scans the source file and separates out each function definition, maybe putting them into arrays. I am baffled as to what kind of regex would give me such results. In particular, the source file can be badly formatted, so the only way to extract a whole function body is by counting matching { and }. Could anyone point me to a resource for doing that?

Replies are listed 'Best First'.
Re: extract C function body
by Fletch (Bishop) on Sep 07, 2006 at 13:00 UTC

    If you can guarantee fairly rigid formatting, you might be able to get away with using regexen to handle the job. There's also the C::Scan module, which I think is what Inline uses (and if it's not, look at Inline::C and see what it does).

Re: extract C function body
by planetscape (Chancellor) on Sep 07, 2006 at 15:36 UTC

    Granted, this may not be at all what you are looking for, but depending on your needs, it might be just the trick.

    Doxygen generates JavaDoc-like documentation for projects in many languages, including C and C++. DoxygenFilter is a new take on an old project (DoxyFilt) that also handles ("filters") Perl code, and now promises support for multi-programming-language projects. See Examples of output generated by doxygen.

    HTH,

    planetscape
Re: extract C function body
by ikegami (Patriarch) on Sep 07, 2006 at 16:11 UTC
Re: extract C function body
by wojtyk (Friar) on Sep 07, 2006 at 15:19 UTC
    I wrote a C/C++ parser that does what you describe: it increments a depth count when a { is reached and decrements it on }:

    $depth++ while /\{/g; $depth-- while /\}/g;
    If depth == 0, the line is tested against a regex that matches function declarations. I used the following, although I'm unsure how accurately it matches up to the actual grammar (I rolled it in my head):

    my $funcrgx    = '((\w+(?:\:\:\w+)*)\s*(\([^)]*\)))';

    The one thing you have to be careful of with this method is preprocessor crap. It can throw the depth count off. I wrote some fuzzy handling code to take care of that (basically always following the first branch of an #ifdef and not counting braces in the other branch).

    Using established parsing modules is probably preferable to rolling it yourself, but this is how I did it :)
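
    For anyone who wants to try the brace-counting idea, here is a minimal sketch of how it could look. It is not wojtyk's actual parser: the sub name extract_functions, the header regex, and the sample C source are all made up for illustration. It assumes no braces appear inside string literals or comments, and it does nothing about #ifdef branches, so both of those can still throw the depth count off.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [name, text] pairs,
# one per function definition found in the C source string.
sub extract_functions {
    my ($src) = @_;
    my @funcs;
    my ($depth, $name, $buf) = (0, undef, '');

    for my $line (split /^/, $src) {
        # At top level, look for "name(args)" followed by "{" or end of line.
        # Prototypes end in ";" and so do not match.
        if ($depth == 0 && !defined $name
            && $line =~ /(\w+)\s*\([^)]*\)\s*(?:\{|$)/) {
            $name = $1;
            $buf  = '';
        }

        # While inside a candidate function, keep every line.
        $buf .= $line if defined $name;

        # tr with an empty replacement list only counts characters.
        $depth += ($line =~ tr/{//) - ($line =~ tr/}//);

        # Back at depth 0 after at least one "}": the body is complete.
        if (defined $name && $depth == 0 && $buf =~ /\}/) {
            push @funcs, [$name, $buf];
            undef $name;
        }
    }
    return @funcs;
}

# Made-up sample input, including a prototype and a one-line function.
my $src = <<'END_C';
#include <stdio.h>
int add(int a, int b);

int add(int a, int b)
{
    if (a > b) { return a + b; }
    return b + a;
}
static void hello(void) { puts("hi"); }
END_C

my @funcs = extract_functions($src);
print "$_->[0]\n" for @funcs;
```

    Each element of @funcs holds the function name and its full text, which should be enough to feed an instrumentation pass. A top-level macro call without a trailing semicolon would be mistaken for a function header here, which is one more reason to prefer an established parsing module when the input is messy.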