in reply to need regex help to strip things like embedded C comments

Frankly, the way to handle this is with a parser -- something that, in effect, marches through character by character, maintains state information, and gives back chunks of the data with categorizations that you want: comment vs. not-comment. (Since it takes two characters to know you've entered or left a comment, the parser needs to know to look for the second character when it sees the first.)

The state information you need to maintain in this case is the alternation among "not-in-comment-or-quote", "in-quote", and "in-comment". You start out in the first of those, and as soon as you enter either of the others (by detecting an open-quote or open-comment), nothing else matters until you detect the character (pair) that takes you out of that state, putting you back to "not-in-comment-or-quote".
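To make the three states concrete, here is a minimal sketch of such a scanner. It is written in Python rather than Perl purely for illustration, and the function name `strip_c_comments` and the state names are my own. It assumes only `/* ... */` comments matter (no `//` comments, no trigraphs or line continuations), and it replaces each comment with a single space, as a C preprocessor would.

```python
def strip_c_comments(src: str) -> str:
    """Remove /* ... */ comments, leaving quoted strings untouched."""
    CODE, QUOTE, COMMENT = "code", "quote", "comment"  # hypothetical state names
    state = CODE
    quote_char = ""
    out = []
    i, n = 0, len(src)
    while i < n:
        ch = src[i]
        nxt = src[i + 1] if i + 1 < n else ""  # one-character lookahead
        if state == CODE:
            if ch == "/" and nxt == "*":       # open-comment pair
                state = COMMENT
                i += 2
                continue
            if ch in "\"'":                    # open-quote
                state = QUOTE
                quote_char = ch
            out.append(ch)
        elif state == QUOTE:
            out.append(ch)
            if ch == "\\" and nxt:             # keep an escaped character as-is
                out.append(nxt)
                i += 2
                continue
            if ch == quote_char:               # matching close-quote
                state = CODE
        else:                                  # state == COMMENT
            if ch == "*" and nxt == "/":       # close-comment pair
                state = CODE
                out.append(" ")                # a comment becomes one space
                i += 2
                continue
        i += 1
    return "".join(out)
```

An unterminated comment simply swallows the rest of the input here; a real tool would probably want to report that as an error.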

So look at Parse::RecDescent -- I suspect that someone has already come up with a parser spec to handle C-like comments.

Re^2: need regex help to strip things like embedded C comments
by almut (Canon) on Jul 22, 2007 at 01:12 UTC
    ... I suspect that someone has already come up with a parser spec to handle C-like comments.

    FWIW, Parse::RecDescent comes with a little demo script (demo_decomment_nonlocal.pl), which does about that. Like C/C++ itself, it doesn't handle nested comments, but it could certainly be extended to do so...

Re^2: need regex help to strip things like embedded C comments
by Eradicatore (Monk) on Jul 22, 2007 at 00:42 UTC
    Thanks all. I agree, a parser can do this. But I figured regex *may* be powerful enough to do it also. I'm pretty darn good at regex, but this one was beyond me.

    I did look at C::Scan, but it didn't run for me on Windows-based (ActiveState) Perl and had next to no documentation, so I skipped it.
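On whether a regex alone can do it: for non-nested `/* ... */` comments, one common approach is a single alternation that matches quoted strings first (and puts them back unchanged) so that a comment opener inside a string is never seen as a comment. The sketch below shows the idea in Python rather than Perl; the names `_PATTERN` and `strip_c_comments_re` are hypothetical, and it ignores `//` comments, trigraphs and line continuations.

```python
import re

# Alternatives are tried left to right at each position, so a quoted
# string is consumed whole before the comment branch can match inside it.
_PATTERN = re.compile(
    r"""
    ("(?:\\.|[^"\\])*")      # group 1: double-quoted string literal
  | ('(?:\\.|[^'\\])*')      # group 2: single-quoted character literal
  | /\*.*?\*/                # a /* ... */ comment (non-greedy, not nested)
    """,
    re.DOTALL | re.VERBOSE,
)

def strip_c_comments_re(src: str) -> str:
    def keep_or_drop(match):
        # Strings are returned unchanged; a comment becomes one space.
        return match.group(1) or match.group(2) or " "
    return _PATTERN.sub(keep_or_drop, src)
```

This is still a tiny tokenizer in disguise: the regex engine's left-to-right scan plays the role of the state variable. Nested comments are out of reach for this pattern, since the non-greedy match stops at the first `*/`.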