chon has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I'm working on a syntax conversion script that translates source code from an uncommon OOP language to C++.

Right now the filter is working okay - but I have a couple of issues that I'm hoping that someone here can help me with:
1) I am not able to successfully ignore comments
2) I am not able to successfully ignore multiline macros

Here's my general algorithm:
1) slurp in the source code file - into a single string
2) convert syntax
3) write out converted file

Originally - I was pulling the file into an array - and then converting the file line-by-line. This worked pretty well - but had issues with coding styles, where one user would write something like this:
class myclass {

and another might write
class myclass
{

So - to get through that I decided to slurp the entire file into a string. This allows me to search for the language legal patterns without making any assumptions about newlines - which are pretty much allowed anywhere.

BUT! With my line-by-line style I could simply skip (using next) any lines that started with //, were between /* .. */, or contained a \ (presumed to be a multiline macro). Now that everything is one long string I'm having trouble figuring out how to do this.

Some specific examples:
Example 1:
A class in my language looks like this:
class foo; blah blah; endclass

Which I convert to something like this:
class foo { blah blah; }
No problem there.
s/\bclass(\s+)(\w+);/class$1$2 {/g; s/\bendclass\b/};/g;

A macro in my language looks like this:
`define mymacro (blah blah) \
    blah \
    blah blah \
    blah

I need to convert it to:
#define mymacro (blah blah) \
    blah \
    blah blah \
    blah

Problem: sometimes the macro contains code that triggers other filters.
Example:
`define myclassmacro (blah) \
    class myclass``blah ... \
    blah \
    endclass

So I guess my simple questions are:
1) How do I write a regular expression that can ignore a line based on another regular expression?

2) I want to define a regexp for a multiline macro as: starts with `define and ends with the first non-escaped newline. I tried:
my $multiline_preprocessor_macro = qr/^(.*?)(?!\\)\n/sm;

Thanks!

"chon"

Replies are listed 'Best First'.
Re: Help Creating a Code Filter
by chromatic (Archbishop) on Feb 25, 2008 at 00:26 UTC

    This is non-trivial with regular expressions, as there's too much statefulness to handle. If you really want to pursue this path, you probably have to use regexes to find individual potential tokens and write your own state machine to handle transitions and backtracking. Once you've done that, you've basically written your own grammar engine.

    I recommend the use of a grammar, whether Parse::RecDescent for Perl 5 or perhaps Parrot's PGE/PCT combination. The latter has an implementation of C99 in progress in languages/c99/ that might be instructive.

      I agree that the Right Thing is to create a grammar and use a real parser, and that Parse::RecDescent is relatively easy to use. However, if this conversion is a one-off thing, or if you can't easily come up with a grammar for the source language (e.g. it's an ad-hoc language), the regexp approach can get the job done. Furthermore, it's easier to ignore parts of the language you don't understand using regexps.

      If you do take the regexp approach, I would suggest doing the conversion in multiple passes, e.g. remove the comments, then convert the small/local constructs, then convert the larger ones. You probably also want to order your patterns from most-specific to most-general. You are lucky that you're running the result through a C++ compiler, since if you mistranslate something, odds are good that the compiler will catch it.
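      A minimal sketch of that multi-pass idea, assuming comments and `define macros should pass through untouched apart from the leading backtick. The sub names convert() and restore() and the NUL-delimited placeholder format are made up for illustration. Note that the (?!\\)\n in the original attempt is a lookahead, which tests the newline itself and so always succeeds; the sketch instead consumes each backslash together with the character that follows it, so an escaped newline never ends the match.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a protected `define back into C++ form.
sub restore {
    my ($text) = @_;
    $text =~ s/^([ \t]*)`define\b/$1#define/;
    return $text;
}

# Hypothetical driver showing the three passes on one slurped string.
sub convert {
    my ($src) = @_;
    my @stash;

    # A macro starts with `define and ends at the first newline
    # that is not preceded by a backslash (/s lets "." eat the newline).
    my $macro   = qr/^[ \t]*`define\b(?:[^\\\n]|\\.)*/ms;
    # Block comments and line comments.
    my $comment = qr{/\*.*?\*/|//[^\n]*}s;

    # Pass 1: hide protected regions behind numbered placeholders.
    $src =~ s{($comment|$macro)}{ push @stash, $1; "\x00" . $#stash . "\x00" }ge;

    # Pass 2: the ordinary conversions now cannot see comments or macros.
    $src =~ s/\bclass(\s+)(\w+);/class$1$2 {/g;
    $src =~ s/\bendclass\b/};/g;

    # Pass 3: put the protected text back.
    $src =~ s/\x00(\d+)\x00/restore($stash[$1])/ge;
    return $src;
}
```

      This assumes the source never contains NUL bytes, and it does not know about string literals, so a "//" inside a quoted string would be mistaken for a comment.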

Re: Help Creating a Code Filter
by dragonchild (Archbishop) on Feb 25, 2008 at 01:06 UTC
    To expand on chromatic's comment about "too much statefulness", a parser like this would be built similarly to how strtok() works in C. Basically, you create a regex that contains a list of all valid tokens and you iterate over it with \G (and the /gc modifiers). The problem here is that "list of all valid tokens" bit. Because, frankly, some tokens are only tokens in the context of other tokens (usually before, but sometimes after or even elsewhere). So, another approach is to start defining patterns. For example, /(class)\s+(\S+)\s*;(.*)endclass/ (or somesuch). Except, that now becomes recursive and you still have the problem of defining all the legal patterns and keeping track of all the state.
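    A rough sketch of the \G iteration, with a deliberately tiny token set that is only an assumption for illustration (a real language would need many more token classes, and the sub name tokenize() is made up):

```perl
use strict;
use warnings;

# Hypothetical tokenizer: walks the string left to right, one token per step.
sub tokenize {
    my ($src) = @_;
    my @tokens;
    for ($src) {    # alias $_ to $src so the bare matches below share one pos()
        while (1) {
            # /c keeps pos() where it was when a match fails,
            # so the next alternative is tried at the same offset.
            if    (/\G\s+/gc)                    { next }
            elsif (m{\G(//[^\n]*|/\*.*?\*/)}gcs) { push @tokens, [COMMENT => $1] }
            elsif (/\G(\w+)/gc)                  { push @tokens, [WORD    => $1] }
            elsif (/\G(.)/gcs)                   { push @tokens, [PUNCT   => $1] }
            else                                 { last }    # end of input
        }
    }
    return @tokens;
}
```

    Each token comes back as a [TYPE, text] pair, so the state machine that follows can skip COMMENT tokens wholesale and only react to WORD tokens such as class or endclass.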

    The pattern approach is probably what I'd start out with if I was undertaking your project and I really didn't feel that I could use a grammar. Do you have a grammar for the original language? Not all languages (for example, Perl) have a grammar.


    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: Help Creating a Code Filter
by John M. Dlugosz (Monsignor) on Feb 25, 2008 at 06:26 UTC
    I think you want to use a grammar, not ad-hoc string processing. That way only lies madness.
Re: Help Creating a Code Filter
by toolic (Bishop) on Feb 25, 2008 at 16:24 UTC
Re: Help Creating a Code Filter
by chon (Initiate) on Feb 25, 2008 at 20:36 UTC
    Madness indeed!

    Thanks for all of the advice.

    I ended up returning to my more intuitive line-by-line approach - slurping the file into an array, rather than into a string. Only a few of the operators require stateful stuff - where I now look ahead in the array for state-affecting keywords until I find the terminating semicolon to make a decision for the line I'm on. This seems to work acceptably - so far.
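    For anyone who lands here later, a simplified sketch of what that look-ahead can look like. It is not the actual script: the sub names statement_at() and convert_lines() are made up, and only the class/endclass rule is shown.

```perl
use strict;
use warnings;

# Hypothetical helper: join lines from index $i through the first one
# that contains the terminating semicolon.
sub statement_at {
    my ($lines, $i) = @_;
    my $stmt = '';
    for my $j ($i .. $#$lines) {
        $stmt .= $lines->[$j];
        last if $lines->[$j] =~ /;/;
    }
    return $stmt;
}

sub convert_lines {
    my @lines = @_;
    my ($in_comment, $in_macro, $open_class) = (0, 0, 0);
    my @out;
    for my $i (0 .. $#lines) {
        my $line = $lines[$i];
        if ($in_comment) {                      # inside /* ... */
            $in_comment = 0 if $line =~ m{\*/};
        }
        elsif ($in_macro) {                     # continuation of a macro
            $in_macro = $line =~ /\\\s*$/ ? 1 : 0;
        }
        elsif ($line =~ m{^\s*//}) {            # line comment: leave alone
        }
        elsif ($line =~ m{^\s*/\*}) {           # block comment opens here
            $in_comment = 1 unless $line =~ m{\*/};
        }
        elsif ($line =~ s/^(\s*)`define\b/$1#define/) {
            $in_macro = 1 if $line =~ /\\\s*$/;
        }
        else {
            # Look ahead: is this "class NAME ;" even if split over lines?
            $open_class = 1
                if !$open_class
                && $line =~ /^\s*class\b/
                && statement_at(\@lines, $i) =~ /^\s*class\s+\w+\s*;/;
            # The first semicolon after the class keyword becomes the brace.
            $open_class = 0 if $open_class && $line =~ s/\s*;/ {/;
            $line =~ s/\bendclass\b/};/;
        }
        push @out, $line;
    }
    return @out;
}
```

    The skip logic is the same "next on comment or backslash" idea as before; the only state carried between lines is three flags.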

    Thanks again,

    -"chon"