http://qs1969.pair.com?node_id=350087

oz has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to translate Perl regular expressions into Xerox regular expression syntax (http://www.xrce.xerox.com/competencies/content-analysis/fsCompiler/fssyntax.html).
I need to rewrite every character class of the form [ABCD] as A|B|C|D.

I would appreciate any help.
thanks in advance

Edit by BazB. add formatting, code tags and linkify URL

Re: regular expression-xerox
by diotalevi (Canon) on May 03, 2004 at 18:01 UTC

    You seem to have already figured out that [ABCD] is semi-equivalent to (A|B|C|D). What do you need help with?

    It is more properly equivalent to (?:A|B|C|D). The difference is that (?: ... ) is strictly for grouping and alternation while ( ... ) also captures its contents into a variable that can be accessed with a number like $1, $2, $3, etc.
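To make the difference concrete, here is a minimal sketch comparing the three forms on one input. The sample string and variable names are illustrative only, not taken from the original question.

```perl
use strict;
use warnings;

my $text = 'xBy';

# Capturing group: the matched letter is stored in $1.
my $captured;
if ($text =~ /x(A|B|C|D)y/) {
    $captured = $1;
}

# Non-capturing group: the same strings match, but nothing is captured,
# so a successful match in list context returns just (1).
my @groups = $text =~ /x(?:A|B|C|D)y/;

# Character class: matches exactly the same strings as the alternations.
my $class_matches = $text =~ /x[ABCD]y/ ? 1 : 0;

print "captured: $captured\n";
print "list-context result without captures: @groups\n";
print "class matches: $class_matches\n";
```

All three patterns accept the same inputs; they differ only in whether the matched letter is saved.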

      Because I need to translate it in another regular expression format.

        I fail to see the problem. What part of the problem involving writing [ABCD] as (A|B|C|D) escapes you?

Re: regular expression-xerox
by Fletch (Bishop) on May 03, 2004 at 18:03 UTC

    Of course, to do this you're going to need to parse the Perl regexp to begin with. Take a look at YAPE::Regex. That'll get you a parse tree you can walk and munge into your other format.

Re: regular expression-xerox
by kvale (Monsignor) on May 03, 2004 at 18:01 UTC
    Character classes like [ABCD] can be converted to alternation as follows:
    my $class = 'ABCD';
    my $xerox = join '|', split //, $class;   # create alternation
    $xerox = '(?:' . $xerox . ')';            # non-capturing grouping

    -Mark

      Not quite. You haven't considered:
      1. Characters that have a special meaning, like -, ^, and ] (and that meaning is position dependent!)
      2. Characters that inside a character class don't have a special meaning, but have one outside the class, like +, ?, * and others.
      3. POSIX character class syntax.

      Abigail
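A rough sketch of how points 1 and 2 might be guarded against, assuming the output is still a Perl-style alternation. The helper name `class_to_alternation` is hypothetical, and the sketch deliberately refuses negation, ranges, backslash escapes and POSIX classes rather than translating them.

```perl
use strict;
use warnings;

# Hypothetical helper: convert the body of a simple character class
# (the text between [ and ]) into a Perl non-capturing alternation.
sub class_to_alternation {
    my ($body) = @_;

    # Refuse anything this sketch does not know how to translate.
    die "negated classes are not handled\n"   if $body =~ /^\^/;
    die "POSIX classes are not handled\n"     if $body =~ /\[:\w+:\]/;
    die "backslash escapes are not handled\n" if $body =~ /\\/;
    # A '-' with a character on both sides denotes a range.
    die "ranges are not handled\n"            if $body =~ /.-./s;

    # quotemeta escapes characters such as + ? * that are literal inside
    # a class but special outside it.
    my @chars = map { quotemeta($_) } split //, $body;
    return '(?:' . join('|', @chars) . ')';
}

my $plain   = class_to_alternation('ABCD');
my $special = class_to_alternation('+?*');
print "$plain\n$special\n";
```

The escaping here follows Perl's rules; a different target syntax would need its own escaping convention in place of `quotemeta`.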

        My solution correctly answers the particular requirement the OP stated: convert the character class 'ABCD' to a form that uses alternation. If one extrapolates that requirement to all alphanumerics, then my type of solution still works.

        If one extrapolates to metacharacters like those in 1. and 2., or to predefined POSIX classes or Unicode characters, as in 3., then obviously the parser and translator must be extended to handle these situations.

        But for the simple requirements stated by the OP, a simple solution is best.

        -Mark

      May I ask what ? means in the regular expression? I cannot use ? in the language I am translating to, since it already has a meaning there: any character. One other question: can't this be done with a substitution routine? I need to globally change each occurrence of [] to | in the regular expression. One note: my character classes include only capital letters, as in the simple example I gave. Thanks to everyone offering help :)
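In Perl, the ? in (?: is not a wildcard or quantifier; it is part of the (?: ... ) non-capturing group syntax, so it can simply be dropped when the target syntax has no such construct. For classes made only of capital letters, a global substitution along the following lines may be enough. This is a sketch: the helper name `classes_to_alternations` is hypothetical, and the grouping delimiters are passed in by the caller because the right ones for the target syntax should be checked against its documentation.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite every [XYZ] class consisting only of
# capital letters as an alternation wrapped in caller-supplied delimiters.
sub classes_to_alternations {
    my ($regex, $open, $close) = @_;
    # /g rewrites every class; /e evaluates the replacement as Perl code.
    $regex =~ s{\[([A-Z]+)\]}{ $open . join('|', split //, $1) . $close }ge;
    return $regex;
}

my $converted = classes_to_alternations('X[ABCD]Y[EF]', '(', ')');
print "$converted\n";
```

Anything other than a run of capital letters between the brackets is left untouched, so unexpected input is not silently mangled.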
Re: regular expression-xerox
by fletcher_the_dog (Friar) on May 03, 2004 at 19:03 UTC
    If your character classes contain any character ranges, you will have to account for them. For example, if you have "A-D", you will want to convert it to "(?:A|B|C|D)" and not "(?:A|-|D)".
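One way to expand such ranges, sketched under the assumption that the class body contains only literal characters and simple ranges (no negation, backslash escapes or POSIX classes). The helper name `expand_class` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: expand ranges such as A-D inside a class body
# into their individual characters, then join everything with '|'.
sub expand_class {
    my ($body) = @_;
    my @chars;
    while (length $body) {
        if ($body =~ s/^(.)-(.)//s) {
            # A range: enumerate every character from start to end.
            my ($from, $to) = (ord($1), ord($2));
            die "invalid range\n" if $from > $to;
            push @chars, map { chr($_) } $from .. $to;
        }
        else {
            # A single literal character (including a leading or trailing '-').
            $body =~ s/^(.)//s;
            push @chars, $1;
        }
    }
    return '(?:' . join('|', @chars) . ')';
}

my $expanded = expand_class('A-DX');
print "$expanded\n";
```

The characters are joined unescaped, which is safe for letters and digits; metacharacters would additionally need escaping for the target syntax.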