rsiedl has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

Can anyone tell me if it is possible to combine the following two regexes into one?
#!/usr/bin/perl
use strict;
use warnings;

my $string =<< "END";
("Immunologic and Biological Factors"[MESH] OR "Immunosuppressive Agents"[MESH] OR "Transplantation Immunology"[MESH] OR "Allergy and Immunology"[MESH] OR "Graft vs Host Disease"[MESH]) NOT ("Foo"[MESH] OR "Bar"[MESH]) AND ("Kidney Transplantation"[MESH] OR "Liver Transplantation"[MESH] OR "Heart Transplantation"[MESH]) NOT ("My Term"[MESH] OR "Blah"[MESH]) NOT "foobar"[MESH]
END

# This removes all instances of 'NOT "anything"[MESH]'
$string =~ s/ NOT ".*?"\[MESH\] ?//g;

# This removes all instances of 'NOT (anything)'
$string =~ s/ NOT \(.*?\) ?//g;

Cheers,
Reagen

Re: refine regex
by Corion (Patriarch) on Nov 26, 2004 at 11:34 UTC

    The process of unifying two regular expressions can be done in the following steps:

    1. Start with an empty regex : m!!
    2. Extract the common prefix : m! NOT !
    3. Extract the alternating parts A and B from both regexes, and put them inside (?:A|B) : m! NOT (?:".*?"\[MESH\]|\(.*?\))!
    4. Go to step 2 if there is more of the regex(es) left : m! NOT (?:".*?"\[MESH\]|\(.*?\)) ?!
    5. Compare your new regex against the old regular expressions to confirm that it matches exactly what the old ones matched.

    That way, I end up with m! NOT (?:".*?"\[MESH\]|\(.*?\)) ?!.
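
    The steps above can be sketched as follows. This is only an illustration: the names $quoted, $parens, $combined, two_pass and one_pass are made up here, and the two short inputs are invented. Building the alternation from qr// pieces keeps each branch readable, and the two helper subs make the comparison in step 5 mechanical.

    ```perl
    use strict;
    use warnings;

    # The two alternating parts (step 3), kept as separate qr// objects.
    my $quoted = qr/".*?"\[MESH\]/;
    my $parens = qr/\(.*?\)/;

    # Common prefix, alternation, common suffix (steps 2 to 4).
    my $combined = qr/ NOT (?:$quoted|$parens) ?/;

    # The original behaviour: two substitutions, one after the other.
    sub two_pass {
        my ($s) = @_;
        $s =~ s/ NOT ".*?"\[MESH\] ?//g;
        $s =~ s/ NOT \(.*?\) ?//g;
        return $s;
    }

    # The unified regex, applied in a single pass.
    sub one_pass {
        my ($s) = @_;
        $s =~ s/$combined//g;
        return $s;
    }

    # Step 5: compare both versions on sample inputs (invented for this sketch).
    for my $input (
        '"a"[MESH] NOT "b"[MESH]',
        '"a"[MESH] NOT ("b"[MESH]) NOT "c"[MESH]',
    ) {
        my $verdict = two_pass($input) eq one_pass($input) ? 'same' : 'DIFFERENT';
        print "$verdict: $input\n";
    }
    ```

    The first input gives the same result either way. The second does not: in a single pass the trailing ' ?' consumes the space that the next ' NOT' needs, so the last term is left behind. A comparison like this is exactly what step 5 is meant to catch.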

      Step 6. Test it! It was necessary to add a small fix (an optional space before NOT):

      #!/usr/bin/perl
      use strict;
      use warnings;

      my $string_original =<< "END";
      ("Immunologic and Biological Factors"[MESH] OR "Immunosuppressive Agents"[MESH] OR "Transplantation Immunology"[MESH] OR "Allergy and Immunology"[MESH] OR "Graft vs Host Disease"[MESH]) NOT ("Foo"[MESH] OR "Bar"[MESH]) AND ("Kidney Transplantation"[MESH] OR "Liver Transplantation"[MESH] OR "Heart Transplantation"[MESH]) NOT ("My Term"[MESH] OR "Blah"[MESH]) NOT "foobar"[MESH]
      END

      # original
      {
          my $string = $string_original;
          $string =~ s/ NOT ".*?"\[MESH\] ?//g;
          $string =~ s/ NOT \(.*?\) ?//g;
          print $string, "\n";
      }

      # Corion's
      {
          my $string = $string_original;
          $string =~ s! NOT (?:".*?"\[MESH\]|\(.*?\)) ?!!g;
          print $string, "\n";
      }

      # fixed
      {
          my $string = $string_original;
          $string =~ s! ?NOT (?:".*?"\[MESH\]|\(.*?\)) ?!!g;
          print $string, "\n";
      }

        Don't need the trailing ' ?'.
        $string =~ s! ?NOT (?:".*?"\[MESH\]|\(.*?\))!!g;
        works fine.
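
        A small check of what the trailing ' ?' actually changes, on a shortened input invented for illustration. The variable names $with and $without are made up for this sketch.

        ```perl
        use strict;
        use warnings;

        # Shortened sample, made up for this sketch.
        my $input = '"a"[MESH] NOT ("b"[MESH]) AND "c"[MESH] NOT "d"[MESH]';

        # With the trailing ' ?': the space before AND is consumed as well.
        (my $with = $input) =~ s! ?NOT (?:".*?"\[MESH\]|\(.*?\)) ?!!g;

        # Without it: only the space in front of each NOT is removed.
        (my $without = $input) =~ s! ?NOT (?:".*?"\[MESH\]|\(.*?\))!!g;

        print "with:    [$with]\n";
        print "without: [$without]\n";
        ```

        On this input the version with the trailing ' ?' glues the remaining terms together ("a"[MESH]AND "c"[MESH]), while the version without it keeps a single space before AND. Whether that matters depends on how the result is consumed afterwards.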

        Yeah, I picked that up too when I tested :)
        Cheers.
      Thanks Corion. Great explanation!
      Or you can convert it to an NFA and remove useless transitions etc ;) But let's not get into language theory. Good job.

      ----
      Then B.I. said, "Hov' remind yourself nobody built like you, you designed yourself"

Re: refine regex
by ikegami (Patriarch) on Nov 26, 2004 at 15:11 UTC

    Your code and the solutions already presented won't handle nested parens correctly. For example,
    NOT ("test(s)"[MESH] AND ("A"[MESH] OR "B"[MESH]))
    will fail. The solution below works better (although it uses an "experimental" regexp feature).

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $string = 'NOT ("test(s)"[MESH] AND ("A"[MESH] OR "B"[MESH]))';

    my $parens_guts;  # Can't combine this line with the next one.
    $parens_guts = qr/
        (?: "[^"]*"
          | \( (??{ $parens_guts }) \)
          | [^"()]
        )*
    /sx;

    $string =~ s/
        \s* NOT \s*
        (?: "[^"]*"\[MESH\]
          | \( (??{ $parens_guts }) \)
        )
    //gsx;

    print("[$string]$/");

    I could write a Parse::RecDescent solution if you don't want to use the "experimental" (??{ ... }).

    Update: I went and did the Parse::RecDescent version for fun at lunch.
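
    For readers on a newer perl, the same idea can also be sketched with the named recursion syntax, (?&name) and (?(DEFINE)...), instead of (??{ ... }). This is not the code from this post, just a variant under the assumption that Perl 5.10 or later is available. The sub name strip_not and the group name guts are invented for the example.

    ```perl
    use strict;
    use warnings;
    use 5.010;    # named recursion, (?&name), needs Perl 5.10 or later

    # Remove every 'NOT "term"[MESH]' and every 'NOT ( balanced group )'.
    sub strip_not {
        my ($s) = @_;
        $s =~ s/
            \s* NOT \s*
            (?:
                "[^"]*" \[MESH\]        # a single quoted term
              | \( (?&guts) \)          # or a balanced parenthesised group
            )
            (?(DEFINE)
                (?<guts>
                    (?: "[^"]*"             # quoted text; parens inside it are skipped
                      | \( (?&guts) \)      # nested group, matched recursively
                      | [^"()]              # any other character
                    )*
                )
            )
        //gsx;
        return $s;
    }

    print '[', strip_not('"X"[MESH] NOT ("test(s)"[MESH] AND ("A"[MESH] OR "B"[MESH]))'), "]\n";
    ```

    Because quoted strings are consumed as a unit, the parentheses inside "test(s)" do not confuse the matching of the outer group, and the nested ("A"[MESH] OR "B"[MESH]) is removed together with its parent.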

Re: refine regex
by ikegami (Patriarch) on Nov 26, 2004 at 17:47 UTC

    I found another problem. The string
    NOT "foo"[MESH] AND "bar"[MESH]
    will result in the nonsense
    AND "bar"[MESH]
    I don't know if this (a unary NOT) is something you'll encounter.

    I found this problem while writing the RecParser version I just added as an update to 410587
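
    A minimal sketch of how the leftover operator could at least be detected after the substitution. It does not fix the query. The helper has_dangling_operator is hypothetical, and strip_not simply wraps the single-pass substitution discussed earlier in the thread.

    ```perl
    use strict;
    use warnings;

    # The single-pass substitution discussed earlier in the thread.
    sub strip_not {
        my ($s) = @_;
        $s =~ s! ?NOT (?:".*?"\[MESH\]|\(.*?\))!!g;
        return $s;
    }

    # Hypothetical helper: true if the query starts or ends with a bare operator.
    sub has_dangling_operator {
        my ($s) = @_;
        return $s =~ /^\s*(?:AND|OR)\b/ || $s =~ /\b(?:AND|OR|NOT)\s*$/ ? 1 : 0;
    }

    my $result = strip_not('NOT "foo"[MESH] AND "bar"[MESH]');
    print "[$result]\n";
    print "dangling operator\n" if has_dangling_operator($result);
    ```

    A check like this only flags the problem at the very start or end of the string. Operators left dangling inside parentheses would need a real parser, which is where the Parse::RecDescent approach comes in.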