PerlMonks  

Regex AND

by ady (Deacon)
on Dec 02, 2004 at 11:35 UTC

ady has asked for the wisdom of the Perl Monks concerning the following question:

I have this set of regexes:
^(?!CX36(5|6)) ^(?!JA30[0-2]) ^(?!JA3(([2-8]\d)|(9[0-4]))) ^(?!JA5.*) ^(?!(JA6((0\d)|(1[0-3])))) ^(?!JA64[7-9]) ^(?!JA687.*) ^(?!JA74[0-3]) ^(?!JB5.*) ^(?!(JY(((1|2)\d\d)|(3[0-3]\d)))) ^(?!JY[3-9][5-9]\d) ^(?!JZ51(3|4)00.*)
I must combine them into one regex (for feeding into a parsing program).

What I need is an 'AND' operator, but how is that done in a Perl regex?

Best regards Allan Dystrup

20041202 Edit by ysth: add code and p tags

Replies are listed 'Best First'.
Re: Regex AND
by Corion (Patriarch) on Dec 02, 2004 at 12:50 UTC

    Of course, the easiest way is to concatenate your regular expressions, as they are all zero-width lookaheads:

    m!^(?!CX36(5|6))(?!JA30[0-2])...!
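
    A minimal sketch of that concatenation, assuming the individual patterns are held in a Perl array. The names @parts, $combined and $re are illustrative and not from the original post. Each pattern's leading ^ is stripped and a single anchor is put in front of the joined lookaheads.

```perl
use strict;
use warnings;

# The twelve anchored negative lookaheads from the question.
my @parts = (
    '^(?!CX36(5|6))',
    '^(?!JA30[0-2])',
    '^(?!JA3(([2-8]\d)|(9[0-4])))',
    '^(?!JA5.*)',
    '^(?!(JA6((0\d)|(1[0-3]))))',
    '^(?!JA64[7-9])',
    '^(?!JA687.*)',
    '^(?!JA74[0-3])',
    '^(?!JB5.*)',
    '^(?!(JY(((1|2)\d\d)|(3[0-3]\d))))',
    '^(?!JY[3-9][5-9]\d)',
    '^(?!JZ51(3|4)00.*)',
);

# Strip each leading ^ and concatenate behind a single anchor.
# Zero-width assertions all test the same position, so juxtaposition acts as AND.
my $combined = '^' . join '', map { my $p = $_; $p =~ s/^\^//; $p } @parts;
my $re = qr/$combined/;

for my $name (qw(CX36500 JA30100 JA31000 JY40000 AB12345)) {
    printf "%s %s\n", $name, ($name =~ $re ? 'passes' : 'filtered out');
}
```

    Because every assertion is zero-width, the order of the lookaheads does not change the result, only (possibly) how quickly a non-matching name is rejected.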

    But you should really give it more thought: why do you need to assert that some stuff is not present? In most parsing tools, you can order the recognition steps so that you don't need negative assertions, by putting the more specific rules before the less specific ones. Efficiency will also become an issue if you have more than a few regular expressions and/or more than a few bytes to match on, that is, with unanchored matches.

    It might help if you told us what parsing tool you are using, and why you think that your strings are all anchored at the beginning. As a very easy early-out optimization, I see that, for example, substr($_,0,1) ne 'J' && substr($_,0,1) ne 'C' guarantees that none of your excluded prefixes can match, which could be way faster than regular expressions given a suitably large alphabet to match. Or maybe you want something else? What problem are you trying to solve?
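
    The early-out idea can be sketched as follows. The names passes_filter and $combined_re are hypothetical, and the regex is abbreviated to two of the lookaheads for brevity.

```perl
use strict;
use warnings;

# Abbreviated stand-in for the full combined pattern (assumption: only two prefixes shown).
my $combined_re = qr/^(?!CX36(5|6))(?!JA30[0-2])/;

sub passes_filter {
    my ($name) = @_;
    my $first = substr($name, 0, 1);
    # Every excluded prefix starts with J or C, so a name with any other
    # first character cannot be excluded: accept it without running the regex.
    return 1 if $first ne 'J' && $first ne 'C';
    return $name =~ $combined_re ? 1 : 0;
}

print "$_: ", (passes_filter($_) ? 'keep' : 'drop'), "\n" for qw(AB12345 CX36500 JA30900);
```

    Whether this is actually faster than the regex alone depends on the data; it only pays off when many names start with letters other than J or C.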

      Yes indeed! As you indicate, Corion, this combined regex does the trick for the specified example:
      ^((?!CX36(5|6))(?!JA30[0-2])(?!JA3(([2-8]\d)|(9[0-4])))(?!JA5.*)(?!(JA6((0\d)|(1[0-3]))))(?!JA64[7-9])(?!JA687.*)(?!JA74[0-3])(?!JB5.*)(?!(JY(((1|2)\d\d)|(3[0-3]\d))))(?!JY[3-9][5-9]\d)(?!JZ51(3|4)00.*))

      This seems to me the easiest way to solve the problem, though undoubtedly not the most efficient. But the tradeoff does cut the cheese.

      Thanks a lot Allan

Re: Regex AND
by rrwo (Friar) on Dec 02, 2004 at 12:15 UTC

    It looks like you're using a bunch of negative look-ahead assertions to make sure your strings don't start with certain patterns. There are ways to combine them, but you'll have something that's a bit hairy and inefficient. I would rethink what you're parsing a bit, perhaps focusing on positive rather than negative matches for the data you want.

    I recall there being a Regexp merging module on CPAN, but I've never used it and cannot find it at the moment. It might be helpful for you.

    Check the regular expressions manpage (perlre). I also recommend reading the book Mastering Regular Expressions (O'Reilly) for a tutorial about optimizing regular expressions.

      That'll be Regexp::Optimizer, which "does, ahem, attempts to, optimize regular expressions". It performs trie optimization, which I believe does not work in this particular case.

        The original author may have been thinking of Regexp::Assemble. I can't say if this module will help with this particular problem.

        Regards,
        Rick
Re: Regex AND
by mkirank (Chaplain) on Dec 02, 2004 at 12:44 UTC
    Why can't you use something like if (/regex1/ and /regex2/)? perldoc perlre says:
    "The deeper underlying truth is that juxtaposition in regular expressions always means AND, except when you write an explicit OR using the vertical bar. "/ab/" means match "a" AND (then) match "b", although the attempted matches are made at different positions because "a" is not a zero width assertion, but a one width assertion. "
    Hope this is of some help
Re: Regex AND
by ady (Deacon) on Dec 02, 2004 at 14:16 UTC
    A little more background on the domain of this problem:

    I've written a tool (in Perl) for transforming data on enterprise applications (modules & relations) to an input format for graphic display (nodes & arcs).

    The node names have the general format:

    [A-Z]{2}\d{5}[A-Z]?

    Part of the tool allows you to enter a regex (in a textbox); the program compiles the regex and uses it as a filter to parse the data (e.g. discard data line if node-name !~ node-filter).
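
    A rough sketch of such a filter, assuming the textbox contents arrive as a plain string and the node name is the first whitespace-separated field of each data line. Both are assumptions made for illustration, and compile_filter is a hypothetical helper name.

```perl
use strict;
use warnings;

# Compile user-supplied pattern text; return undef if it is not a valid regex.
sub compile_filter {
    my ($pattern) = @_;
    my $re = eval { qr/$pattern/ };
    return $re;
}

my $filter = compile_filter('^JA30[0-2]') or die "invalid pattern\n";

my @lines = ('JA30100 -> JB50000', 'JA31000 -> JA30200', 'CX36500 -> JA30000');
my @kept;
for my $line (@lines) {
    my ($node) = split ' ', $line;   # node name assumed to be the first field
    next if $node !~ $filter;        # discard data line if node-name !~ node-filter
    push @kept, $line;
}
print "$_\n" for @kept;
```

    Wrapping qr// in eval keeps a mistyped pattern from killing the program; the caller can report the problem and ask for a new pattern instead.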

    For instance you can specify the following regex:

    (CX36(5|6))|(JA30[0-2])|(JA3(([2-8]\d)|(9[0-4])))|(JA5.*)|(JA6((0\d)|(1[0-3])))|(JA64[7-9])|(JA687.*)|(JA74[0-3])|(JB5.*)|(JY(((1|2)\d\d)|(3[0-3]\d)))|(JY[3-9][5-9]\d)|(JZ51(3|4)00.*)
    to indicate that you're only interested in source modules matching the following name conventions (which is an example of an actual application domain) :
    CX365-CX366 JA300-JA302 JA320-JA394 JA5* JA600-JA613 JA647-JA649 JA687* JA740-JA743 JB5* JY100-JY339 JY350-JY999 JZ51300* JZ51400*
    Now it's also often relevant to filter on nodes NOT matching a given application domain (in effect the complement of the domain definition). For the above example, that means all modules which pass a filter combining the following regexes:
    ^(?!CX36(5|6)) ^(?!JA30[0-2]) ^(?!JA3(([2-8]\d)|(9[0-4]))) ^(?!JA5.*) ^(?!(JA6((0\d)|(1[0-3])))) ^(?!JA64[7-9]) ^(?!JA687.*) ^(?!JA74[0-3]) ^(?!JB5.*) ^(?!(JY(((1|2)\d\d)|(3[0-3]\d)))) ^(?!JY[3-9][5-9]\d) ^(?!JZ51(3|4)00.*)
    Thus the need to combine (AND) the "negated" regexes into one big regex and pass that to the parsing/filtering program.

    Allan

      Why can't you negate the first regex to capture all those which don't match? I am assuming that my question is stupid, so please have patience with me. Is the problem that the second regex may be different from the negation of the first?
        Well, I'd have to open the Perl program and change the !~ op to the =~ op each time I want filtering on a "negated domain".

        I could do that, but I prefer a way to express the regex complement directly as a new regex (to be fed to the program). And the way to do that was shown by Corion above.

        Best regards / allan

        ... then again, yes, I could modify the GUI with a checkbox indicating "straight/negated", and switch the Perl comparison operator accordingly. In the end I guess I was intrigued by the "how to climb it", as a regex...

Re: Regex AND
by periapt (Hermit) on Dec 02, 2004 at 12:53 UTC
    You could try joining the individual regexes with the boolean operator
    $myvar =~ /^(?!CX36(5|6))/ && $myvar =~ /^(?!JA30[0-2])/ && ...
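
    A small sketch of the same idea with the regexes kept in a list, so the chain of && does not have to be written out by hand. The helper name matches_all and the three sample patterns are illustrative. Note that this needs a change to the calling program, so it only helps when the program can accept more than one regex.

```perl
use strict;
use warnings;

# A few of the negative lookaheads from the question, precompiled.
my @checks = (qr/^(?!CX36(5|6))/, qr/^(?!JA30[0-2])/, qr/^(?!JB5.*)/);

# True only when the string matches every regex in the list.
sub matches_all {
    my ($str, @res) = @_;
    for my $re (@res) {
        return 0 unless $str =~ $re;
    }
    return 1;
}

for my $myvar (qw(JB51234 JA31000)) {
    print "$myvar: ", (matches_all($myvar, @checks) ? 'passes' : 'rejected'), "\n";
}
```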


    PJ
    use strict; use warnings; use diagnostics;
Re: Regex AND
by eyepopslikeamosquito (Archbishop) on Dec 03, 2004 at 08:39 UTC

    This is discussed in the Perl Cookbook recipe 6.18 "Expressing AND, OR, and NOT in a Single Pattern".
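
    For readers without the book at hand, here is a short sketch of the kind of single-pattern forms that recipe deals with. The sub-patterns foo and bar are placeholders, and the exact patterns printed in the book may differ in detail.

```perl
use strict;
use warnings;

# Placeholder sub-patterns; substitute real ones as needed.
my ($alpha, $beta) = ('foo', 'bar');

my $and = qr/^(?=.*$alpha)(?=.*$beta)/s;   # both occur somewhere in the string
my $or  = qr/$alpha|$beta/;                # at least one occurs
my $not = qr/^(?:(?!$alpha).)*$/s;         # $alpha occurs nowhere in the string

for my $s ('foo then bar', 'only foo', 'neither') {
    printf "%-14s and=%d or=%d not=%d\n", $s,
        ($s =~ $and ? 1 : 0), ($s =~ $or ? 1 : 0), ($s =~ $not ? 1 : 0);
}
```

    The AND form works because both lookaheads are tested at the same position (the start of the string), which is the same reason the concatenated negative lookaheads in this thread behave as a conjunction.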

Actually, Regexp::Assemble would help!
by tphyahoo (Vicar) on Dec 03, 2004 at 20:47 UTC
    I posted earlier that Regexp::Assemble wouldn't help with your problem, because it's "regex or", not "regex and". But it now occurs to me that in your particular situation it might help, efficiency-wise, because you are looking for a regex that does not match any of several regexes. And in boolean logic, "does not match regex1 and does not match regex2 and does not match regex3" is the same as "does not match (regex1 or regex2 or regex3)". So before, you wound up with something like
    (?!regex1)(?!regex2)(?!regex3)
    But you could use Regexp::Assemble to do
    use Regexp::Assemble;
    my $andedRegexes = Regexp::Assemble->new;
    $andedRegexes->add( 'regex1' );
    $andedRegexes->add( 'regex2' );
    $andedRegexes->add( 'regex3' );
    # regex is now equivalent to 'regex(1|2|3)'
    # which is more efficient
    and then do a negative lookahead on that. I'm not sure of the quoting syntax here though.
    $negatedAndedRegexes = (?=qr($andedRegexes))
    Actually I'm pretty sure that's wrong syntax. But you get the idea.

    (Could someone correct that?)
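
    One possible correction, sketched in core Perl so that it runs without extra modules: join the alternatives with | and wrap the result in a single negative lookahead, (?!...) rather than (?=...). With Regexp::Assemble, the compiled pattern returned by its re method could presumably be interpolated in place of $alternation; that substitution is not tested here. The variable names and the shortened prefix list are illustrative.

```perl
use strict;
use warnings;

# Prefixes to exclude (a subset of the question's list, for brevity).
my @excluded = ('CX36(5|6)', 'JA30[0-2]', 'JA5.*', 'JB5.*');

# "not A and not B and not C" is the same as "not (A or B or C)".
my $alternation = join '|', map { "(?:$_)" } @excluded;
my $negated     = qr/^(?!$alternation)/;

# The same filter written as a chain of separate negative lookaheads.
my $chained_src = '^' . join '', map { "(?!$_)" } @excluded;
my $chained     = qr/$chained_src/;

for my $name (qw(CX36600 JA30200 JA51234 JB99999 AB12345)) {
    my ($x, $y) = map { $name =~ $_ ? 1 : 0 } $negated, $chained;
    print "$name negated=$x chained=$y\n";
}
```

    Both forms accept and reject the same names; the single lookahead just gives the regex engine (or an optimizer such as Regexp::Assemble) one alternation to work with instead of several independent assertions.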

    Hope this helps!

    Thomas.

      Yes Thomas, if I chose to open the source and recode the parser, I could walk down that road.

      However, my situation is more like (as pointed out by eyepopslikeamosquito above) the Perl Cookbook 6.18:

      "... So you need to write a single pattern that matches either of two different patterns (the "or" case) or both of two patterns (the "and" case) or that reverses the sense of the match ("not"). This situation arises often in configuration files, web forms, or command-line arguments. ..."

      So with that i do consider my problem solved, -- even though as it's written in the recipe :

      "...It's not a pretty picture, and in a regular program, you'd almost never do this..."

      Sic!

      Best Regards / Allan Dystrup

      "...this very place is the Land of Lotuses..." / Hakuin Ekaku Zenji
