Re: Regex AND
by Corion (Patriarch) on Dec 02, 2004 at 12:50 UTC
|
Of course, the easiest way is to concatenate your regular expressions, as they are all zero-width lookaheads:
m!^(?!CX36(5|6))(?!JA30[0-2])...!
But you should really give it more thought: why do you need to assert that some stuff is not present? In most parsing tools, you can order the recognition steps so that you don't need negative assertions, by putting the more specific rules before the less specific ones. Efficiency will also become an issue if you have more than a few regular expressions and/or more than a few bytes to match on, that is, unanchored matches. It might help if you'd tell us what parsing tool you are using, and why you think that your strings are all anchored at the beginning. As a very easy early-out optimization, I see that, for example, substr($_,0,1) ne 'J' && substr($_,0,1) ne 'C' guarantees that none of the excluded prefixes is present (so every negative lookahead succeeds), which could be way faster than regular expressions given a suitably large alphabet to match. Maybe you want something else? What problem are you trying to solve?
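The early-out idea above can be sketched roughly as follows. This is only an illustration, assuming every excluded pattern is anchored at the start of the string and begins with a literal 'J' or 'C'; the sub name passes_filter and the shortened pattern list are made up for the example.

```perl
use strict;
use warnings;

# Shortened, illustrative subset of the negated patterns from the question.
my $combined = qr/^(?!CX36(5|6))(?!JA30[0-2])(?!JB5.*)/;

# Hypothetical helper: returns 1 if $name passes the "not in domain" filter.
sub passes_filter {
    my ($name) = @_;
    my $first = substr($name, 0, 1);
    # Early out: every excluded pattern starts with 'J' or 'C', so a name
    # with any other first character cannot be excluded.
    return 1 if $first ne 'J' && $first ne 'C';
    my $kept = $name =~ $combined ? 1 : 0;
    return $kept;
}

print "$_: ", passes_filter($_), "\n" for qw(AB12345 CX36500 JA30100 JA31000);
```

Whether the substr check actually pays off depends on how many names start with other letters, so it would be worth benchmarking on real data.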
|
Yes indeed! As you indicate, Corion, this combined regex does the trick for the specified example:
^((?!CX36(5|6))(?!JA30[0-2])(?!JA3(([2-8]\d)|(9[0-4])))(?!JA5.*)(?!(JA6((0\d)|(1[0-3]))))(?!JA64[7-9])(?!JA687.*)(?!JA74[0-3])(?!JB5.*)(?!(JY(((1|2)\d\d)|(3[0-3]\d))))(?!JY[3-9][5-9]\d)(?!JZ51(3|4)00.*))
This seems to me the easiest way to solve the problem, though undoubtedly not the most efficient. But the tradeoff does cut the cheese.
Thanks a lot
Allan
Re: Regex AND
by rrwo (Friar) on Dec 02, 2004 at 12:15 UTC
|
It looks like you're using a bunch of negative look-ahead assertions to make sure your strings don't start with certain patterns. There are ways to combine them, but you'll have something that's a bit hairy and inefficient. I would rethink what you're parsing a bit, perhaps focusing on positive rather than negative matches for the data you want.
I recall there being a Regexp merging module on CPAN, but I've never used it and cannot find it at the moment. It might be helpful for you.
Check the regular expressions manpage (perlre).
I also recommend reading the Mastering Regular Expressions book (O'Reilly) for a tutorial about optimizing regular expressions.
|
That'll be Regexp::Optimizer, which "does, ahem, attempts to, optimize regular expressions" — it performs trie optimization which I believe does not work in this particular case.
Re: Regex AND
by mkirank (Chaplain) on Dec 02, 2004 at 12:44 UTC
|
Why can't you use something like if (/regex1/ and /regex2/)?
perldoc perlre says:
"The deeper underlying truth is
that juxtaposition in regular expressions always means AND, except when
you write an explicit OR using the vertical bar. "/ab/" means match
"a" AND (then) match "b", although the attempted matches are made at
different positions because "a" is not a zero width assertion, but a
one width assertion.
"
Hope this is of some help
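As a rough sketch of that suggestion: two separate matches joined with "and" should give the same answer as one pattern with both lookaheads juxtaposed. The sub names are hypothetical and the two patterns are just a small subset of the ones in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: AND expressed in Perl, with two separate matches.
sub passes_separately {
    my ($name) = @_;
    my $ok = ($name =~ /^(?!CX36(5|6))/ and $name =~ /^(?!JA30[0-2])/) ? 1 : 0;
    return $ok;
}

# Hypothetical helper: AND expressed inside one regex by juxtaposing
# the two zero-width lookaheads.
sub passes_combined {
    my ($name) = @_;
    my $ok = $name =~ /^(?!CX36(5|6))(?!JA30[0-2])/ ? 1 : 0;
    return $ok;
}

for my $name (qw(CX36600 JA30200 JA30300 AB12345)) {
    printf "%-8s separately=%d combined=%d\n",
        $name, passes_separately($name), passes_combined($name);
}
```

The first form only works when you control the Perl code doing the matching; the second is what you need when a single regex has to be handed to another program.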
Re: Regex AND
by ady (Deacon) on Dec 02, 2004 at 14:16 UTC
|
A little more background on the domain of this problem:
I've written a tool (in Perl) for transforming data on enterprise applications (modules & relations) to an input format for graphic display (nodes & arcs).
The node names have the general format:
[A-Z]{2}\d{5}[A-Z]?
Part of the tool allows you to enter a regex (in a textbox); the program compiles the regex and uses it as a filter on the data (e.g. discard a data line if node-name !~ node-filter).
For instance you can specify the following regex:
(CX36(5|6))|(JA30[0-2])|(JA3(([2-8]\d)|(9[0-4])))|(JA5.*)|(JA6((0\d)|(1[0-3])))|(JA64[7-9])|(JA687.*)|(JA74[0-3])|(JB5.*)|(JY(((1|2)\d\d)|(3[0-3]\d)))|(JY[3-9][5-9]\d)|(JZ51(3|4)00.*)
to indicate that you're only interested in source modules matching the following naming conventions (an example from an actual application domain):
CX365-CX366
JA300-JA302
JA320-JA394
JA5*
JA600-JA613
JA647-JA649
JA687*
JA740-JA743
JB5*
JY100-JY339
JY350-JY999
JZ51300*
JZ51400*
Now it's also often relevant to filter on nodes NOT matching a given application domain (in effect, the complement of the domain definition). For the above example, that means all modules which pass a filter combining the following regexes:
^(?!CX36(5|6))
^(?!JA30[0-2])
^(?!JA3(([2-8]\d)|(9[0-4])))
^(?!JA5.*)
^(?!(JA6((0\d)|(1[0-3]))))
^(?!JA64[7-9])
^(?!JA687.*)
^(?!JA74[0-3])
^(?!JB5.*)
^(?!(JY(((1|2)\d\d)|(3[0-3]\d))))
^(?!JY[3-9][5-9]\d)
^(?!JZ51(3|4)00.*)
Thus the need to combine (AND) the "negated" regexes into one big regex and pass that to the parsing/filtering program.
Allan
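One way the individual anchored lookaheads listed above could be merged into a single regex is sketched below. It assumes each piece has exactly the form ^(?!...) as in the list; the variable names are made up and only four of the twelve pieces are shown.

```perl
use strict;
use warnings;

# Illustrative subset of the per-range filters, written as in the list above.
my @anchored = (
    '^(?!CX36(5|6))',
    '^(?!JA30[0-2])',
    '^(?!JA5.*)',
    '^(?!JZ51(3|4)00.*)',
);

# Strip the per-piece anchor and put a single anchor in front of them all.
my $source = '^';
for my $piece (@anchored) {
    (my $body = $piece) =~ s/\A\^//;
    $source .= $body;
}
my $combined = qr/$source/;

for my $name (qw(CX36500 JA50000 JZ51300A JA31000)) {
    printf "%-9s %s\n", $name, ($name =~ $combined ? 'kept' : 'discarded');
}
```

Since ^ and the lookaheads are all zero-width, simply concatenating the pieces with their anchors left in place would behave the same; stripping them just keeps the result shorter.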
|
Why can't you negate the first regex to capture all those which don't match? I am assuming that my question is stupid, so please have patience with me. Is the problem that the second regex may be different from the negation of the first?
|
Well, i'd have to open the perl program and change the !~ op to the =~ op each time i want filtering on a "negated domain".
I could do that, but i prefer a way to express the regex complement directly as a new regex (to be fed to the program). -- And the way to do that was shown by Corion above.
Best regards / allan
... then again, yes i could modify the GUI with a checkbox indicating "straight/negated", and switch the perl comparison operator accordingly. In the end i guess i was intrigued by the "how to climb it", as a regex...
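For comparison, the checkbox alternative mentioned above might look roughly like this. It is a sketch only: make_filter and the $negate flag are hypothetical names, and the GUI wiring is left out entirely.

```perl
use strict;
use warnings;

# Hypothetical factory: builds a filter closure from a user-supplied pattern.
# $negate stands for the state of a "straight/negated" checkbox.
sub make_filter {
    my ($pattern, $negate) = @_;
    my $re = qr/$pattern/;
    return sub {
        my ($name) = @_;
        my $matched = $name =~ $re ? 1 : 0;
        # Keep the line when exactly one of "matched" and "negate" is true.
        my $keep = ($matched xor $negate) ? 1 : 0;
        return $keep;
    };
}

my $domain   = '^(?:CX36(5|6)|JA30[0-2])';
my $inside   = make_filter($domain, 0);
my $outside  = make_filter($domain, 1);

for my $name (qw(CX36500 JA30100 AB12345)) {
    printf "%-8s inside=%d outside=%d\n", $name, $inside->($name), $outside->($name);
}
```

This keeps the user's regex positive and moves the negation into the program, at the cost of one more GUI control.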
|
Re: Regex AND
by periapt (Hermit) on Dec 02, 2004 at 12:53 UTC
|
You could try joining the individual regex matches with the boolean && operator:
$myvar =~ /^(?!CX36(5|6))/ && $myvar =~ /^(?!JA30[0-2])/ && ...
PJ
use strict; use warnings; use diagnostics;
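With a dozen patterns the chain of && gets long, so the same check could be written as a loop over a list of compiled regexes. This is a sketch with a made-up helper name (matches_all) and only three of the patterns.

```perl
use strict;
use warnings;

# Illustrative subset of the negated patterns, precompiled with qr//.
my @filters = (qr/^(?!CX36(5|6))/, qr/^(?!JA30[0-2])/, qr/^(?!JB5.*)/);

# Hypothetical helper: returns 1 only if $name matches every regex given.
sub matches_all {
    my ($name, @regexes) = @_;
    # Count the regexes that fail; the name passes when none do.
    my $failed = grep { $name !~ $_ } @regexes;
    return $failed == 0 ? 1 : 0;
}

for my $name (qw(CX36500 JB51234 JA31000)) {
    printf "%-8s %d\n", $name, matches_all($name, @filters);
}
```

Note that grep tests every regex even after one has failed; a plain foreach with an early return would stop sooner if that matters.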
Re: Regex AND
by eyepopslikeamosquito (Archbishop) on Dec 03, 2004 at 08:39 UTC
|
This is discussed in the Perl Cookbook, recipe 6.18, "Expressing AND, OR, and NOT in a Single Pattern".
Actually, Regexp::Assemble would help!
by tphyahoo (Vicar) on Dec 03, 2004 at 20:47 UTC
|
I posted earlier that Regexp::Assemble wouldn't help with your problem, because it does "regex or", not "regex and".
But it now occurs to me that in your particular situation it might help, efficiency-wise, because you are looking for strings that do not match any of several regexes. In boolean logic, "not regex1 and not regex2 and not regex3" is the same as "not (regex1 or regex2 or regex3)".
So before, you wound up with something like
(?!regex1)(?!regex2)(?!regex3)
But you could use Regexp::Assemble to do
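use Regexp::Assemble;
my $andedRegexes = Regexp::Assemble->new;
$andedRegexes->add( 'regex1' );
$andedRegexes->add( 'regex2' );
$andedRegexes->add( 'regex3' );
# the assembled pattern is now equivalent to 'regex(1|2|3)'
# (the module may write it differently, e.g. with a character class),
# which is more efficient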
and then do a negative lookahead on that.
I'm not sure of the quoting syntax here, but I believe it is something like
my $assembled = $andedRegexes->re;
my $negatedAndedRegexes = qr/^(?!$assembled)/;
(Corrections welcome.)
Hope this helps!
Thomas.
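The boolean equivalence this post relies on can be checked without any CPAN module. The sketch below builds both forms by hand from a small, illustrative subset of the patterns and compares them on a few names; Regexp::Assemble would, as far as I understand it, only change how compactly the alternation is written, not what it matches.

```perl
use strict;
use warnings;

my @patterns = ('CX36(5|6)', 'JA30[0-2]', 'JB5.*');

# Form 1: one negative lookahead per pattern, concatenated (AND of NOTs).
my $and_of_nots = '^' . join('', map("(?!$_)", @patterns));

# Form 2: a single negative lookahead around the alternation (NOT of ORs).
# Each pattern is wrapped in (?:...) so a top-level | inside it stays local.
my $not_of_ors = '^(?!(?:' . join('|', map("(?:$_)", @patterns)) . '))';

my ($re1, $re2) = (qr/$and_of_nots/, qr/$not_of_ors/);

for my $name (qw(CX36500 JA30100 JB51234 JA31000 AB12345)) {
    my $first  = $name =~ $re1 ? 1 : 0;
    my $second = $name =~ $re2 ? 1 : 0;
    printf "%-8s and-of-nots=%d not-of-ors=%d\n", $name, $first, $second;
}
```

Both columns should agree for every name, since the two regexes are the same condition written two ways.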
|
"...
So you need to write a single pattern that matches either of two different patterns (the "or" case) or both of two patterns (the "and" case) or that reverses the sense of the match ("not").
This situation arises often in configuration files, web forms, or command-line arguments.
..."
So with that I do consider my problem solved, even though, as it's written in the recipe:
"...It's not a pretty picture, and in a regular program, you'd almost never do this..."
Sic!
Best Regards / Allan Dystrup
"...this very place is the Land of Lotuses..." / Hakuin Ekaku Zenji