vitoco has asked for the wisdom of the Perl Monks concerning the following question:

I have a file with many records. Fields are delimited by tabs. One of the fields is a list of comma-separated keywords. What I want is to extract only those records whose keywords satisfy an expression given at runtime.

Is there a package available that can build a single regular expression from a string like one of the following?

moe&(shemp|curly|joe)&larry
(tom&jerry)|(sylvester&tweety)

If I have to do that by myself, I can manage the ORs with something like this in the regex:

\b(one|two|three)\b

but I'm not sure how to handle the ANDs. I've done some tests with advanced expressions like (?=word) without understanding what is really going on.

I also thought of sorting each keyword list so that a simple regex could be written as \bone\b.*\btwo\b, but that is useless if I want to search for:

olive&(popeye|bluto)

Hints, please?

Re: AND and OR on Regular Expressions
by busunsl (Vicar) on Aug 25, 2009 at 14:07 UTC
    I think the easiest way to handle ANDs is using multiple regexen:
    if (/olive/ and /popeye|bluto/) { }
    But if you know the order of appearance (i.e. olive before popeye) you can use:
    /olive.*(popeye|bluto)/
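    As a minimal sketch of the multiple-regex approach, here it is run against a few keyword lists. The sample records and the sub name matches_query are made up for illustration, and the \b anchors are an addition so that a keyword does not match inside a longer word.

```perl
use strict;
use warnings;

# Hypothetical sample records: each one is a comma-separated keyword list.
my @records = ('popeye,olive', 'olive,brutus', 'bluto,olive', 'popeye,wimpy');

# olive AND (popeye OR bluto), written as two independent matches.
sub matches_query {
    my ($keywords) = @_;
    return ($keywords =~ /\bolive\b/ && $keywords =~ /\b(?:popeye|bluto)\b/) ? 1 : 0;
}

my @hits = grep { matches_query($_) } @records;
print "$_\n" for @hits;
```

    Because each match scans the whole string on its own, the order of the keywords inside a record does not matter.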
Re: AND and OR on Regular Expressions
by ikegami (Patriarch) on Aug 25, 2009 at 14:32 UTC

    The match op looks for a series of patterns one after the other. It doesn't look at the big picture. It can do "or" easily because it simply checks "Is A or B at this spot?". However, that's not possible with "and". "Are A and B at this spot?" is not what you want.

    There are tricks to make the regex match search the string multiple times, each time with a different pattern:

    /^(?=.*?pat1)(?=.*?pat2)...(?=.*?patNm1)(?=.*?patN)/
    /^(?=.*?pat1)(?=.*?pat2)...(?=.*?patNm1).*?patN/

    but those are really just poorly readable versions of

    /pat1/ && /pat2/ && ... && /patNm1/ && /patN/
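    To illustrate the equivalence, here is a small sketch that runs both formulations on the same strings. The words tom and jerry stand in for pat1 and pat2, and the sub name both_ways is hypothetical. Each lookahead is zero-width and starts at ^, so every one of them scans the string from the beginning independently of the others.

```perl
use strict;
use warnings;

# Hypothetical patterns standing in for pat1 and pat2.
my $single = qr/^(?=.*tom)(?=.*jerry)/;

# Returns the result of both formulations for one string.
sub both_ways {
    my ($str) = @_;
    my $chained   = ($str =~ /tom/ && $str =~ /jerry/) ? 1 : 0;
    my $lookahead = ($str =~ $single)                  ? 1 : 0;
    return ($chained, $lookahead);
}

for my $str ('tom,jerry', 'jerry,tom', 'tom,sylvester') {
    my ($chained, $lookahead) = both_ways($str);
    print "$str: chained=$chained lookahead=$lookahead\n";
}
```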

    Checking whether a section of the string contains matches for two patterns is much harder.

      There's no reason to use '.*?' instead of '.*' in your examples - a pattern that will match with '.*?' will match with '.*' as well. But '.*?' can be significantly slower than '.*'. I prefer to avoid '.*?' if '.*' will do.

      Note that the "one regexp" is still useful - you can pass in a single regexp to a function, 'qr' it, pass it in a webform, or store it in a configuration file which you usually cannot do with the && chained ones.
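      A short sketch of that point: once the AND condition lives in a single qr// object, it can be handed around like any other value. The helper filter_records and the sample keyword lists are assumptions made for this example only.

```perl
use strict;
use warnings;

# Hypothetical helper: keeps the records that match an already-compiled regexp.
sub filter_records {
    my ($re, @records) = @_;
    return grep { $_ =~ $re } @records;
}

# The whole AND condition travels as one value.
my $tom_and_jerry = qr/^(?=.*\btom\b)(?=.*\bjerry\b)/;

my @found = filter_records($tom_and_jerry, 'tom,jerry', 'jerry,tomas', 'jerry,tom');
print "$_\n" for @found;
```

      The chained form would need a code reference or a list of patterns to travel the same way.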

        There's no reason to use '.*?' instead of '.*' in your examples.

        It matters if the pattern contains captures.

        I must admit I approached my post backwards. I started with /pat1/ && /pat2/ and gave the functionally equivalent pattern. (Well, /.*?/ should actually be /(?s:.)*?/)
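        Both remarks can be checked with a small sketch. The strings below are invented for the demonstration; the first half shows that '.*?' and '.*' capture different occurrences, the second that a plain '.' stops at a newline while (?s:.) does not.

```perl
use strict;
use warnings;

my $str = "id=1 id=2";

# Non-greedy: the lookahead stops at the first place the capture can match.
my ($first) = $str =~ /^(?=.*?(id=\d))/;

# Greedy: .* runs to the end and backtracks, so the last occurrence is captured.
my ($last) = $str =~ /^(?=.*(id=\d))/;

print "non-greedy: $first\n";
print "greedy:     $last\n";

# Without /s, '.' does not match a newline, so the lookahead cannot reach line 2.
my $two_lines = "alpha\nbeta";
my $plain  = ($two_lines =~ /^(?=.*beta)/)      ? 1 : 0;
my $dotall = ($two_lines =~ /^(?=(?s:.)*beta)/) ? 1 : 0;
print "plain=$plain dotall=$dotall\n";
```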

Re: AND and OR on Regular Expressions
by Anonymous Monk on Aug 25, 2009 at 14:10 UTC
Re: AND and OR on Regular Expressions
by vitoco (Hermit) on Aug 25, 2009 at 17:22 UTC

    Thanks to everyone.

    It seems that the following code does what I want:

    #!perl -w

    my $test = (
        "olive&(popeye|bluto)",
        "(tom&jerry)|(sylvester&tweety)",
        "moe&(shemp|curly|joe)&larry",
        "tom&jerry|sylvester&tweety",         # valid
        "moe ( shemp | curly | joe ) larry",  # also valid
        "moe(&shemp|curly|joe)&larry",        # invalid: "(&" instead of "&("
        "moe ( shemp | curly | joe",          # invalid: missing ")" raises an error
    )[(shift) - 1];

    my $expr = $test;
    $expr =~ s/(\w+)/"(?=.*\\b($1)\\b)"/ge;
    $expr =~ s/[\&\s]//g;
    $expr = "^($expr)";

    print "$test\n$expr\n----\n";

    while (<DATA>) {
        print $_ if /$expr/;
    }

    __DATA__
    tom,jerry
    jerry,tom
    jerry,tomas
    sylvester,tweety
    tweeter,sylvester
    tom,sylvester
    popeye,olive
    olive,brutus
    moe,larry
    shemp,curly,joe
    larry,moe
    larry,curly,moe

    >re.pl 1
    olive&(popeye|bluto)
    ^((?=.*\b(olive)\b)((?=.*\b(popeye)\b)|(?=.*\b(bluto)\b)))
    ----
    popeye,olive

    >re.pl 2
    (tom&jerry)|(sylvester&tweety)
    ^(((?=.*\b(tom)\b)(?=.*\b(jerry)\b))|((?=.*\b(sylvester)\b)(?=.*\b(tweety)\b)))
    ----
    tom,jerry
    jerry,tom
    sylvester,tweety

    >re.pl 3
    moe&(shemp|curly|joe)&larry
    ^((?=.*\b(moe)\b)((?=.*\b(shemp)\b)|(?=.*\b(curly)\b)|(?=.*\b(joe)\b))(?=.*\b(larry)\b))
    ----
    larry,curly,moe

    I know this will raise an error if the expression is invalid, but I'm sure it could be checked first with something "simple" like the following:

    die "Invalid expression <$test>\n"
        if $test =~ /[^a-z\s\&\|\(\)]|^\s*[\&\|]|[\&\|]\s*$|[\&\|]\s*[\&\|]|[\&\|]\s*\)|\(\s*[\|\&]/;

    and I'm sure I can find something more to validate that the parentheses are properly paired.

      Added another improvement to this converter: a "word not in keywords" feature. This is becoming interesting!

      Here is the updated code, which tries many queries and includes a simple expression validation:

      #!perl -w

      my @data = <DATA>;

      for my $test (
      #   (
          "olive&(popeye|bluto)",
          "(tom&jerry)|(sylvester&tweety)",
          "moe&(shemp|curly|joe)&larry",
          "moe curly larry",                      # "&" is optional
          "moe&!curly&larry",                     # curly not present
          "moe ( shemp | curly | joe ) larry",    # "|" is required
          "jerry -tom",                           # standard way of "AND" and "AND NOT"...
          "(moe)((shemp)|(curly)|(joe))(larry)",  # also this
          "tom&jerry|sylvester&tweety",           # use re's default precedence
          "moe(&shemp|curly|joe)&larry",          # error: "(&" instead of "&("
          "moe ( shemp | curly | joe",            # error: missing ")"
          "moe ) curly ( larry",                  # invalid: bad grouping
          "(olive)&(popeye|(bluto|brutus)))",     # error: extra ")"
          "(jerry)((tweety))( )",                 # error: empty group
          "jerry||tweety",                        # error: empty word
          "olive - (bluto | brutus)",             # only words can be excluded
          "olive - bluto - brutus",               # Ok, spaces ignored.
          "(curly|!larry)&!moe",                  # valid, but senseless OR
          "moe&!(!curly)&larry",                  # error: curly present?
          "tom -!jerry"                           # error: don't try...
      #   )[(shift)-1]
      ) {
          print "\n\n$test\n::\n";

          (print("ERROR: Invalid expression\n"), next) if $test =~
              /[^a-z\s\&\|\(\)\!\-]|^\s*[\&\|]|[\&\|]\s*$|[\&\|\(]\s*[\&\|\)]|[\!\-]\s*[^a-z\s]/;
          #    not_valid_chars    |op_begins  | op_ends   | no_consecutive_ops | negated_operator

          my $pars = $test;
          my $i = 0;
          $i++ while $pars =~ s/\((.*?)\)/$1/;
          (print("ERROR: Unpaired $1 of other $i pairs found\n"), next)
              if $pars =~ /([\(\)])/;

          my $expr = $test;
          $expr =~ s/([!\-]?)\s*(\w+)/($1?"(?!":"(?=").".*\\b$2\\b)"/ge;
          $expr =~ s/[\&\s]//g;
          $expr = "^($expr)";

          print "$expr\n::\n";

          print grep /$expr/, @data;
      }

      __DATA__
      tom,jerry
      jerry,tom
      jerry,tomas
      sylvester,tweety
      tweeter,sylvester
      tom,sylvester
      popeye,olive
      olive,brutus
      moe,larry
      shemp,curly,joe
      larry,moe
      larry,curly,moe

      To try just one of the queries, uncomment the two commented-out lines around the list in the for statement and give a number (starting from 1) as a command-line argument.

      I also removed the captures while building the regexp because they aren't used. I had only put them there for clarity.

      The validation code is simple because the allowed syntax is simple too, and is very tied to regular expressions.
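      For the original use case (tab-delimited records with one keyword field), the conversion could be wrapped in a function. This is only a sketch under stated assumptions: the sub name query_to_regex, the record layout, and the sample rows are made up, the keyword list is assumed to be the third field, and the query is assumed to have passed validation already.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution-based conversion.
# Assumes the query has already been validated.
sub query_to_regex {
    my ($query) = @_;
    (my $expr = $query) =~ s/([!\-]?)\s*(\w+)/($1 ? "(?!" : "(?=") . ".*\\b$2\\b)"/ge;
    $expr =~ s/[&\s]//g;
    return qr/^(?:$expr)/;
}

# Made-up records: id, title, keywords (tab-delimited).
my @records = (
    "1\tSpinach\tpopeye,olive",
    "2\tRivals\tolive,bluto,popeye",
    "3\tAlone\tolive",
);

my $re = query_to_regex('olive & popeye & !bluto');

my @selected = grep {
    my @fields = split /\t/, $_;
    $fields[2] =~ $re;          # test only the keyword field
} @records;

print "$_\n" for @selected;
```

      Matching against the keyword field alone keeps words in the other fields from satisfying the query by accident.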

      BTW, is there a better way to write the lines that check whether the parentheses are paired?

        I would rather use:

        $i++ while $pars =~ s/\(([^()]*)\)/$1/;

        A better way is to use Regexp::Common and its balanced pattern (not tested):

        use Regexp::Common;
        $pars =~ /^[^()]*$RE{balanced}{-parens=>'()'}[^()]*$/
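        Another option, not suggested in the thread, is a plain depth counter with no regex substitution at all. A minimal sketch follows; the sub name parens_balanced is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: true when every ")" closes an earlier "(" and none stay open.
sub parens_balanced {
    my ($text) = @_;
    my $depth = 0;
    for my $char ($text =~ /[()]/g) {
        $depth += $char eq '(' ? 1 : -1;
        return 0 if $depth < 0;     # a ")" appeared before its "("
    }
    return $depth == 0 ? 1 : 0;
}

for my $query ('moe&(shemp|curly|joe)&larry', 'moe ( shemp | curly | joe', 'moe ) curly ( larry') {
    printf "%-30s %s\n", $query, parens_balanced($query) ? 'ok' : 'unbalanced';
}
```

        The counter rejects ") ... (" immediately because the depth drops below zero, which is the "bad grouping" case from the test list.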