glwtta has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks,

Straightforward problem, I have a database that uses the following syntax for boolean searches: "word1 word2" will find either word 1 or 2, "+word1 +word2" will find both and "+word1 -word2" will find 1 and not 2; etc, simple enough... just try explaining that to anyone using it though.

So what I need to do is convert the more familiar "(word1 AND word2) OR word3" syntax to this one (btw, parenthesis work the same way in both).

From whatever knowledge of CS I have I understand that this is pretty much beyond what regular expressions can do, but it also seems like doing some sort of parse tree would be overkill (plus I don't really know how to go about it) - any suggestions for a pragmatic approach to this?

Replies are listed 'Best First'.
Re: converting a boolean syntax
by BrowserUk (Patriarch) on Jun 23, 2003 at 22:29 UTC

    If you only need a mechanical conversion, then something like this might be a starting point

    #! perl -sw use strict; my @stack; while( <DATA> ) { print; s[NOT\s+][-]g; 1 while s[(?<!\+)(\b\w+\b)\s+AND\s+][+$1 AND ]g or s[\s+AND\s+(?<!\+)(\b\w+\b)][ AND +$1]g; s[\sAND\s][ ]g; s[\s+OR\s+][ ]g; print $_, $/; } __DATA__ word1 AND word2 word1 OR word2 word1 AND NOT word2 (word1 AND NOT word2) OR word3 word1 AND word2 AND word3 (word1 AND word2 OR word3) AND NOT (word4 OR word5) (word1 AND (word2 OR (word3 AND word4 AND NOT word5)) AND word6) OR wo +rd7

    Output

    D:\Perl\test>268325 word1 AND word2 +word1 +word2 word1 OR word2 word1 word2 word1 AND NOT word2 +word1 -word2 (word1 AND NOT word2) OR word3 (+word1 -word2) word3 word1 AND word2 AND word3 +word1 +word2 +word3 (word1 AND word2 OR word3) AND NOT (word4 OR word5) (+word1 +word2 word3) -(word4 word5) (word1 OR (word2 OR (word3 AND word4 AND NOT word5)) AND word6) OR wor +d7 (word1 (word2 (+word3 +word4 -word5)) +word6) word7

    If you want to syntax check the input before conversion then you have more work to do. Regexp::Common::balanced would be a good starting point for extracting the sub-expressions.

    Might be fun to code, but it would need a much fuller explanation of what is leagal in the output format.

    Eg. is -(sub condition) legal?

    What about (+word1 +word2) (+word3 +word4)?

    The toughest part is probably giving meaningful error messages when you encounter syntax errors.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


Re: converting a boolean syntax
by Abigail-II (Bishop) on Jun 23, 2003 at 22:32 UTC
    Well, Perl regular expressions can do that, but not "regular regular expressions", as they don't have a general way of saying "not", nor have they a non-exponential way of addressing "all of a set, regardless of their order".

    I don't know why you say using some sort of parse tree is going to be overkill. Boolean expressions are amongst the simplest non-trivial expressions to parse; it's a typical excercise for students during the first hours of a compiler course. It shouldn't be too hard to parse this using Perl, even if you don't use Parse::RecDescent.

    The reason I don't include some code in this post is that the specification of your target syntax isn't complete: it doesn't explain grouping. How do you express

    (word1 AND word2) OR word3
    in the target syntax? And how is
    word1 AND (word2 OR word3)
    expressed?

    Abigail

Re: converting a boolean syntax
by Tomte (Priest) on Jun 23, 2003 at 22:34 UTC

    While this has no not, it might be a starting point and accepts only balanced parenthesis (see last __DATA__ string)

    result = ''; my $re = qr{ \( (?: (?>[^()]+) | (??{ $re }) )* \) | (?>[^()]+) }x; while(<DATA>) { print parse($_) . "\n\n\n"; } sub parse { my $line = shift; my $op = ''; my $result; while($line =~ m/\G($re)/g){ my $part = $1; if ($part =~ m/\((.*)\)/) { # previously greedy, non working! $result .= " ( "; $result .= parse($1); # did not recures on prior version $result .= ' ) '; } else { my ($op, $w1, $w2) = matchPart($part); $result .= " ". $op . $w1; $result .= " " . $op . $w2 if $w2; } } return $result; } sub matchPart { my $part = shift; if ($part =~ m/(-?[\w]*)\s+(and|or)\s+(-?\w+)/i) { my $op = $2 eq 'and' ? '+' : $2 eq 'not' ? '-': ''; return ($op, $1, $3); } return ('', '',''); } __DATA__ (word1 or word2) and (word3 or word4) word1 or word2 word2 and word3 (word1 and word2) or word3 (word1 or word2) and (word3 or word4 ((word1 or word2) and (word3 or word4)) __END__ ( word1 word2 ) ( word3 word4 ) word1 word2 +word2 +word3 ( +word1 +word2 ) word3 ( word1 word2 ) ( ( word1 word2 ) ( word3 word4 ) )

    Update: Fixed the code as indicated via comments, to actually match nested parenthesis.

    Edit: Fixed strong tags, thanks to abell

    regards,
    tomte


    Hlade's Law:

    If you have a difficult task, give it to a lazy person --
    they will find an easier way to do it.

Re: converting a boolean syntax
by diotalevi (Canon) on Jun 23, 2003 at 22:35 UTC

    Something like this.

    @required = $str =~ /\+(\w+)/g; @optional = $str =~ /(?<![+-])(\w+)/g; @absent = $str =~ /-(\w+)/g; $required = @required ? '(' . join(' AND ', @required) . ')' : ''; $optional = @optional ? '(', join(' OR ', @optional) . ')' : ''; $absent = @absent ? '(' . join(' AND ', map "NOT $_", @absent) . ')' : + ''; $expr = join ' OR ', grep length(), join(' AND ', grep length(), $required, $absent), $optional;
Re: converting a boolean syntax
by parv (Parson) on Jun 23, 2003 at 22:49 UTC

    How about this (example used is "(A AND B) OR (C OR D NOT E) OR Q"):

    1. Break the statement into smaller ones on the joiner -- AND, OR, NOT -- which are not inside parent expression. That will yield: { OR <-- { (A AND B) , (C OR D NOT E) } , OR <-- { Q } }.
    2. Go thru to each of the component of the broken down tokens, to search for the joiner as above, giving: { AND <-- { A, B } , OR <-- { C , D , NOT E } }.
    3. Do the above for all other sub components, till all are exhausted.
    4. Ending up w/...
      { OR <-- { AND <-- { A , B } , OR <-- { C , D , NOT <-- { E } } } , OR <-- { Q } }
    5. Last step would be add the appropriate signs (or nothing at all) based on parent joiner, which can be easily done in a loop.

    Well, that is what currently comes to mind, bugs expected.

    Updated (Jun 24 2003) the parsed expressions as after adding "OR Q" in the example, forgot to correctly update the hierarchy.

Re: converting a boolean syntax
by artist (Parson) on Jun 23, 2003 at 22:22 UTC
Re: converting a boolean syntax
by kutsu (Priest) on Jun 23, 2003 at 21:32 UTC