Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,
I'm trying to split a string such as:
dogs OR cats OR "flying fish" OR (shrimp AND squid)
into an array where the 1st element is "OR", 2nd is "dogs", 3rd is "cats", 4th is "flying fish", and 5th is a reference to another array whose 1st element is "AND", 2nd is "shrimp", and 3rd is "squid".
If you're not beginning to feel at least a little confused you're smarter than I. But anyway, the program should be able to do this even if there are more or less terms, and even if there are parenthetical groups inside parenthetical groups, so long as the terms directly inside any one parenthetical group are connected by the same operator (i.e. either "AND" or "OR").
Any ideas? Thanks in advance.

Replies are listed 'Best First'.
Re: Break string into array
by ikegami (Patriarch) on Sep 18, 2009 at 06:15 UTC
    It's basically the same problem as recursive algorithm for nested data structures

    Update: Actually, it's a bit more different than I thought because you said you can't have both OR and AND in the same list.

    # make_parser.pl use strict; use warnings; use Parse::RecDescent qw( ); my $grammar = <<'__EOI__'; { use strict; use warnings; sub dequote { my ($s) = @_; $s =~ s/^"//; $s =~ s/"\z//; $s =~ s/\\(.)/$1/sg; return $s; } } parse : list /\Z/ { $item[1] } list : term list_[ $item[1] ] list_ : "AND" <commit> and_list { [ $item[1] => $arg[0], @{$item[3]} ] } | "OR" <commit> or_list { [ $item[1] => $arg[0], @{$item[3]} ] } | { $arg[0] } and_list : term and_list_ { [ $item[1], @{$item[2]} ] } and_list_ : "AND" <commit> term and_list_ { [ $item[3], @{$item[4]} ] } | { [] } or_list : term or_list_ { [ $item[1], @{$item[2]} ] } or_list_ : "OR" <commit> term or_list_ { [ $item[3], @{$item[4]} ] } | { [] } term : IDENT | STRING | '(' list ')' { $item[2] } IDENT : /\w+/ STRING : /"(?:\\.|[^\\])*"/s { dequote($item[1]) } __EOI__ Parse::RecDescent->Precompile($grammar, 'Parser') or die("Bad grammar\n");
    # test.pl use strict; use warnings; use Data::Dumper qw( Dumper ); use Parser qw( ); my $text = 'dogs OR cats OR "flying fish" OR (shrimp AND squid)'; my $parser = Parser->new(); print(Dumper($parser->parse($text)));
    >perl make_parser.pl >perl test.pl $VAR1 = [ 'OR', 'dogs', 'cats', 'flying fish', [ 'AND', 'shrimp', 'squid' ] ];

    Update: Small changes to grammar to eliminate backtracking.

      Here is how that might look with Regexp::Grammars
      #!/usr/bin/perl -- use strict; use warnings; my $s = q[dogs OR cats OR "flying fish" OR (shrimp AND squid)]; my $parser = do { use Regexp::Grammars; qr{ # <logfile: - > <[TERM]>* <rule: TERM> <OP> | <MATCH=IDENT> | <MATCH=STRING> | <LIST> <rule: STRING> "([^"]+?)" <rule: OP> AND|OR <rule: IDENT> \w+ <rule: LIST> \( <[TERM]>* \) }xs }; if($s =~ $parser){ my(%rash) = %/;#bah for scite lexer /# undef %/;# bah for scite lexer /# use Data::Dumper(); print Data::Dumper->new([\%rash])->Indent(1)->Useqq(1)->Dump,"\n"; kek(\%rash); # kill empty key print Data::Dumper->new([\%rash])->Indent(1)->Useqq(1)->Dump,"\n"; my $rash = reorder_terms(\%rash); # consumes %rash print Data::Dumper->new([$rash])->Indent(1)->Useqq(1)->Dump,"\n"; } sub reorder_terms { my( $ref ) = @_; if( $$ref{TERM}){ my @term; my @op; for my $t( @{$$ref{TERM}} ){ if( ref $t ){ if( $$t{OP} ){ push @op, delete $$t{OP}; }elsif( $$t{LIST} ){ push @term, reorder_terms(delete $$t{LIST} ); }else{ die "uh oh, no OP or LIST key"; } } else { push @term, $t; } } undef %$ref; #return [@op, @term ]; return [$op[0], @term ]; } die "uh oh, no TERM key"; } sub kek { my ($ref) = @_; my $typ = ref $ref; if( $typ eq 'HASH'){ delete $$ref{""}; for my $val( values %$ref){ ref $val and kek($val); } } if( $typ eq 'ARRAY'){ for my $val( @$ref){ ref $val and kek($val); } } return; } __END__ $VAR1 = { "" => "dogs OR cats OR \"flying fish\" OR (shrimp AND squid)", "TERM" => [ "dogs", { "" => " OR", "OP" => "OR" }, "cats", { "" => " OR", "OP" => "OR" }, "\"flying fish\"", { "" => " OR", "OP" => "OR" }, { "" => " (shrimp AND squid)", "LIST" => { "" => "(shrimp AND squid)", "TERM" => [ "shrimp", { "" => " AND", "OP" => "AND" }, "squid" ] } } ] }; $VAR1 = { "TERM" => [ "dogs", { "OP" => "OR" }, "cats", { "OP" => "OR" }, "\"flying fish\"", { "OP" => "OR" }, { "LIST" => { "TERM" => [ "shrimp", { "OP" => "AND" }, "squid" ] } } ] }; $VAR1 = [ "OR", "dogs", "cats", "\"flying fish\"", [ "AND", "shrimp", "squid" ] ];
      Uncomment <logfile: - > for some debug. See also KinoSearch::Docs::Cookbook::CustomQueryParser, Text::Query.

        Not only does your solution allow

        my $s = q[dogs OR cats AND "flying fish" OR (shrimp AND squid)];
        it parses to the same as
        my $s = q[dogs OR cats OR "flying fish" OR (shrimp AND squid)];

        And it also allows

        my $s = q[OR dogs OR cats OR "flying fish" OR (shrimp AND squid)];

        Finally, quoted strings are left quoted. A parser shouldn't return literals. If you want to differentiate between quoted and unquoted terms, you'll need to add to the parse tree.

        $VAR1 = [ "OR", [ term => "dogs" ], [ term => "cats" ], [ phrase => "flying fish" ], [ "AND", [ term => "shrimp" ], [ term => "squid" ], ] ];
Re: Break string into array
by Anonymous Monk on Sep 18, 2009 at 06:41 UTC
    Use Search::QueryParser
    #!/usr/bin/perl -- use strict; use warnings; ## 2009-09-17- 23:40:30 ## D:\dev\misc\search.queryparser.pl use Search::QueryParser; my $qp = Search::QueryParser->new; my $s = q[dogs OR cats OR "flying fish" OR (shrimp AND squid)]; my $query = $qp->parse($s) or die "Error in query : " . $qp->err; use Data::Dumper; print Dumper( $query ),"\n"; __END__ $VAR1 = { '' => [ { 'value' => 'dogs', 'op' => ':', 'field' => '' }, { 'value' => 'cats', 'op' => ':', 'field' => '' }, { 'quote' => '"', 'value' => 'flying fish', 'op' => ':', 'field' => '' }, { 'value' => { '+' => [ { 'value' => 'shrimp', 'op' => ':', 'field' => '' }, { 'value' => 'squid', 'op' => ':', 'field' => '' } ] }, 'op' => '()', 'field' => '' } ] };

      Good find! It even prevents mixing OR and AND like the OP requested.

      Error in query : [dogs OR cats AND "flying fish" OR (shrimp AND squid) +] : cannot mix AND/OR in requests; use parentheses

      Unfortunately, it does a lot more than the OP requested, but it's just a matter of scanning the output for unsupported constructs