George_Sherston has asked for the wisdom of the Perl Monks concerning the following question:

Sibling monks, I've been going round in circles on this for a few fruitless hours, and before I disappear up my own fundament I want to crave your indulgence. I have a text field in my CGI which gives search terms for a search engine. These terms can either be single words, or quote-delimited (" or ') phrases; they can be prefixed with a - or a + to indicate their priority. What I need to end up with is three arrays, @KeysNeed, @KeysAvoid, @Keys which together contain all the words / phrases (with non-word chars and extra whitespace stripped). The first shd contain all words / phrases prefixed with +, the second all those with - and the last all the rest.

My latest iteration of a solution, which doesn't quite work (doesn't recognise prefixed phrases) is below. But when I step back and look at it, I can't help feeling that, as well as not quite working, it's also woefully complex and clunky. Is there a regexmeister / regexmeisterin out there who can point me more in the right direction?
my @KeysNeed; my @KeysAvoid; my @Keys; my $keywords = $q->param('keywords'); # get the phrases out of $keywords and store them: while ($keywords =~ s/([+-]*)["'](.*?\W?.*?)["']//) { my $cat = $1; my $keyword = $2; $keyword =~ s/\s+/ /g; if (defined $1 and $1 eq '+') { push @KeysNeed, $keyword if $keyword; } elsif (defined $1 and $1 eq '-') { push @KeysAvoid, $keyword if $keyword; } else { push @Keys, $keyword if $keyword; } } # strip surplus whitespace: $keywords =~ s/\s+/ /g; # get the remaining words out of $keywords and store them: my @keywords = split /\s+/, $keywords; for (@keywords) { s/^([+-])*//; s/[^a-zA-Z0-9 ]//g; if (defined $1 and $1 eq '+') { push @KeysNeed, $_ if $_; } elsif (defined $1 and $1 eq '-') { push @KeysAvoid, $_ if $_; } else { push @Keys, $_ if $_; } }


§ George Sherston

Replies are listed 'Best First'.
Re: Regexing my Search Terms
by merlyn (Sage) on Jan 07, 2002 at 19:51 UTC
Re: Regexing my Search Terms
by chipmunk (Parson) on Jan 07, 2002 at 20:44 UTC
    Your regex seems strange, specifically the \W? between the two occurences of .*?. That will change how the regex matches when the first quote is immediately followed by another quote, but I'm not sure why you'd want that.

    Two other things to note... There's an assignment from $1 to $cat, but $1 is used instead of $cat in the rest of the loop. The compression of whitespace is redundant before a split that splits on runs of whitespace anyway.

    With some alternation and $+ (for the last capturing parens that actually matched), this can be done with a single loop:

    while ($keywords =~ /([-+]?)(?:'([^']*)'|"([^"]*)"|(\S+))/g) { my $cat = $1; my $keyword = $+; if ($keyword) { if ($cat eq '+') { push @KeysNeed, $keyword; } elsif ($cat eq '-') { push @KeysAvoid, $keyword; } else { push @Keys, $keyword; } } }
      The .*?\W?.*? bit was meant to kick out any phrases that were a hundred percent non-word characters... which I now realise it would not actually do... I needed .*?\W+?.*?... but that wasn't the only thing wrong!

      § George Sherston
Re: Regexing my Search Terms
by joealba (Hermit) on Jan 07, 2002 at 22:24 UTC
    Here's a cleaner, less-buggy version of your code to get you going. You'll want to do the usual parameter sanitization, depending on your specific needs.
    use strict; my (@Keys,@KeysNeed,@KeysAvoid); my $keywords = qq(word +"my phrase" -"my phrase's mom" +other keywords +); while ($keywords) { $keywords =~ s/^\s*([+-]*)(((["'])(.*?)\4)|\S+)\s*//; my $cat = $1; my $keyword = $2; $keyword =~ s/^[^\w\s]+//g; # Strip special chars from the ends $keyword =~ s/[^\w\s]+$//g; next unless $keyword; if ($cat =~ /\+/) { push @KeysNeed, $keyword; } elsif ($cat =~ /\-/) { push @KeysAvoid, $keyword; } else { push @Keys, $keyword; } } print "KEYS NEEDED: @KeysNeed\n"; print "KEYS TO AVOID: @KeysAvoid\n"; print "KEYS: @Keys\n"; # prints: # KEYS NEEDED: my phrase other # KEYS TO AVOID: my phrase's mom # KEYS: word keywords
    Here's a quick explanation of my changes to your regexp:
    s/^\s*([+-]*)(((["'])(.*?)\4)|\S+)\s*//
    • Pull keywords directly from the start of the var.
    • Ignore leading whitespace.
    • Pull any number of +- signs. (Note: This is odd. If they have +-+-+ before a keyword, it'll pull it. The code above considers the + operator to be of highest precedence)
    • Next is the fun one -- If it starts with a quote char, pull everything until you reach the next *matching* quote char. In your code, you look for both chars at the beginning and end, so "my phrase's mom" would get broken up like "my phrase', which is not what you would want.
    • If there is no quote char, pull all chars until you hit a space or the end of the string.
Re: Regexing my Search Terms
by Aristotle (Chancellor) on Jan 07, 2002 at 23:09 UTC
    #!/usr/bin/perl -w use strict; $_ = qq(a "test phrase" +for -"this" +"reg ex"); my (@allkeys, @keys, @keysneed, @keysavoid); # break string in parts push @allkeys, $1 while /([+-]?(".*?"|'.*?'|\S*))/g; # filter accoording to first character @keys = grep /^[^+-]/, @allkeys; @keysneed = grep /^\+/, @allkeys; @keysavoid = grep /^\-/, @allkeys; # remove + and - s/^.// for @keysneed, @keysavoid; # remove bracketing quotes s/^(["'])(.*)\1$/$2/ for @keysneed, @keysavoid, @keys; print map "$_\n", '', @keys, '', @keysneed, '', @keysavoid, ''; __END__ OUTPUT: a test phrase for reg ex this
    Took a bit of fiddling, I couldn't get it right with ["'] for some reason. :-/

    Alternatively, you could always listen to merlyn and refer to CPAN :-)
Re: Regexing my Search Terms
by dreadpiratepeter (Priest) on Jan 09, 2002 at 18:48 UTC
    Assuming that you don't want to strip special characters out of quoted phrases; this is shorter, but uglier. It should handle backslashed quotes as well:
    #!/usr/local/bin/perl -w use strict; use diagnostics; my $str="foo bar -\"weenie roast\" +\"'foo' bar\" \\' -baz +boo +bee / + +ben fat faz -slim 'ooga booga' 'fee fie \\' foe fum' \\'do ray\\' +"; my (@KeysNeed,@KeysAvoid,@Keys); 1 while $str =~ /\+(\w+|((?<!\\)['"])(.*?)(?<!\\)(??{$2}))(?{push @Key +sNeed,$3||$1})|-(\w+|((?<!\\)['"])(.*?)(?<!\\)(??{$5}))(?{push @KeysA +void,$6||$4})|(\w+|((?<!\\)['"])(.*?)(?<!\\)(??{$8}))(?{push @Keys,$9 +||$7})/g; use Data::Dumper; print Dumper(\@KeysNeed,\@KeysAvoid,\@Keys);