jhoop has asked for the wisdom of the Perl Monks concerning the following question:
Hello, and thanks to all for the wealth of knowledge here; it has been invaluable. I have an issue that I can't seem to crack with my limited knowledge or by searching. I am trying to parse search strings and extract all the non-control words. My input search string might look like this:
"non$volatile display" and ((timer oR count$3 Or display) near5 hour).ccls. NOT (LCD).ab.
I would like to parse strings like this one, extracting the search terms while ignoring the control words (regardless of case) and everything between two periods, like .ccls. I would also like to preserve the wildcards and anything in "", like "non$volatile display" (where the $ can stand for anything; I could just keep the $), and store anything between "" as a single string in the output. The output would be an array of the extracted substrings. Also, if there is a more efficient way to remove the dupes in this routine, I'm all ears.
My code so far is below. It manages to pull out all the alphabetic substrings, ignoring lower-case control words and anything between periods. Any thoughts?
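sub extract_terms {    # no () prototype, since the sub takes an argument
    my $input = shift;
    chomp $input;
    # runs of letters that are not immediately followed by a period
    my @searchterms = ($input =~ m/\b(?!\.)[a-z]+(?!\.)\b/gi);
    my @omissions = qw(terms and or not with near same xor adj);
    my %h;
    @h{@omissions} = undef;
    # case-sensitive lookup, so only lower-case control words are dropped
    @searchterms = grep {not exists $h{$_}} @searchterms;
    return @searchterms;
}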
Which outputs (after sorting):
count, display, hour, LCD, non, NOT, Or, oR, timer, volatile,
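For reference, here is a rough sketch of one way the requirements above might be approached. It is not a vetted solution. It assumes that field tags always look like a period, some letters, and another period; that quoted phrases contain no periods or nested quotes; that a control word may carry a trailing number (as in near5); and that duplicates should be dropped case-insensitively, keeping the first spelling seen. The %is_control hash and the two-pass structure are illustrative choices, not anything from the original code.

```perl
use strict;
use warnings;

# Control words, stored in lower case so the lookup can be case-insensitive.
my %is_control = map { $_ => 1 } qw(terms and or not with near same xor adj);

sub extract_terms {
    my ($input) = @_;
    my (@terms, %seen);

    # Pass 1: drop field tags such as .ccls. or .ab. (assumed form: .letters.)
    $input =~ s/\.[a-z]+\.//gi;

    # Pass 2: take either a quoted phrase (kept whole) or a bare word
    # made of letters, digits and the wildcard character.
    while ($input =~ /"([^"]+)"|([\$a-z0-9]+)/gi) {
        my $term;
        if (defined $1) {
            $term = $1;                   # quoted phrase, never filtered
        }
        else {
            $term = $2;
            (my $base = lc $term) =~ s/\d+$//;   # near5 becomes near
            next if $is_control{$base};          # skip AND, oR, Not, ...
        }
        next if $seen{lc $term}++;        # case-insensitive de-duplication
        push @terms, $term;
    }
    return @terms;
}

my $query = '"non$volatile display" and ((timer oR count$3 Or display) near5 hour).ccls. NOT (LCD).ab.';
print join(', ', extract_terms($query)), "\n";
# prints: non$volatile display, timer, count$3, display, hour, LCD
```

The %seen hash inside the loop is the usual idiom for removing duplicates in a single pass while preserving the order of first appearance, so no separate sort-and-compare step is needed.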