Regex with multiple pattern omissions

jhoop has asked for the wisdom of the Perl Monks concerning the following question:

Hello, and thanks to all for the wealth of knowledge here; it has been invaluable. I have an issue that I can't seem to crack with my limited knowledge or by searching. I am trying to parse search strings and extract all the non-control words. My input search string might look like this:

"non$volatile display" and ((timer oR count$3 Or display) near5 hour).ccls. NOT (LCD).ab.

I would like to parse similar strings, extracting the search terms and ignoring the control words (regardless of case) and everything between two periods, like .ccls. I would also like to preserve the wildcards and anything in "", like "non$volatile display" (the $ can stand for anything, so I could just keep the $), and store anything between "" as a single string in the output. The output would be an array of the extracted substrings. Also, if there is a more efficient way to remove the dupes in this routine, I'm all ears.

My code so far is below. It manages to pull out all the alphabetic substrings, ignoring lower-case control words and anything between periods. Any thoughts?

sub extract_terms(){
my $input = shift;
chomp $input;
my @searchterms = ($input =~ m/\b(?!\.)[a-z]+(?!\.)\b/gi);
my @omissions = qw(terms and or not with near same xor adj);

my %h;
@h{@omissions} = undef;
    
@searchterms = grep {not exists $h{$_}} @searchterms;
return @searchterms;
}

Which outputs (after sorting):

count, display, hour, LCD, non, NOT, Or, oR, timer, volatile,


Replies are listed 'Best First'.
Re: Regex with multiple pattern omissions by jwkrahn (Abbot) on Jan 09, 2011 at 01:33 UTC
sub extract_terms(){
my $input = shift;
chomp $input;

Your prototype says "accept NO arguments" for this subroutine. Your next statement says that this subroutine accepts ONE argument. Do not use prototypes unless you are attempting to imitate one of Perl's built-in functions. See "Far More than Everything You've Ever Wanted to Know about Prototypes in Perl" by Tom Christiansen.

Your use of chomp here implies that this subroutine only deals with lines input from files. Where is the rest of the file line handling code?
Re^2: Regex with multiple pattern omissions by jhoop (Acolyte) on Jan 09, 2011 at 02:06 UTC
I think I just copied the starting lines of this subroutine in from one of the other subroutines; chomp doesn't need to be there. I'm pretty new to this, and so may be using shift incorrectly or unnecessarily, but this has worked so far. The input for this subroutine is a block of text consisting of a (long) list of \n-separated search strings. For this routine, I don't need them split, as I'm extracting the pertinent terms from ALL searches and compiling an alphabetized list of non-dupes.
Re^3: Regex with multiple pattern omissions by Marshall (Canon) on Jan 09, 2011 at 04:49 UTC
I think you have misunderstood the comment about prototypes. I suspect that you probably didn't even know that you were declaring a prototype. The simple explanation is: when you define a sub X, do not put parens, (), after the name. That's it. sub X(){} means something very different than just sub X{}. I would go as far as to say that you never have to, and normally should not, put any (....stuff...) after the sub's name.

-What you have done with shift is 100% correct.
-Maybe chomp() is not necessary, but it doesn't "hurt".
-A more important point for me is to indent the lines within the subroutine by either 3 or 4 spaces.

"Prototype failure" example:

#!/usr/bin/perl -w
use strict;
# This sequence works, although with a warning...
# because Perl hasn't yet seen subroutine X.
X("xyz");

sub X()  # this means that subroutine X cannot
         # be called with any argument at all.
         #    X();      is ok,
         #    X("abc"); is not ok.
{
   my $input = shift;
   print "$input\n";
}
# this would fail to produce a result - program fails to compile
# X("abc");
# because now that subroutine X() has been seen, it is understood
# that no arguments can be passed to it.
__END__
prints:
main::X() called too early to check prototype at C:\TEMP\prototypes.pl line 4.
xyz
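For contrast, here is a minimal sketch of the same kind of subroutine declared without a prototype. It reuses the regex from the original post unchanged; the sample input string is made up for illustration.

```perl
use strict;
use warnings;

# No parentheses after the name, so no prototype is declared
# and the sub accepts whatever arguments it is given.
sub extract_terms {
    my ($input) = @_;    # same effect here as: my $input = shift;
    my @searchterms = ($input =~ m/\b(?!\.)[a-z]+(?!\.)\b/gi);
    return @searchterms;
}

# Illustrative input only, not taken from the thread.
my @terms = extract_terms("timer and count");
print "@terms\n";    # prints: timer and count
```

Because there is no prototype, the call compiles whether it appears before or after the definition.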
Re^4: Regex with multiple pattern omissions by jhoop (Acolyte) on Jan 10, 2011 at 14:40 UTC
Re^5: Regex with multiple pattern omissions by Marshall (Canon) on Jan 11, 2011 at 09:21 UTC
Re: Regex with multiple pattern omissions by Anonymous Monk on Jan 09, 2011 at 00:22 UTC
See Search::QueryBuilder, and especially its SEE ALSO section.
Re^2: Regex with multiple pattern omissions by jhoop (Acolyte) on Jan 09, 2011 at 00:51 UTC
Interesting! Will explore...
Re: Regex with multiple pattern omissions by Marshall (Canon) on Jan 09, 2011 at 00:22 UTC
I'm having a bit of trouble understanding the question. It sounds like your routine does what you want? Or not? If not, then what else should it do? If you are asking for a "better" way to accomplish what you already have, I would say don't bother. What you have so far is reasonable. It appears to me that you have a clear algorithm that you understand.
Re^2: Regex with multiple pattern omissions by jhoop (Acolyte) on Jan 09, 2011 at 00:46 UTC
Thanks. Sorry if I was unclear. What I have does several things that I want, but not everything. I would like to augment the match conditions to keep anything between "" together as one string, and to omit the control words regardless of case. I am wondering: if I declare @omissions before the match statement, is it possible to put its contents inside a negative lookahead (?!...) in the match expression, so that the m/.../i case-insensitivity applies to the contents of @omissions? (Eventually @omissions will be user-defined and might contain different things.) Also, while checking for dupes after the fact is fine for small arrays (and I'm generally OK leaving it this way), I was wondering if there's a neat (more efficient) way to do it as each matched term is added, in case the input list is huge.
Re^3: Regex with multiple pattern omissions by Marshall (Canon) on Jan 09, 2011 at 00:57 UTC
Thanks, this is a lot more clear now!

1. One of the very cool things about Perl is that you can build regexes dynamically - this works great. So this can play into the eventual plan for @omissions.
2. Using a hash table like you have is a very Perl way to remove dupes. This will work fine even for bigger arrays.

Need to noodle on the regex part of your question...
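A minimal sketch of both ideas, building the regex from @omissions at run time and dropping duplicates as each term is matched. It is an illustration under assumptions, not code from the thread: the two-argument form of extract_terms and the sample input are hypothetical, and the sketch deliberately ignores the quoted-phrase, $ wildcard, and dotted-field requirements.

```perl
use strict;
use warnings;

# Hypothetical signature: the input string followed by the words to omit.
# Returns the unique terms in first-seen order.
sub extract_terms {
    my ($input, @omissions) = @_;

    # Build one case-insensitive alternation from the list at run time.
    # quotemeta guards against regex metacharacters in user-supplied words.
    my $alternation = join '|', map { quotemeta } @omissions;

    # With an empty list, fall back to (?!), a pattern that never matches,
    # so that nothing is omitted.
    my $omit = @omissions ? qr/(?:$alternation)/i : qr/(?!)/;

    my (%seen, @terms);
    # (?!$omit\b) rejects a word only when the whole word is an omission,
    # so "order" survives even though it starts with "or".
    while ($input =~ /\b(?!$omit\b)([a-z]+)\b/gi) {
        my $term = $1;
        push @terms, $term unless $seen{$term}++;    # dedupe as we go
    }
    return @terms;
}

my @omissions = qw(terms and or not with near same xor adj);
my @terms = extract_terms('timer oR count Or display NOT timer order', @omissions);
print join(', ', @terms), "\n";    # prints: timer, count, display, order
```

The %seen check is case-sensitive, like the hash lookup in the original post; keying on lc $term instead would treat "LCD" and "lcd" as the same term.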
Re^2: Regex with multiple pattern omissions by jhoop (Acolyte) on Jan 09, 2011 at 00:59 UTC
The immediate issue is that, in the current output given, oR, Or, and NOT should be omitted (in this case "and" is the only control word from the input string that was correctly omitted). Also, "non" and "volatile" should remain together in the output.
Re^3: Regex with multiple pattern omissions by jhoop (Acolyte) on Jan 09, 2011 at 01:08 UTC
Eep. I meant "non$volatile display" should remain together in the output.
Re^4: Regex with multiple pattern omissions by Marshall (Canon) on Jan 09, 2011 at 01:21 UTC
Re^5: Regex with multiple pattern omissions by jhoop (Acolyte) on Jan 10, 2011 at 14:28 UTC
Re: Regex with multiple pattern omissions by AnomalousMonk (Archbishop) on Jan 13, 2011 at 01:22 UTC
A slightly different approach occurred to me. The alternation in the code below 'looks for' (and steps over) everything, even the stuff you want to ignore, but only returns (as a list) those patterns that are captured. The items to be ignored must be first in the alternation! Use of capture groups in an alternation has the side-effect of producing a bunch of undefined list items, because every capture group always produces an output, even if the output is undefined because the group was not 'visited' in the alternation. This is easily dealt with by grepping for defined values.

use warnings;
use strict;

use List::MoreUtils qw(uniq);

# extract these.
my $d_quoted    = qr{ [^"]* }xms;  # body of "-quoted sub-string
my $searchterms = qr{ [[:alnum:]\$]+ }xms;

# ignore these.
my $dotted  = qr{ \. [[:alpha:]]+ \. }xms;
my $control = qr{ terms | and | or | not | with | near | same | xor | adj }xmsi;  # note /i case insensitive
my $ignore  = qr{ $dotted | $control }xms;

my $test = q{"non$volatile display" and ((timer oR count$3 Or } .
           q{display) near5 hour).ccls. NOT (LCD).ab.};

# explicit comparison block: a bare "sort uniq LIST" would make sort
# treat uniq as its comparison subroutine.
my @output =
    sort { $a cmp $b }
    uniq
    grep { defined }
    $test =~ m{ $ignore | ($searchterms) | " ($d_quoted) " }xmsg
    ;

print qq{'$test' \n};
print qq{'$_' } for @output;

Output:

'"non$volatile display" and ((timer oR count$3 Or display) near5 hour).ccls. NOT (LCD).ab.'
'5' 'LCD' 'count$3' 'display' 'hour' 'non$volatile display' 'timer'
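One possible refinement of this ignore-first alternation, sketched under two assumptions that the thread does not confirm: that operators such as near5 carry a numeric suffix that should be skipped along with the operator, and that control words should only be ignored when they are whole words. The test string is the thread's input with "display" swapped for "order" to exercise the word-boundary case, and a core-only %seen hash stands in for List::MoreUtils.

```perl
use strict;
use warnings;

# Patterns to keep.
my $d_quoted   = qr{ [^"]* }xms;             # body of a "-quoted phrase
my $searchterm = qr{ [[:alnum:]\$]+ }xms;    # alphanumerics plus the $ wildcard

# Patterns to skip.
# ASSUMPTION: operators may carry a numeric suffix (near5) and must be
# whole words, hence the \d* and the \b anchors on both sides.
my $dotted  = qr{ \. [[:alpha:]]+ \. }xms;
my $control = qr{ \b (?: terms | and | or | not | with | near | same | xor | adj ) \d* \b }xmsi;
my $ignore  = qr{ $dotted | $control }xms;

# Modified sample input: "order" replaces the unquoted "display".
my $test = q{"non$volatile display" and ((timer oR count$3 Or order) } .
           q{near5 hour).ccls. NOT (LCD).ab.};

my %seen;
my @output = sort { $a cmp $b }
             grep { defined($_) && !$seen{$_}++ }    # drop undef slots and dupes
             $test =~ m{ $ignore | ($searchterm) | " ($d_quoted) " }xmsg;

print join(' ', map { "'$_'" } @output), "\n";
# prints: 'LCD' 'count$3' 'hour' 'non$volatile display' 'order' 'timer'
```

With the anchors in place, "order" is kept whole rather than losing its leading "or", and the stray '5' from near5 no longer appears in the result.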