jhoop has asked for the wisdom of the Perl Monks concerning the following question:

Hello, and thanks to all for the wealth of knowledge here; it has been invaluable. I have an issue that I can't seem to crack with my limited knowledge or searching. I am trying to parse search-strings and extract all the non-control words. My input search-string might look like this:

"non$volatile display" and ((timer oR count$3 Or display) near5 hour).ccls. NOT (LCD).ab.

I would like to parse similar strings, extracting the search terms and ignoring the control words (regardless of their case) and everything between two periods, like .ccls. I would also like to preserve the wildcards and anything in "", like "non$volatile display" (where the $ can be anything; I could just keep the $), and store anything between "" as a single string in the output. The output would be an array of the extracted substrings. Also, if there is a more efficient way to remove the dupes in this routine, I'm all ears.

My code so far is below. It manages to pull out all the alphabetic substrings, ignoring lower-case control words and anything between periods. Any thoughts?

    sub extract_terms(){
        my $input = shift;
        chomp $input;
        my @searchterms = ($input =~ m/\b(?!\.)[a-z]+(?!\.)\b/gi);
        my @omissions = qw(terms and or not with near same xor adj);
        my %h;
        @h{@omissions} = undef;
        @searchterms = grep {not exists $h{$_}} @searchterms;
        return @searchterms;
    }

Which outputs (after sorting):

count, display, hour, LCD, non, NOT, Or, oR, timer, volatile,

Re: Regex with multiple pattern omissions
by jwkrahn (Abbot) on Jan 09, 2011 at 01:33 UTC
      I think I just copied the starting lines of this subroutine in from one of the other subroutines; chomp doesn't need to be there. I'm pretty new to this, and so may be using shift incorrectly or unnecessarily, but this has worked so far. The input for this subroutine is a block of text consisting of a (long) list of \n-separated search-strings. For this routine, I don't need them split, as I'm extracting the pertinent terms from ALL searches and compiling an alphabetized list of non-dupes.
        I think you have misunderstood the comment about prototypes. I suspect that you probably didn't even know that you were declaring a prototype.

        The simple explanation is: when you define a sub X, do not put parens, (), after the name.
        That's it.
        sub X(){} means something very different than just sub X{}.
        I would go as far as to say that you never have to, and normally should not put any (....stuff...) after the sub's name.

        -What you have done with shift is 100% correct.
        -Maybe chomp() is not necessary, but it doesn't "hurt".
        -A more important point for me is to indent the lines within the subroutine by either 3 or 4 spaces.

        "Prototype failure" example:

        #!/usr/bin/perl -w
        use strict;
        # This sequence works, although with a warning, because Perl hasn't yet seen subroutine X.
        X("xyz");

        sub X()   # this means that subroutine X cannot
                  # be called with any argument at all.
                  # sub X();      #is ok,
                  # sub X("abc"); #is not ok.
        {
            my $input = shift;
            print "$input\n";
        }

        # this would fail to produce a result - program fails to compile
        # X("abc");
        # because now that subroutine X() has been seen, it is understood
        # that no arguments can be passed to it.
        __END__
        prints:
        main::X() called too early to check prototype at C:\TEMP\prototypes.pl line 4.
        xyz
Re: Regex with multiple pattern omissions
by Anonymous Monk on Jan 09, 2011 at 00:22 UTC
      interesting! will explore...
Re: Regex with multiple pattern omissions
by Marshall (Canon) on Jan 09, 2011 at 00:22 UTC
    I'm having a bit of trouble understanding the question. It sounds like your routine does what you want? Or not? If not then what else should it do? If you are asking for a "better" way to accomplish what you already have, I would say don't bother. What you have so far is reasonable. It appears to me that you have a clear algorithm that you understand.
      Thanks. Sorry if I was unclear. What I have does several things that I want, but not everything. I would like to augment the match conditions to keep anything between "" together as one string, and to omit the control words regardless of case. I am wondering: if I declare @omissions before the match statement, is it possible to ?! its contents in the match expression, having the m/.../i case-insensitivity apply to the contents of @omissions? (Eventually @omissions will be user-defined and might contain different things.) Also, while checking for dupes after the fact is fine for small arrays (and I'm generally ok leaving it this way), I was wondering if there's a neat (more efficient) way to do it as each matched term is added, in case the input list is huge.
        Thanks, this is a lot more clear now!

        1. One of the very cool things about Perl is that you can build regexes dynamically - this works great. So this can play into the eventual plan for @omissions.

        2. Using a hash table like you have is a very Perl way to remove dupes. This will work fine even for bigger arrays.
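A minimal sketch of both points, assuming the word pattern from the original post is kept unchanged: the omission list is turned into a case-insensitive regex at run time, and a hash filters duplicates as each term is found. The names $is_control and %seen are illustrative and not from the thread, and quoted phrases are still split into separate words here.

```perl
use strict;
use warnings;

# Omission list from the original post; it could just as well be user-supplied.
my @omissions = qw(terms and or not with near same xor adj);

# Build one anchored, case-insensitive regex from the list at run time.
# quotemeta() protects any regex metacharacters in user-supplied words.
my $alternation = join '|', map { quotemeta } @omissions;
my $is_control  = qr/\A(?:$alternation)\z/i;

sub extract_terms {
    my ($input) = @_;
    my (%seen, @terms);

    # Same word pattern as in the original post, wrapped in a capture group.
    while ($input =~ m/\b(?!\.)([a-z]+)(?!\.)\b/gi) {
        my $word = $1;
        next if $word =~ $is_control;               # control word, any case
        push @terms, $word unless $seen{$word}++;   # dedupe as each term is added
    }
    return @terms;    # in first-seen order; the caller can sort
}

my $query = q{"non$volatile display" and ((timer oR count$3 Or display) }
          . q{near5 hour).ccls. NOT (LCD).ab.};
print join(', ', extract_terms($query)), "\n";
```

Because the hash lookup happens once per matched word, the cost of removing duplicates stays roughly proportional to the input size, so this should scale to large lists of search strings.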

        Need to noodle on the regex part of your question...

      The immediate issue is that, in the current output given, oR, Or, and NOT should be omitted (in this case "and" is the only control word from the input string that was correctly omitted). Also, "non" and "volatile" should remain together in the output.
        Eep. I meant "non$volatile display" should remain together in the output.
Re: Regex with multiple pattern omissions
by AnomalousMonk (Archbishop) on Jan 13, 2011 at 01:22 UTC

    A slightly different approach occurred to me. The alternation in the code below 'looks for' (and steps over) everything, even the stuff you want to ignore, but only returns (as a list) those patterns that are captured. The items to be ignored must be first in the alternation! Use of capture groups in an alternation has the side-effect of producing a bunch of undefined list items because every capture group always produces an output even if the output is undefined because the group was not 'visited' in the alternation. This is easily dealt with by grepping for defined values.

    use warnings;
    use strict;

    use List::MoreUtils qw(uniq);

    # extract these.
    my $d_quoted    = qr{ [^"]* }xms;   # body of "-quoted sub-string
    my $searchterms = qr{ [[:alnum:]\$]+ }xms;

    # ignore these.
    my $dotted  = qr{ \. [[:alpha:]]+ \. }xms;
    my $control = qr{ terms | and | or | not | with | near | same | xor | adj
                    }xmsi;   # note /i case insensitive
    my $ignore  = qr{ $dotted | $control }xms;

    my $test = q{"non$volatile display" and ((timer oR count$3 Or }
             . q{display) near5 hour).ccls. NOT (LCD).ab.};

    my @output =
        sort { $a cmp $b }   # explicit block, so uniq is not taken as sort's comparison sub
        uniq
        grep { defined }
        $test =~ m{ $ignore | ($searchterms) | " ($d_quoted) " }xmsg
        ;

    print qq{'$test' \n};
    print qq{'$_' } for @output;

    Output:

    '"non$volatile display" and ((timer oR count$3 Or display) near5 hour).ccls. NOT (LCD).ab.'
    '5' 'LCD' 'count$3' 'display' 'hour' 'non$volatile display' 'timer'
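The undefined list items described above can be seen in isolation with a smaller pattern. This is only a sketch: the shortened control list and the sample input are made up for illustration. It also adds \b word boundaries around the control words, which is an assumption on my part and not part of the code above. Without them, a control word can match at the start of a longer term, just as 'near5' above yields a stray '5'.

```perl
use strict;
use warnings;

# Shortened, hypothetical control list, with \b so "and" does not match inside "android".
my $control = qr{ \b (?: and | or | not ) \b }xmsi;
my $term    = qr{ [[:alnum:]\$]+ }xms;

my $input = q{"big cat" or android};

# Every successful match contributes one slot per capture group, so the raw
# list holds undef for each group that took no part in that match.
my @raw = $input =~ m{ $control | ($term) | " ([^"]*) " }xmsg;

# Keep only the slots that actually captured something.
my @terms = grep { defined } @raw;

printf "%d raw slots, %d defined terms\n", scalar @raw, scalar @terms;
print qq{'$_' } for @terms;
print "\n";
```

Three matches succeed here (the quoted phrase, the control word, and the bare term), each yielding two slots, which is why the raw list is twice as long as the number of matches.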