in reply to Regex with multiple pattern omissions

I'm having a bit of trouble understanding the question. It sounds like your routine does what you want? Or not? If not, then what else should it do? If you are asking for a "better" way to accomplish what you already have, I would say don't bother. What you have so far is reasonable. It appears to me that you have a clear algorithm that you understand.

Re^2: Regex with multiple pattern omissions
by jhoop (Acolyte) on Jan 09, 2011 at 00:46 UTC
    Thanks. Sorry if I was unclear. What I have does several things that I want, but not everything. I would like to augment the match conditions to keep anything between "" together as one string, and to omit the control words regardless of case. I am wondering: if I declare @omissions before the match statement, is it possible to exclude its contents with a negative lookahead (?!...) in the match expression, having the m/.../i case-insensitivity apply to the contents of @omissions? (Eventually @omissions will be user-defined and might contain different things.) Also, while checking for dupes after the fact is fine for small arrays (and I'm generally OK leaving it this way), I was wondering if there's a neat (more efficient) way to do it as each matched term is added, in case the input list is huge.
      Thanks, this is a lot more clear now!

      1. One of the very cool things about Perl is that you can build regexes dynamically - this works great. So this can play into the eventual plan for @omissions.

      2. Using a hash table like you have is a very Perl-ish way to remove dupes. This will work fine even for bigger arrays.

      Need to noodle on the regex part of your question...
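On the dupes question, here is a minimal sketch of checking each term as it is matched, rather than filtering afterwards. The helper name collect_unique and the sample input are made up for illustration, and the comparison is exact (case-sensitive), which is an assumption about what counts as a duplicate.

```perl
use strict;
use warnings;

# Hypothetical helper: collect terms in first-seen order, skipping repeats.
sub collect_unique {
    my ($input) = @_;
    my (%seen, @searchterms);

    # Walk the string one match at a time instead of building the full list first.
    while ($input =~ /(".+?"|[a-zA-Z][\w\$]+)/g) {
        my $term = $1;
        # $seen{$term}++ is 0 (false) the first time a term appears, true afterwards.
        push @searchterms, $term unless $seen{$term}++;
    }
    return @searchterms;
}

my @terms = collect_unique('timer "LCD panel" display timer "LCD panel" hour');
print join(' | ', @terms), "\n";
# prints: timer | "LCD panel" | display | hour
```

If duplicates should be detected regardless of case, keying the hash on lc($term) instead of $term would do that.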

Re^2: Regex with multiple pattern omissions
by jhoop (Acolyte) on Jan 09, 2011 at 00:59 UTC
    The immediate issue is that, in the current output, oR, Or, and NOT should be omitted (in this case "and" is the only control word from the input string that was correctly omitted). Also, "non" and "volatile" should remain together in the output.
      Eep. I meant "non$volatile display" should remain together in the output.
        Is this headed in the right direction?
        Meaning, does this code produce the output that you want (given your single test case)? Program specifications are hard to write, and I think the best way forward here is just refinement by example.
        #!/usr/bin/perl -w
        use strict;

        my $input = '"non$volatile display" and ((timer oR count$3 Or display) + near5 hour).ccls. NOT (LCD).ab.';
        my @omissions = qw(terms and or not with near same xor adj);
        my $omit = join("|",@omissions);

        $input =~ s/\..*?\.//g;
        $input =~ s/$omit//ig;
        my @searchterms = ($input =~ m/".+?"|[a-zA-Z][\w\$]+/gi);
        print "@searchterms";

        #prints:
        #"non$volatile display" timer count$3 display hour LCD
        update:
        -I think that you mean for these .xyz. terms to be deleted?
        -The above doesn't allow for terms in @omissions to be taken absolutely literally (regex metacharacters in them are not escaped). I need to look in Larry's book for the syntax. But this does show a dynamic regex. It also probably needs to take into account that omit words should match only on word boundaries (whole words, not words within words; that is the \b assertion, see Larry's book).
        -Running substitute operations can take some time as the string is modified after each one, but this may or may not matter time-wise.
        -The question right now: is this the "right" output? I mean for this given single input case?
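For the literal-match and whole-word points in the update, a sketch along these lines might work. quotemeta (the function form of \Q...\E) escapes metacharacters in each omit word, and \b keeps "or" from matching inside a word such as "order". The helper name strip_omissions is hypothetical, and the \d* is an assumption that a numbered operator like near5 should be dropped along with plain near; without it, \b would leave near5 in place.

```perl
use strict;
use warnings;

# Hypothetical helper: remove control words from $input as whole words only.
sub strip_omissions {
    my ($input, @omissions) = @_;
    return $input unless @omissions;    # nothing to omit

    # quotemeta makes each user-supplied word match literally.
    my $alternatives = join '|', map { quotemeta($_) } @omissions;

    # \b ... \b restricts matches to whole words; /i ignores case.
    # \d* is an assumption: it treats near5, adj3, etc. as control words too.
    my $omit_re = qr/\b(?:$alternatives)\d*\b/i;

    $input =~ s/$omit_re//g;
    return $input;
}

my @omissions = qw(terms and or not with near same xor adj);
my $cleaned = strip_omissions('timer oR count$3 Or order near5 hour NOT (LCD)', @omissions);
print "$cleaned\n";    # oR, Or, near5 and NOT are gone; "order" survives
```

One limitation to keep in mind: this still removes control words that appear inside a quoted phrase, since the substitution runs before the phrase is matched.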

        As a general approach, I try to break these complex things into multiple easier steps. Get the right output, then tweak it if performance is not adequate.
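As one possible way to split the job into steps, the sketch below tokenizes first and filters afterwards with a hash lookup, instead of substituting the omit words out of the string. The sub name search_terms is made up for illustration. Stripping trailing digits before the lookup (so that near5 counts as near) is an assumption about the query syntax, and so is treating .xyz. as field tags to delete.

```perl
use strict;
use warnings;

# Hypothetical helper: return the search terms left after dropping control words.
sub search_terms {
    my ($input, @omissions) = @_;

    # Step 1: delete field tags such as .ccls. and .ab. (assumed to be unwanted).
    $input =~ s/\..*?\.//g;

    # Step 2: build a lowercase lookup table of the control words.
    my %omit = map { (lc($_) => 1) } @omissions;

    # Step 3: tokenize, keeping anything between double quotes as one term.
    my @tokens = $input =~ /".+?"|[a-zA-Z][\w\$]+/g;

    # Step 4: keep every token that is not a control word.
    my @terms;
    for my $token (@tokens) {
        (my $word = lc $token) =~ s/\d+\z//;    # near5 -> near (assumption)
        push @terms, $token unless $omit{$word};
    }
    return @terms;
}

my $input = '"non$volatile display" and ((timer oR count$3 Or display) + near5 hour).ccls. NOT (LCD).ab.';
my @omissions = qw(terms and or not with near same xor adj);
print join(' ', search_terms($input, @omissions)), "\n";
# prints: "non$volatile display" timer count$3 display hour LCD
```

Because a quoted phrase is a single token that never equals a control word, words like "and" inside quotes are left alone, and the string is only modified once.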