Archana has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks, I am badly stuck on an extraction problem. I have a query like this: ((transcription AND factor)NOT(control OR regulat) AND (TATA OR TBP) NOT (BLOOD AND marrow)). I want the NOT terms together in one array, i.e. (control OR regulat) and (BLOOD AND marrow), and the other terms in another array. How do I do that? I am new to regular expressions. Please help me!

Replies are listed 'Best First'.
Re: Regular expreesion for extraction of words
by FunkyMonk (Bishop) on Aug 29, 2007 at 10:43 UTC
    How about:

    $_ = '((transcription AND factor)NOT(control OR regulat) AND (TATA OR TBP) NOT (BLOOD AND marrow))';

    my @caps = m/NOT      # "NOT"
                 \s*      # spaces?
                 (        # start capturing
                   \(     # "("
                   [^)]+  # some not ")"s
                   \)     # ")"
                 )        # end capture
                /xg;

    print join "\n", @caps;

    Output:

    (control OR regulat)
    (BLOOD AND marrow)

    See perlretut for a tutorial and perlre for the details on regular expressions.
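The original question also asked for the remaining terms in a second array. A minimal sketch of one way to get both lists in a single pass, assuming the groups are never nested deeper than the outer wrapping parentheses (the helper name `split_query` is made up for this illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: returns two array refs, (\@not_terms, \@other_terms).
sub split_query {
    my ($query) = @_;
    my (@not_terms, @other_terms);

    # An optional operator, optional spaces, then one innermost "(...)" group.
    while ($query =~ /(?:\b(AND|OR|NOT))?\s*(\([^()]+\))/g) {
        my $op = defined $1 ? $1 : '';
        if ($op eq 'NOT') { push @not_terms,   $2 }
        else              { push @other_terms, $2 }
    }
    return (\@not_terms, \@other_terms);
}

my ($not, $other) = split_query(
    '((transcription AND factor)NOT(control OR regulat) AND (TATA OR TBP) NOT (BLOOD AND marrow))'
);
print "NOT:   @$not\n";     # (control OR regulat) (BLOOD AND marrow)
print "Other: @$other\n";   # (transcription AND factor) (TATA OR TBP)
```

A group goes into the NOT list only when NOT directly precedes it; groups preceded by AND, OR, or nothing land in the other list.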

Re: Regular expreesion for extraction of words
by moritz (Cardinal) on Aug 29, 2007 at 10:45 UTC
    Perhaps SQL::Parser might help you.

    If you want to use regexes, you could try something like this:

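    my $op    = qr/AND|OR|NOT/;
    my $block = qr/($op?)\s*\((\w+\s+$op\s+\w+)\)/;   # \w+ so that whole words match
    if ($str =~ m/^\(\s*(?:$block\s*)+\)/) {
        # access $1, $2, ... here
        # (they hold the operator and contents of the last block matched)
    }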
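A capture group inside a repeated group keeps only its last match, so one way to see every block is to apply the block pattern with /g in a loop. The sketch below reuses the `$op` and `$block` idea, but loosens the block body to `[^()]+` so that single-word groups such as (promoter) also match; that change, the `parse_blocks` name, and the hash layout are assumptions made for illustration.

```perl
use strict;
use warnings;

my $op    = qr/AND|OR|NOT/;
my $block = qr/($op?)\s*\(([^()]+)\)/;

# Hypothetical helper: one hash per block, with its leading operator
# ('' if none) and the words inside the parentheses, operators removed.
sub parse_blocks {
    my ($str) = @_;
    my @blocks;
    while ($str =~ /$block/g) {
        my ($lead, $body) = ($1, $2);
        my @words = grep { !/^(?:$op)$/ } split ' ', $body;
        push @blocks, { op => $lead, words => \@words };
    }
    return @blocks;
}

my $query = '((transcription AND factor) NOT (control OR regulat) AND (promoter) NOT(TATABOX))';
for my $blk (parse_blocks($query)) {
    printf "%-3s %s\n", $blk->{op}, join(', ', @{ $blk->{words} });
}
```

Filtering on `$blk->{op} eq 'NOT'` then separates the excluded words from the rest.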
Re: Regular expreesion for extraction of words
by throop (Chaplain) on Aug 29, 2007 at 12:21 UTC
    Dearest Archana,

    Tell us more about these oddly-formed logical expressions:
    Is case significant?
    Is (transcription AND factor)NOT(control OR regulat) to be understood as (transcription AND factor)AND NOT(control OR regulat)?
    Will there be other operators besides AND, OR and NOT (e.g. XOR, !, NAND)?
    How should NOT scope? E.g., given (NOT blood AND soil), what do you want extracted?
    Where will these odd expressions be coming from? User type-in?
    What do you want done with unbalanced parens?

    throop

      Dear throop,
      Yes, it is case sensitive.

      (transcription AND factor) NOT (control OR regulat)
      means that both words, transcription and factor, have to be present, but control and regulat must not be.

      No, only the three operators AND, OR and NOT are used.

      given (NOT blood AND soil) what do you want extracted?

      NOT should never come first, i.e. the first term is always made of words, followed by NOT|OR|AND.

      Actually, I have a text file with some paragraphs that include these words.

      If the user enters a query like this:

      ((transcription AND factor) NOT (control OR regulat) AND (promoter) NOT(TATABOX))

      then the terms (transcription AND factor) AND (promoter) should be highlighted, but not the terms control, regulat and TATABOX.

      For that I have to collect the NOT terms in one array and the other terms in another array.
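Once the non-NOT words are in their own array, the highlighting step could be sketched as below. The `highlight` helper and the `[[...]]` markers are made up for this illustration, and matching is case sensitive, as stated above.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap every whole-word occurrence of the
# wanted words in [[...]] markers.
sub highlight {
    my ($text, @words) = @_;
    return $text unless @words;

    # Longest words first, so a longer alternative wins over a shorter prefix.
    my $alt = join '|',
              map  { quotemeta($_) }
              sort { length($b) <=> length($a) } @words;

    $text =~ s/\b($alt)\b/[[${1}]]/g;
    return $text;
}

my @wanted = qw(transcription factor promoter);
print highlight('The promoter binds a transcription factor, not a control.', @wanted), "\n";
# The [[promoter]] binds a [[transcription]] [[factor]], not a control.
```

Words from the NOT array are simply never passed in, so they stay unmarked.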

      20070911 Janitored by Corion: Added formatting, code tags, as per Writeup Formatting Tips