Dear Monks,

Program Input:

Following are of interest: **carboxypeptidase** protein $$inhibitor$$ ( **CI** ) , **nanopeptidase** kinase $$inhibitor$$ , **NI** , and others such as , **p(57)** and **polypeptidase** protein $$inhibitor$$ ( **PI** ).

Program Output:

1. Following are of interest: **carboxypeptidase_protein_inhibitor_(CI)** , **nanopeptidase_kinase_inhibitor_(NI)** and others such as , **p(57)** and **polypeptidase_protein_inhibitor_(PI)**.

2. Following are of interest: **carboxypeptidase** protein $$inhibitor$$ ( **CI** ) , nanopeptidase kinase inhibitor , NI , and others such as , p(57) and polypeptidase protein inhibitor ( PI ).

3. Following are of interest: carboxypeptidase protein inhibitor ( CI ) , **nanopeptidase** kinase $$inhibitor$$ , **NI** , and others such as , p(57) and polypeptidase protein inhibitor ( PI ).

4. Following are of interest: carboxypeptidase protein inhibitor ( CI ) , nanopeptidase kinase inhibitor , NI , and others such as , p(57) and **polypeptidase** protein $$inhibitor$$ ( **PI** ).

I can produce output 1 with the regular expression substitution shown below, but I cannot figure out how to produce output sentences 2, 3 and 4.

    # s///g returns the number of substitutions, so a separate match test is not needed
    if ($line =~ s/\*\*([^\*]+)\*\*\s(kinase|isoform|protein|peptide|ligand)\s\$\$([^\$]+)\$\$\s[\(\,]\s\*\*([^\*]+)\*\*\s[\)\,]/**$1_$2_$3_($4)**/g) {
        print WF "$line\n";
    }

Output sentence 1 is the original sentence with every substitution applied by the above code (there are 3 substitutions in this example, although the number varies with the sentence).
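As a minimal sketch of how output 1 and the substitution count could be obtained together: in Perl, `s///g` returns the number of substitutions it made. The sub name `merge_tags` is hypothetical, and the pattern is the one from the code above with the line-wrap damage removed.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the rewritten line and the number of substitutions.
sub merge_tags {
    my ($line) = @_;
    # s///g evaluates to the substitution count (empty string when nothing matched)
    my $count = ($line =~ s/\*\*([^*]+)\*\*\s(kinase|isoform|protein|peptide|ligand)\s\$\$([^\$]+)\$\$\s[(,]\s\*\*([^*]+)\*\*\s[),]/**${1}_${2}_${3}_(${4})**/g);
    return ($line, $count || 0);
}
```

With the program input above, the count returned would be the number of additional sentences (2, 3, 4, ...) to generate.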

Each of the remaining output sentences (2, 3 and 4) is the original input sentence with the original pattern retained at one substitution location and the tags (** and $$) removed everywhere else in the sentence. The number of such output sentences therefore equals the number of patterns substituted by the regex above (3 in this example, as shown in output 1). Is there a nice way of getting outputs 2, 3 and 4?
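One possible approach, sketched here rather than offered as a definitive answer: walk the matches with `m//g` in a `while` loop, read each match's start and end offsets from the built-in `@-` and `@+` arrays, and strip the tags only from the text before and after that match. The sub names `strip_tags` and `variants` are hypothetical; the pattern is assumed to be the same one used for output 1.

```perl
use strict;
use warnings;

# Same pattern as the substitution for output 1 (captures are not needed here).
my $pattern = qr/\*\*[^*]+\*\*\s(?:kinase|isoform|protein|peptide|ligand)\s\$\$[^\$]+\$\$\s[(,]\s\*\*[^*]+\*\*\s[),]/;

# Remove every ** and $$ tag from a piece of text.
sub strip_tags {
    my ($text) = @_;
    $text =~ s/\*\*|\$\$//g;
    return $text;
}

# Return one sentence per match: that match is kept verbatim,
# and the tags are removed from everything before and after it.
sub variants {
    my ($line) = @_;
    my @out;
    while ($line =~ /$pattern/g) {
        # Offsets of the whole match; copy them before running another regex
        my ($start, $end) = ($-[0], $+[0]);
        push @out,
            strip_tags(substr($line, 0, $start))
          . substr($line, $start, $end - $start)
          . strip_tags(substr($line, $end));
    }
    return @out;
}
```

Calling `variants($line)` on the program input should give the sentences numbered 2, 3 and 4 above, in that order; a line with no match gives an empty list.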

Appreciate your help.

Thanks very much in advance.


In reply to regex pattern match problem by newbio
