in reply to Need method to create Regular expression for known pattern in the middle of a line

This should suit your needs (it only adds the " if the pattern matches; otherwise it leaves the line alone):
use strict;
use warnings;

while ( <DATA> ) {
    s/([^"])(,\w*,\w*,ResGen,)/$1"$2/;
    print;
}
__DATA__
GENE1="Rattus norvegicus serum and glucocorticoid-regulated kinase (sgk) mRNA, complete cds,NM_019232,333,ResGen,ATP binding|protein serine/threonine kinase|protein amino acid phosphorylation,,,,29517
GENE2="ESTs, Weakly similar to putative serine/threonine protein kinase MAK-V [M.musculus]",NM_144755,331,ResGen,,,,,246273
GENE3="Thiosulfate sulphurtransferase (rhodanese)",X56228,329,ResGen,mitochondrion|sulfate transport|thiosulfate sulfurtransferase,,,,25274
GENE4="Spleen tyrosine kinase,NM_012758,327,ResGen,ATP binding|protein tyrosine kinase|intracellular signaling cascade|protein amino acid phosphorylation,,,,25155
GENE5="Spleen kinase 24,NM_012758,,ResGen,ATP binding|protein tyrosine kinase|intracellular signaling cascade|protein amino acid phosphorylation,,,,25155
This assumes a lot about your data, mainly that the only characters that can appear between those commas are in the character class [A-Za-z0-9_].

-enlil


Re: Re: Need method to create Regular expression for known pattern in the middle of a line
by tachyon (Chancellor) on Feb 10, 2003 at 08:03 UTC

    The problem with your regex (which may or may not be a real problem) is that \w* limits what can be matched: a ( or | etc. in one of those fields will make the match fail unexpectedly. It would be much more robust to use [^,]*, as this matches everything except the comma separator.
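
    As a rough sketch of that suggestion, the substitution from the parent post could be written with [^,]* in place of \w*. The helper name close_quote and the sample line (with a dot in the accession field) are made up for illustration and are not from the thread:

```perl
use strict;
use warnings;

# Hypothetical helper: add a closing quote in front of the two fields that
# precede ResGen, unless a quote is already there.
sub close_quote {
    my ($line) = @_;
    # [^,]* accepts anything except the comma separator, so characters
    # such as ".", "(" or "|" in those two fields no longer break the match.
    $line =~ s/([^"])(,[^,]*,[^,]*,ResGen,)/$1"$2/;
    return $line;
}

# Made-up sample line: the accession field contains a dot, which \w* rejects.
my $sample = 'GENE9="Spleen kinase, type 2,X56228.1,329,ResGen,ATP binding';

# Same substitution with the original \w* pattern, for comparison.
( my $with_w = $sample ) =~ s/([^"])(,\w*,\w*,ResGen,)/$1"$2/;

print "with \\w*  : $with_w\n";
print "with [^,]*: ", close_quote($sample), "\n";
```

    The comma inside the description is still skipped, because the pattern only matches where exactly two comma-free fields sit directly in front of ResGen.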

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      I am sorry not to have mentioned this earlier: fields may be delimited with double quotes on only one side and may contain commas as part of the field content.

      My problem is to identify fields that are delimited with a double quote on only one side, and that may or may not have commas in them, and then to delimit them with double quotes on both sides.
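
      One possible reading of this requirement, sketched with a plain regex and core Perl only. It assumes every record looks like NAME=description,field,field,ResGen,... as in the sample data earlier in the thread; the helper name balance_quotes is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: if the description field has a double quote on only
# one side, add the missing one. Lines without the ResGen anchor, and
# descriptions with both quotes or none, are returned unchanged.
sub balance_quotes {
    my ($line) = @_;
    # $1 = name plus "=", $2 = description (may contain commas),
    # $3 = the two comma-free fields before ResGen and the rest of the line
    if ( $line =~ /^([^=]*=)(.*?)(,[^,]*,[^,]*,ResGen,.*)\z/s ) {
        my ( $name, $desc, $rest ) = ( $1, $2, $3 );
        my $open  = $desc =~ /^"/;
        my $close = length($desc) > 1 && $desc =~ /"\z/;
        $desc = '"' . $desc if !$open && $close;    # only the closing quote present
        $desc = $desc . '"' if $open && !$close;    # only the opening quote present
        return $name . $desc . $rest;
    }
    return $line;
}

print balance_quotes('GENE4="Spleen tyrosine kinase,NM_012758,327,ResGen,ATP binding'), "\n";
```

      The description is found by position (everything between the = and the two fields in front of ResGen), so commas inside it do not matter.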

      regards Ya

        The only change you need to make to the code I suggested would be to use Text::CSV or similar to split your CSV elements up. This will correctly deal with commas,"within, quotes,",,,and,only,split,on,the,unquoted,commas

        use Text::CSV;

        my $csv = Text::CSV->new();

        while (<DATA>) {
            chomp;
            # split on the first "=" only, so the data part stays intact
            my ($name, $data) = split /=/, $_, 2;
            # skip any line that Text::CSV cannot parse
            next unless $csv->parse($data);
            my @data = $csv->fields();
            for my $i ( 0 .. $#data ) {
                next unless $data[$i] eq 'ResGen';
                # found a ResGen so see what we had 3 commas ago (do bounds check too)
                next if $i - 3 < 0;
                my $back_a_bit = $data[$i - 3];
                print chop($back_a_bit) eq '"' ? "$name: Found quote\n" : "$name: No quote\n";
            }
        }

        cheers

        tachyon

        s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print