in reply to Need method to create Regular expression for known pattern in the middle of a line

This appears to work as I interpret your description, though there is at least one ambiguity in it, so I may have dwiw'd the wrong way.

#! perl -slw
use strict;

my $re_resgen = qr[(.)(,(?:\w+),(?:\w+),ResGen)];

while (<DATA>) {
    s[$re_resgen][my $t=$1; $t.='"' unless $t eq '"'; $t.$2]e;
    print;
}

=pod output

c:\test>234040
001 GENE1="Rattus norvegicus serum and glucocorticoid-regulated kinase (sgk) mRNA, complete cds",NM_019232,333,ResGen,ATP binding|protein serine/threonine kinase|protein amino acid phosphorylation,,,,29517
002 GENE2="ESTs, Weakly similar to putative serine/threonine protein kinase MAK-V [M.musculus]",NM_144755,331,ResGen,,,,,246273
003 GENE3="Thiosulfate sulphurtransferase (rhodanese)",X56228,329,ResGen,mitochondrion|sulfate transport| thiosulfate sulfurtransferase,,,,25274
004 GENE4="Spleen tyrosine kinase",NM_012758,327,ResGen,ATP binding|protein tyrosine kinase|intracellular signaling cascade|protein amino acid phosphorylation,,,,25155
005 GENE5="Spleen kinase 24,NM_012758,,ResGen,ATP binding|protein tyrosine kinase|intracellular signaling cascade|protein amino acid phosphorylation,,,,25155

=cut

__DATA__
001 GENE1="Rattus norvegicus serum and glucocorticoid-regulated kinase (sgk) mRNA, complete cds,NM_019232,333,ResGen,ATP binding|protein serine/threonine kinase|protein amino acid phosphorylation,,,,29517
002 GENE2="ESTs, Weakly similar to putative serine/threonine protein kinase MAK-V [M.musculus]",NM_144755,331,ResGen,,,,,246273
003 GENE3="Thiosulfate sulphurtransferase (rhodanese)",X56228,329,ResGen,mitochondrion|sulfate transport| thiosulfate sulfurtransferase,,,,25274
004 GENE4="Spleen tyrosine kinase,NM_012758,327,ResGen,ATP binding|protein tyrosine kinase|intracellular signaling cascade|protein amino acid phosphorylation,,,,25155
005 GENE5="Spleen kinase 24,NM_012758,,ResGen,ATP binding|protein tyrosine kinase|intracellular signaling cascade|protein amino acid phosphorylation,,,,25155
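
Note that record 005 comes through unchanged in the output above: its third field is empty, and `\w+` needs at least one character, so the pattern never matches. If empty fields before `ResGen` should also be handled (an assumption on my part, not something the original question confirms), a minimal variant swaps `\w+` for `\w*`. The helper name `close_quote` is hypothetical, and the sample line is a shortened copy of record 005.

```perl
use strict;
use warnings;

# Same idea as the substitution above, but \w* lets the two fields
# before "ResGen" be empty. Assumption: empty fields are legitimate.
my $re_resgen = qr[(.)(,\w*,\w*,ResGen)];

# Hypothetical helper: returns the line with a closing quote added
# before the ID field when one is not already there.
sub close_quote {
    my ($line) = @_;
    $line =~ s[$re_resgen][ my $t = $1; $t .= '"' unless $t eq '"'; $t . $2 ]e;
    return $line;
}

# Shortened copy of record 005 from the sample data.
print close_quote('005 GENE5="Spleen kinase 24,NM_012758,,ResGen,ATP binding,,,,25155'), "\n";
# prints: 005 GENE5="Spleen kinase 24",NM_012758,,ResGen,ATP binding,,,,25155
```

Lines that already carry the closing quote pass through untouched, because the captured character is then `"` and nothing is appended.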

Examine what is said, not who speaks.

The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

Re: Re: Need method to create Regular expression for known pattern in the middle of a line
by Ya (Initiate) on Feb 10, 2003 at 08:40 UTC
    I am sorry not to have mentioned earlier that fields may be delimited with double quotes on only one side and may contain commas as part of the field content.

    My problem is to identify fields that are delimited with a double quote on only one side, and that may or may not contain commas, and then to enclose them in double quotes on both sides.

    regards
    Ya

      A few observations:

      It appears from your sample data that each record has nine comma-delimited fields, although some are blank and some themselves contain commas.

      Where there are commas within a field, they are always followed by a space.

      There are no spaces on either side of a real comma delimiter.

      From your sample, only the first field appears to exhibit the unbalanced quotes problem.

      If you can confirm these observations as facts, it would allow a more generic solution to be provided.

      Is it your intention only to fix up the missing quotes where they are unbalanced, or do you also want to place quotes around other fields that contain spaces, and thereby make your data readable by standard CSV handlers?
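
If the observations above do hold, they could be turned into code roughly as follows. This is only a sketch under those unconfirmed assumptions: a real delimiter is a comma not followed by a space, and the only repair needed is a missing closing quote. The names `split_fields` and `balance_quotes` are made up for illustration, and the sample line is a shortened stand-in for a real record.

```perl
use strict;
use warnings;

# Assumption: a comma followed by a space is field content,
# any other comma is a real delimiter.
sub split_fields {
    my ($line) = @_;
    # The -1 limit keeps trailing empty fields.
    return split /,(?! )/, $line, -1;
}

# Assumption: a field with an odd number of double quotes is
# missing its closing quote, so one is appended.
sub balance_quotes {
    my ($line) = @_;
    my @fields = split_fields($line);
    for my $f (@fields) {
        my $quotes = () = $f =~ /"/g;   # count the double quotes in this field
        $f .= '"' if $quotes % 2;       # odd count: add the missing closer
    }
    return join ',', @fields;
}

# Shortened stand-in for a record with an embedded comma and no closing quote.
my $line = '001 GENE1="kinase (sgk) mRNA, complete cds,NM_019232,333,ResGen,,,,,29517';
print scalar( () = split_fields($line) ), " fields\n";
print balance_quotes($line), "\n";
```

Unlike the `ResGen`-anchored pattern, this does not depend on any particular field value, but it does nothing for a field whose opening quote is the one that is missing, and it breaks as soon as a real delimiter is followed by a space.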

