Ya has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I am stuck on this regular expression. The file has multiple lines, with one Name=Value record on each line. The Value is a comma-separated record with multiple fields. Missing fields are represented by consecutive commas. There is a known pattern in the middle of each line, which is as follows: the word ResGen preceded by a comma, which is preceded by a comma OR any word and a comma, which is preceded by a comma OR any word and a comma, which may or may not be preceded by a ". I need to see whether a " is present there or not. If the " is not present, I need to add a " in that location. I am including some sample lines from this file. The example lines do not show the exhaustive list of possibilities. Sample lines:
GENE1="Rattus norvegicus serum and glucocorticoid-regulated kinase (sgk) mRNA, complete cds,NM_019232,333,ResGen,ATP binding|protein serine/threonine kinase|protein amino acid phosphorylation,,,,29517
GENE2="ESTs, Weakly similar to putative serine/threonine protein kinase MAK-V [M.musculus]",NM_144755,331,ResGen,,,,,246273
GENE3="Thiosulfate sulphurtransferase (rhodanese)",X56228,329,ResGen,mitochondrion|sulfate transport|thiosulfate sulfurtransferase,,,,25274
GENE4="Spleen tyrosine kinase,NM_012758,327,ResGen,ATP binding|protein tyrosine kinase|intracellular signaling cascade|protein amino acid phosphorylation,,,,25155
GENE5="Spleen kinase 24,NM_012758,,ResGen,ATP binding|protein tyrosine kinase|intracellular signaling cascade|protein amino acid phosphorylation,,,,25155
Thanks & Best regards to all PerlMonks vemana

Replies are listed 'Best First'.
Re: Need method to create Regular expression for known pattern in the middle of a line
by tachyon (Chancellor) on Feb 10, 2003 at 07:58 UTC

    Just use split. You are just looking 3 commas back in the list so:

    while (<DATA>) {
        my ($name, $data) = split "=";
        my @data = split ',', $data;
        for my $i ( 0 .. $#data ) {
            next unless $data[$i] eq 'ResGen';
            # found a ResGen so see what we had 3 commas ago (do bounds check too)
            next if $i - 3 < 0;
            my $back_a_bit = $data[$i - 3];
            print chop($back_a_bit) eq '"' ? "$name: Found quote\n" : "$name: No quote\n";
        }
    }
    __DATA__
    GENE1="Rattus norvegicus serum and glucocorticoid-regulated kinase (sgk) mRNA, complete cds,NM_019232,333,ResGen,ATP binding|protein serine/threonine kinase|protein amino acid phosphorylation,,,,29517
    GENE2="ESTs, Weakly similar to putative serine/threonine protein kinase MAK-V [M.musculus]",NM_144755,331,ResGen,,,,,246273
    GENE3="Thiosulfate sulphurtransferase (rhodanese)",X56228,329,ResGen,mitochondrion|sulfate transport|thiosulfate sulfurtransferase,,,,25274
    GENE4="Spleen tyrosine kinase,NM_012758,327,ResGen,ATP binding|protein tyrosine kinase|intracellular signaling cascade|protein amino acid phosphorylation,,,,25155
    GENE5="Spleen kinase 24,NM_012758,,ResGen,ATP binding|protein tyrosine kinase|intracellular signaling cascade|protein amino acid phosphorylation,,,,25155
    __END__
    GENE1: No quote
    GENE2: Found quote
    GENE3: Found quote
    GENE4: No quote
    GENE5: No quote
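
    The split approach only reports whether the quote is there, while the question also asks for the quote to be added. A minimal sketch of that extra step follows, assuming (as in the sample lines) that the field to fix is always three positions before ResGen. The name add_missing_quote is a hypothetical helper, not something from the post.

```perl
use strict;
use warnings;

# Hypothetical helper (an illustration, not code from the thread): append a
# closing quote to the field three positions before 'ResGen' when that field
# does not already end in one. Assumes one Name=Value record per line.
sub add_missing_quote {
    my ($line) = @_;
    chomp $line;
    my ($name, $data) = split /=/, $line, 2;
    return $line unless defined $data;       # no '=' on this line, leave it alone
    my @data = split /,/, $data, -1;         # -1 keeps trailing empty fields
    for my $i (3 .. $#data) {                # starting at 3 is the bounds check
        next unless $data[$i] eq 'ResGen';
        $data[$i - 3] .= '"' unless $data[$i - 3] =~ /"$/;
        last;
    }
    return "$name=" . join(',', @data);
}

print add_missing_quote('GENE4="Spleen tyrosine kinase,NM_012758,327,ResGen,,,,,25155'), "\n";
```

    Because the line is split and re-joined on every comma, commas inside the description survive unchanged; only the field in front of the accession number gains a closing quote.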

    Update

    Fixed typo thanks to Hoffmator

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Need method to create Regular expression for known pattern in the middle of a line
by BrowserUk (Patriarch) on Feb 10, 2003 at 08:18 UTC

    This appears to work in the way I interpret your description, though there is at least one ambiguity in there so I may have dwiw'd the wrong way.

    #! perl -slw
    use strict;

    my $re_resgen = qr[(.)(,(?:\w+),(?:\w+),ResGen)];

    while (<DATA>) {
        s[$re_resgen][my $t=$1; $t.='"' unless $t eq '"'; $t.$2]e;
        print;
    }

    =pod output

    c:\test>234040
    001 GENE1="Rattus norvegicus serum and glucocorticoid-regulated kinase (sgk) mRNA, complete cds",NM_019232,333,ResGen,ATP binding|protein serine/threonine kinase|protein amino acid phosphorylation,,,,29517
    002 GENE2="ESTs, Weakly similar to putative serine/threonine protein kinase MAK-V [M.musculus]",NM_144755,331,ResGen,,,,,246273
    003 GENE3="Thiosulfate sulphurtransferase (rhodanese)",X56228,329,ResGen,mitochondrion|sulfate transport| thiosulfate sulfurtransferase,,,,25274
    004 GENE4="Spleen tyrosine kinase",NM_012758,327,ResGen,ATP binding|protein tyrosine kinase|intracellular signaling cascade|protein amino acid phosphorylation,,,,25155
    005 GENE5="Spleen kinase 24,NM_012758,,ResGen,ATP binding|protein tyrosine kinase|intracellular signaling cascade|protein amino acid phosphorylation,,,,25155

    =cut

    __DATA__
    001 GENE1="Rattus norvegicus serum and glucocorticoid-regulated kinase (sgk) mRNA, complete cds,NM_019232,333,ResGen,ATP binding|protein serine/threonine kinase|protein amino acid phosphorylation,,,,29517
    002 GENE2="ESTs, Weakly similar to putative serine/threonine protein kinase MAK-V [M.musculus]",NM_144755,331,ResGen,,,,,246273
    003 GENE3="Thiosulfate sulphurtransferase (rhodanese)",X56228,329,ResGen,mitochondrion|sulfate transport| thiosulfate sulfurtransferase,,,,25274
    004 GENE4="Spleen tyrosine kinase,NM_012758,327,ResGen,ATP binding|protein tyrosine kinase|intracellular signaling cascade|protein amino acid phosphorylation,,,,25155
    005 GENE5="Spleen kinase 24,NM_012758,,ResGen,ATP binding|protein tyrosine kinase|intracellular signaling cascade|protein amino acid phosphorylation,,,,25155

    Examine what is said, not who speaks.

    The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

      I am sorry for not mentioning this earlier: fields may be delimited with double quotes on only one side, and may contain commas as part of the field content.

      My problem is to identify fields that are delimited with a double quote on only one side, whether or not they contain commas, and then to delimit them with double quotes on both sides.

      regards
      Ya

        A few observations:

        It appears from your supplied sample of data that each record has nine comma-delimited fields, although some are blank and some of them do themselves contain commas.

        Where there are commas within a field, they are always followed by a space.

        There are no spaces on either side of a real comma delimiter.

        From your sample, only the first field appears to exhibit the unbalanced quotes problem.

        If you can confirm these observations as facts, it would allow a more generic solution to be provided.

        Is it your intention to only fix up the missing quotes where they are unbalanced, or do you also want to place quotes around other fields that contain spaces and thereby make your data readable using standard csv handlers?
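
        If those observations do hold, they could be put to work roughly as follows. This is only a sketch under those unconfirmed assumptions: a real delimiter is a comma not followed by a space, and only the first field can have an unbalanced quote. The name balance_first_field is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical helper, valid only if the observations above are confirmed:
#   * a "real" delimiter is a comma that is NOT followed by a space
#   * only the first field of the Value can have an unbalanced quote
# Expects a chomped Name=Value line.
sub balance_first_field {
    my ($line) = @_;
    my ($name, $value) = split /=/, $line, 2;
    return $line unless defined $value && length $value;

    # The negative lookahead keeps ", " inside a field; -1 keeps empty trailing fields.
    my @fields = split /,(?! )/, $value, -1;

    # Close the quote only when the field opens with one and does not end with one.
    $fields[0] .= '"' if $fields[0] =~ /^"/ && $fields[0] !~ /"$/;

    return $name . '=' . join(',', @fields);
}

print balance_first_field('GENE5="Spleen kinase 24,NM_012758,,ResGen,ATP binding,,,,25155'), "\n";
```

        Unlike a pattern anchored on ResGen, this version does not depend on how many fields sit between the description and ResGen, but it breaks as soon as a real delimiter is followed by a space.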


        Examine what is said, not who speaks.

        The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

Re: Need method to create Regular expression for known pattern in the middle of a line
by Enlil (Parson) on Feb 10, 2003 at 07:52 UTC
    This should suit your needs (it only adds the " if the pattern matches; otherwise it leaves the line alone):
    use strict;
    use warnings;

    while ( <DATA> ) {
        s/([^"])(,\w*,\w*,ResGen,)/$1"$2/;
        print;
    }
    __DATA__
    GENE1="Rattus norvegicus serum and glucocorticoid-regulated kinase (sgk) mRNA, complete cds,NM_019232,333,ResGen,ATP binding|protein serine/threonine kinase|protein amino acid phosphorylation,,,,29517
    GENE2="ESTs, Weakly similar to putative serine/threonine protein kinase MAK-V [M.musculus]",NM_144755,331,ResGen,,,,,246273
    GENE3="Thiosulfate sulphurtransferase (rhodanese)",X56228,329,ResGen,mitochondrion|sulfate transport|thiosulfate sulfurtransferase,,,,25274
    GENE4="Spleen tyrosine kinase,NM_012758,327,ResGen,ATP binding|protein tyrosine kinase|intracellular signaling cascade|protein amino acid phosphorylation,,,,25155
    GENE5="Spleen kinase 24,NM_012758,,ResGen,ATP binding|protein tyrosine kinase|intracellular signaling cascade|protein amino acid phosphorylation,,,,25155
    This assumes a lot about your data, mainly that the only characters possible between the commas lie in the character class [A-Za-z0-9_].

    -enlil

      The problem with your regex (which may or may not be a real problem) is that the \w* limits what is included: a ( or | etc. will cause an aberrant failure. It would be much more robust to use [^,]*, as this includes everything except the comma separator.
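
      A minimal sketch of that change, with the substitution wrapped in a hypothetical sub named quote_before_resgen so it can be tried on single lines. It still assumes that exactly two fields sit between the description and ResGen.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution above, with \w* swapped for
# [^,]* so that the two fields before ResGen may hold any character except
# the comma separator (or may be empty).
sub quote_before_resgen {
    my ($line) = @_;
    $line =~ s/([^"])(,[^,]*,[^,]*,ResGen,)/$1"$2/;
    return $line;
}

print quote_before_resgen('GENE5="Spleen kinase 24,NM_012758,,ResGen,ATP binding,,,,25155'), "\n";
```

      Lines that already have the closing quote in place are left untouched, because [^"] cannot match the character in front of the first of the three commas.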

      cheers

      tachyon

      s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

        I am sorry for not mentioning this earlier: fields may be delimited with double quotes on only one side, and may contain commas as part of the field content.

        My problem is to identify fields that are delimited with a double quote on only one side, whether or not they contain commas, and then to delimit them with double quotes on both sides.

        regards
        Ya