in reply to Regexps for Parsing Brackets in Chemical Formulae

I think you've got the right idea using 1 while s///, because you're matching from the inside out rather than left to right. Here's one way to do the whole substitution all at once:

    1 while s/\(([^\(\)]+)\)((?:\d+)?)/ $1 x ($2 || 1) /ge;

This matches a parenthesized substring that does not itself contain any parentheses, and optionally a subsequent number, and replaces it with the substring, minus the parentheses, repeated the appropriate number of times.
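For anyone who wants to try the inside-out expansion on its own, here is a minimal sketch. The sub name expand_formula is made up for illustration, and the sample formulae are just test strings. Note that because of ($2 || 1), a multiplier of 0 would be treated as 1.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): expand every
# parenthesized group in a formula, innermost groups first.
sub expand_formula {
    my ($formula) = @_;
    # Each pass expands the groups that contain no further parentheses;
    # the loop repeats until no substitution is made.
    1 while $formula =~ s/\(([^()]+)\)(\d*)/$1 x ($2 || 1)/ge;
    return $formula;
}

print expand_formula('Ca(OH)2'), "\n";          # prints CaOHOH
print expand_formula('Mo(P(H)3)4(CO)'), "\n";
```

The second call needs two passes: the first turns (H)3 into HHH and (CO) into CO, and the second expands the now paren-free (PHHH)4.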

Re: Re: Regexps for Parsing Brackets in Chemical Formulae
by Elgon (Curate) on Nov 03, 2001 at 22:20 UTC

    Many thanks to Chipmunk and other folks,

    I'll go away and play with these suggestions, which seem quite groovy (insofar as I can tell which ain't that far!) The reason for all of this is sort of related to my final-year project but not actually included in it (the project is in PHP): My tutor wrote a routine to do this kind of thing, which took him ages in some other language and I'm trying to introduce him to the power of Perl (and by extension, Perlmonks.)

    In the virtual bar of pm I owe you all a pint.

    Elgon

    "Without evil there can be no good, so it must be good to be evil sometimes."
    --Satan, South Park: Bigger, Longer, Uncut.

      You were close. That should do it:
          use strict;
          my %count;

          # added gratuitous parentheses for the sake of testing embedded formulae.
          $_ = 'Mo(P(H)3)4(CO)(NH2C2(H)5)';

          # at each iteration, expand the subformula with the rightmost left parenthesis.
          # quit when there are no more parentheses.
          s/(.*)\((.*?)\)(\d*)/$1 . $2 x ($3 ? $3 : 1) /e while m/\(/;

          # count each element symbol, consuming the string as we go.
          s/([A-Z](?:[a-z])?)(\d*)/ $count{$1} += $2 ? $2 : 1 ;''/eg;

          printf "%-2s %3d\n", $_, $count{$_} for sort keys %count;

      It prints:

          C    3
          H   19
          Mo   1
          N    1
          O    1
          P    4

      -- stefp
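As a variation on the two steps above (expand the groups, then tally the symbols), here is a rough sketch that wraps them in a sub and returns the counts instead of using $_ and a global hash. The name count_atoms is invented for this example, and it uses the inside-out expansion from earlier in the thread rather than stefp's rightmost-parenthesis loop.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): return a hash
# reference mapping element symbol => number of atoms in the formula.
sub count_atoms {
    my ($formula) = @_;

    # Step 1: expand parenthesized groups, innermost first.
    1 while $formula =~ s/\(([^()]+)\)(\d*)/$1 x ($2 || 1)/ge;

    # Step 2: walk the flat formula, adding each symbol's count
    # (a missing count means 1).
    my %count;
    while ($formula =~ /([A-Z][a-z]?)(\d*)/g) {
        $count{$1} += $2 || 1;
    }
    return \%count;
}

my $count = count_atoms('Mo(P(H)3)4(CO)(NH2C2(H)5)');
printf "%-2s %3d\n", $_, $count->{$_} for sort keys %$count;
```

Run on the same test formula, this gives the same tallies as the listing above (C 3, H 19, Mo 1, N 1, O 1, P 4).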

        Stefp,

        Muchas gracias - one minor alteration to take account of the fact that certain artificial elements have, under certain nomenclatures, three letters rather than one or two...

        s/([A-Z](?:[a-z]{0,2})?)(\d*)/  $count{$1} += $2 ? $2  : 1 ;''/eg;
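To see what the alteration changes, here is a small sketch comparing the two symbol patterns. The sub name symbols_in and the test string 'Uuo2' are made up for illustration (Uuo being a three-letter systematic placeholder symbol); it is not meant as a real compound.

```perl
use strict;
use warnings;

# The original pattern (at most one lowercase letter) and the altered
# pattern (up to two lowercase letters), each followed by an optional count.
my $two_letter   = qr/([A-Z][a-z]?)\d*/;
my $three_letter = qr/([A-Z][a-z]{0,2})\d*/;

# Hypothetical helper (name invented for this sketch): list the element
# symbols a given pattern finds in a formula.
sub symbols_in {
    my ($formula, $re) = @_;
    my @symbols = $formula =~ /$re/g;
    return @symbols;
}

print join(',', symbols_in('Uuo2', $two_letter)),   "\n";   # prints Uu
print join(',', symbols_in('Uuo2', $three_letter)), "\n";   # prints Uuo
```

With the shorter pattern the trailing "o" is silently skipped and the symbol is miscounted as "Uu", which is the problem the {0,2} quantifier avoids.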

        Otherwise, perfect!

        Ta, Elgon.

Re: Re: Regexps for Parsing Brackets in Chemical Formulae
by monkfish (Pilgrim) on Nov 03, 2001 at 20:50 UTC
    Chipmunk, nice solution, but you don't need to escape the parentheses inside the bracketed character class. Inside [...], ( and ) are treated as literal characters.
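A quick way to check this is to run both spellings of the character class over the same strings. This is only a sketch; the sub name first_group and the variable names are invented for the example.

```perl
use strict;
use warnings;

# Same pattern twice: once with the parentheses escaped inside the
# character class, once without. Outside the class, \( and \) must
# still be escaped to match literal parentheses.
my $escaped   = qr/\(([^\(\)]+)\)/;
my $unescaped = qr/\(([^()]+)\)/;

# Hypothetical helper (name invented for this sketch): return the contents
# of the first parenthesized group that contains no nested parentheses.
sub first_group {
    my ($formula, $re) = @_;
    my ($group) = $formula =~ $re;
    return $group;
}

for my $formula ('Ca(OH)2', 'Mo(P(H)3)4') {
    printf "%-12s escaped: %-3s unescaped: %s\n",
        $formula,
        first_group($formula, $escaped),
        first_group($formula, $unescaped);
}
```

Both patterns pick out OH from the first string and H from the second, so the backslashes inside the class are harmless but unnecessary.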

    -monkfish (The Fishy Monk)