Hi Folks,
I've got a dandy little regexp-related problem for you all: I am writing a little module which takes a molecular formula and converts it into a hash where the keys are a unique list of elemental constituents and the values are the number of atoms present in the molecule. Sounds easy - believe me, if you're as naff as I am at regexps it ain't!
So we have our formula in $formula... First we want to get rid of bracket pairs without coefficients next to them, so I thought of something like this...
1 while $formula =~ s/(\()([A-Za-z0-9()]+)(\)\D)/$2/e;
but this can go wrong: in some cases the greedy matching pairs an opening bracket with the wrong closing bracket and chops out brackets which don't belong together... Help!
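One possible way round the greedy-match problem, offered as a sketch rather than a definitive answer: forbid brackets inside the captured text with `[^()]+`, so that only an innermost pair can ever match, and use a negative lookahead `(?!\d)` to skip pairs that carry a coefficient without consuming the character after the closing bracket. The sub name `strip_plain_brackets` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: remove bracket pairs that have no coefficient.
sub strip_plain_brackets {
    my ($formula) = @_;
    # [^()]+ cannot cross another bracket, so only innermost pairs match.
    # (?!\d) rejects a pair that is followed by a digit, without consuming
    # whatever character comes next.
    1 while $formula =~ s/\(([^()]+)\)(?!\d)/$1/;
    return $formula;
}

print strip_plain_brackets('Mo(PH3)4(CO)(NH2C2H5)'), "\n";   # Mo(PH3)4CONH2C2H5
```

Note that an outer pair which still contains a bracketed group with a coefficient is left alone by this step, because its contents are not yet bracket-free.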
Then we want to expand brackets which are followed by a coefficient of two or more. (If they're followed by 1 as a coefficient - and they shouldn't really be - then they effectively don't have a coefficient and should just have the brackets removed.) In this case we should repeat what's inside the brackets that many times when we multiply them out (which the following may or (more likely) may not do!)
1 while $formula =~ s/(\()([A-Za-z0-9()]+)(\))([0-9]+)/$2 x $4/e;
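A minimal sketch of the multiplying-out step, assuming the same innermost-first trick: with the `/e` modifier the replacement is evaluated as Perl code, so the repetition operator `x` can repeat the bracket contents. The sub name `expand_coefficients` is hypothetical. A coefficient of 1 simply yields one copy, which amounts to removing the brackets.

```perl
use strict;
use warnings;

# Hypothetical helper: replace "(group)N" with N copies of "group".
sub expand_coefficients {
    my ($formula) = @_;
    # $1 is the bracket-free contents, $2 the coefficient.
    # "$1 x $2" is string repetition, evaluated because of /e.
    1 while $formula =~ s/\(([^()]+)\)(\d+)/$1 x $2/e;
    return $formula;
}

print expand_coefficients('Mo(PH3)4CONH2C2H5'), "\n";   # MoPH3PH3PH3PH3CONH2C2H5
```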
Once these two tasks are done I have a way of doing the rest, but I cannot work out the correct regexps to do the above tasks - I just don't have the knowledge, the experience or a copy of "Mastering Regular Expressions"!
Just to clarify, if we have the following formula... Mo(PH3)4(CO)(NH2C2H5) for example, it should become... Mo(PH3)4CONH2C2H5 after the first regexp and then MoPH3PH3PH3PH3CONH2C2H5 at the end, which I can parse nicely myself. Note that if you have a series of brackets... (...(...)...(...)...) they need to be processed in the correct order, which really has me scratching my head I can tell you.
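For the ordering problem with nested brackets, one approach (a sketch under assumptions, not the only way) is to fold both steps into a single substitution that always works on an innermost pair and treats a missing coefficient as 1. Repeating it until nothing matches resolves the nesting from the inside out. The sub names `flatten_formula` and `count_atoms` are invented for this example, and the element pattern assumes symbols of one capital letter plus at most one lowercase letter.

```perl
use strict;
use warnings;

# Hypothetical helper: expand every bracket pair, innermost first.
sub flatten_formula {
    my ($formula) = @_;
    # (\d*) captures an optional coefficient; empty means "one copy".
    1 while $formula =~ s/\(([^()]*)\)(\d*)/$1 x ($2 eq '' ? 1 : $2)/e;
    return $formula;
}

# Hypothetical helper: turn a formula into an element => count hash.
sub count_atoms {
    my ($formula) = @_;
    my %count;
    my $flat = flatten_formula($formula);
    # Assumed element syntax: capital letter, optional lowercase letter,
    # optional count (missing count means 1).
    while ($flat =~ /([A-Z][a-z]?)(\d*)/g) {
        $count{$1} += $2 eq '' ? 1 : $2;
    }
    return %count;
}

my %atoms = count_atoms('Mo(PH3)4(CO)(NH2C2H5)');
print "$_ => $atoms{$_}\n" for sort keys %atoms;
```

For the example formula this gives C 3, H 19, Mo 1, N 1, O 1 and P 4.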
I will bow in deep respect to anyone who can give me a hand on this one as it has got me a bit stumped. (For the record it is not for an assessed piece of work - I am a chemist after all - but a mixture of general interest and boredom.) Virtual beer to you!
"Without evil there can be no good, so it must be good to be evil sometimes."
--Satan, South Park: Bigger, Longer, Uncut.
In reply to Regexps for Parsing Brackets in Chemical Formulae by Elgon