Regular Expressions and atomic weights

hokie has asked for the wisdom of the Perl Monks concerning the following question:

Re: Regular Expressions and atomic weights by Zaxo (Archbishop) on Jul 25, 2005 at 16:48 UTC
Chemistry::Mol seems to do all that, including parsing the formula.

    $ perl -MChemistry::File::Formula -e'my $mol = Chemistry::Mol->parse("Pb(CO3)2", format => "formula"); print $mol->mass, $/'
    327.2178

After Compline, Zaxo
Re: Regular Expressions and atomic weights by davidrw (Prior) on Jul 25, 2005 at 16:46 UTC
Looks like Chemistry::MolecularMass is exactly what you're looking for.
Re: Regular Expressions and atomic weights by blokhead (Monsignor) on Jul 25, 2005 at 17:26 UTC
Your question has been answered well, but since my brain is in parsing mode, I thought I'd bring this up. The reason it's hard to do this with a regex is that (barring `(??{code})` directives in the regex) you can't match arbitrarily deep nested parentheses using regular expressions. Since your data format supports nesting things in parentheses, parsing is a better solution.

Writing a parser for such simple notation is not that hard. What you can do to make it even easier is to combine the weight calculations with the actual parsing. This is called syntax-directed evaluation. You don't see syntax-directed evaluation much in the parsing of programming languages, but it works well for simpler expression languages where each part of the expression has a value and you are parsing the expression for the sole purpose of computing its final value (think of a simple math expression calculator).

    use Parse::RecDescent;
    use List::Util 'sum';
    use vars '%weights';

    %weights = qw( C 12 O 16 Pb 207 );

    my $g = Parse::RecDescent->new(<<'END_GRAMMAR');

    weight:   compound          { $item[1] }
    compound: group(s)          { ::sum( @{$item[1]} ) }
    group:    element /\d+/     { $item[1] * $item[2] }
         |    element           { $item[1] }
    element:  /[A-Z][a-z]*/     { $::weights{ $item[1] } }
           |  "(" compound ")"  { $item[2] }

    END_GRAMMAR

    print $g->weight("Pb(CO3)2"), $/;   # prints 327

This is probably what those other CPAN modules are doing. Actually, since they do more than just compute the weight, they probably parse the chemical formula into a tree structure first, and do the weight calculation on that tree. If you only need the weights, you can save yourself having to use an awkward intermediate tree representation.

blokhead
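For anyone who wants to see syntax-directed evaluation without the Parse::RecDescent dependency, here is a minimal hand-written sketch of the same idea in core Perl (5.10 or later for `//`). It is not what the CPAN modules do; the sub names (`parse_compound`, `parse_group`, `formula_weight`) and the rounded `%weights` table are made up for illustration. Each parsing sub returns the weight of what it just consumed, so no tree is ever built.

```perl
use strict;
use warnings;

# Rounded illustrative weights (assumption); extend as needed.
my %weights = ( H => 1, C => 12, O => 16, Pb => 207 );

# compound := group+
# Returns the summed weight of the consecutive groups at the current position.
sub parse_compound {
    my ($str_ref) = @_;
    my $total = 0;
    my $seen  = 0;
    while ( defined( my $w = parse_group($str_ref) ) ) {
        $total += $w;
        $seen++;
    }
    die "Expected an element or '(' at offset " . ( pos($$str_ref) // 0 ) . "\n"
        unless $seen;
    return $total;
}

# group := ( element | "(" compound ")" ) count?
# Returns undef, without consuming input, if no group starts here.
sub parse_group {
    my ($str_ref) = @_;
    my $w;
    if ( $$str_ref =~ /\G([A-Z][a-z]*)/gc ) {
        $w = $weights{$1} // die "Unknown element '$1'\n";
    }
    elsif ( $$str_ref =~ /\G\(/gc ) {
        $w = parse_compound($str_ref);    # recursion handles the nesting
        $$str_ref =~ /\G\)/gc
            or die "Missing ')' at offset " . pos($$str_ref) . "\n";
    }
    else {
        return undef;
    }
    my $count = $$str_ref =~ /\G(\d+)/gc ? $1 : 1;
    return $w * $count;
}

sub formula_weight {
    my ($formula) = @_;
    pos($formula) = 0;
    my $total = parse_compound( \$formula );
    die "Unexpected input at offset " . pos($formula) . "\n"
        if pos($formula) < length $formula;
    return $total;
}

print formula_weight("Pb(CO3)2"), "\n";    # 207 + 2*(12 + 3*16) = 327
```

The `\G` anchor with the `/gc` flags lets every sub pick up exactly where the previous match stopped, which is what makes the recursion work on a single shared string.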
Re^2: Regular Expressions and atomic weights by ikegami (Patriarch) on Jul 25, 2005 at 18:25 UTC
Below is a refactoring that avoids using stuff from `main`. Using stuff from `main` prevents you from precompiling your grammar into a module, and it also makes the script incompatible with mod_perl. (I also lined up the productions.)

    use Parse::RecDescent;

    my $g = Parse::RecDescent->new(<<'END_GRAMMAR');

    {
       use List::Util 'sum';
       use vars '%weights';
       %weights = qw( C 12 O 16 Pb 207 );
    }

    weight   : compound          { $item[1] }
    compound : group(s)          { sum( @{$item[1]} ) }
    group    : element /\d+/     { $item[1] * $item[2] }
             | element           { $item[1] }
    element  : /[A-Z][a-z]*/     { $weights{ $item[1] } }
             | "(" compound ")"  { $item[2] }

    END_GRAMMAR

    print $g->weight("Pb(CO3)2"), $/;   # prints 327

I was planning on doing a P::RD solution for the OP, but I abandoned the idea when others pointed to existing specialized modules. Thanks for filling in the gap.

Update: The common start of both `group` productions is very inefficient, because `element` gets parsed a second time whenever no count follows it. Fix:

    weight   : compound          { $item[1] }
    compound : group(s)          { sum( @{$item[1]} ) }
    group    : element factor    { $item[1] * $item[2] }
    factor   : /\d+/             { $item[1] }
             |                   { 1 }
    element  : /[A-Z][a-z]*/     { $weights{ $item[1] } }
             | "(" compound ")"  { $item[2] }
Re^3: Regular Expressions and atomic weights by Your Mother (Archbishop) on Jul 25, 2005 at 19:26 UTC
That's hot™. Thanks to you and blokhead both for not ceasing to solve the problem. That's one of the more concise and edifying Parse::RecDescent examples I've seen.
Re: Regular Expressions and atomic weights by ikegami (Patriarch) on Jul 25, 2005 at 16:47 UTC
One way is to convert the string into Perl code:

    my %atom_weights = ( Pb => ..., C => ..., O => ..., ... );

    $_ = "Pb(CO3)2";
    print("$_\n");
    s/([0-9]+)/*$1/g;
    s/([A-Z][a-z]*)/ ($atom_weights{$1} or die("Bad element $1\n") ) . '+' /eg;
    s/\+(?=\*|\)|$)//g;
    print("$_\n");
    print(eval($_), "\n");

Of course, using `eval` is dangerous unless you validate your input.

Update: Fixed code. ` => +, ** => *`
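To make the validation point concrete, here is a hedged sketch of one way to guard the `eval`. It is a variant of the translation above, not the same code: it puts a `+` in front of every term instead of trimming trailing ones, and the sub name `weight_via_eval` and the rounded weights are assumptions for illustration. The idea is that only strings matching a strict whitelist ever reach `eval`, and whatever is generated consists solely of digits, `+`, `*` and parentheses.

```perl
use strict;
use warnings;

# Rounded illustrative weights (assumption); a real table would cover every element.
my %atom_weights = ( H => 1, C => 12, O => 16, Pb => 207 );

sub weight_via_eval {
    my ($formula) = @_;

    # Whitelist the shape of the input: an element symbol or ')' with an
    # optional count, or a '(' that is followed by something to group.
    $formula =~ /\A(?:(?:[A-Z][a-z]*|\))[0-9]*|\((?=[A-Z(]))+\z/
        or die "Malformed formula '$formula'\n";

    # Translate to arithmetic: every term gets a leading '+', every count
    # becomes a multiplication. Unknown symbols abort before eval is reached.
    my $expr = $formula;
    $expr =~ s{([A-Z][a-z]*)|([0-9]+)|\(}{
          defined $1 ? '+' . ( $atom_weights{$1} // die "Bad element $1\n" )
        : defined $2 ? "*$2"
        :              '+('
    }ge;

    my $weight = eval $expr;    # unbalanced parentheses surface as a syntax error
    die "Could not evaluate '$expr': $@" if $@;
    return $weight;
}

print weight_via_eval("Pb(CO3)2"), "\n";    # evaluates +207+(+12+16*3)*2, prints 327
```

The whitelist does not check that parentheses balance; that is left to `eval`, whose error is caught and reported.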
Re: Regular Expressions and atomic weights by polypompholyx (Chaplain) on Jul 25, 2005 at 19:31 UTC
I wrote a calculator module that does exactly this for chemical formula strings. It's my pet wheel-reinvention, but the RMM (relative molecular mass) thing has actually been very useful (I'm a biochemistry lecturer). I would post the code, but it's a bit huge: just look in the `Chemistry.pm` module in the tarball. It's actually an extension to a more general calculator thing, but you'll probably find the Parse::RecDescent grammar useful: as other posters have said, a regex cannot parse general chemical formulae, because they are inherently nested (it's the same reason regexes can't be used to parse HTML in anything but the ugliest hacks).

Some general things to consider are:

Do you need the grammar to understand complicated things like Fe2(SO4)3.9H2O? If the answer to this is "yes", you need a `Parse::RecDescent`-style (context-free) grammar: regexes will not work.

Does it need to understand common shorthands like Et, Me, Ph and Ac?

Does it need to understand H, T, D and the hideous nomenclatural mess of the transactinides?

You may find it easiest to think of the formulae as objects: each chemical element is a tiny hash-based object, so parsing 'H' would return something along the lines of `bless { 'H' => 1 }, $class`. You can then think of CuSO4 literally as Cu + S + 4*O, and use overloaded `add` and `multiply` method calls on the objects. My code does something gnarly to generate a sort of assembler for the world's slowest virtual machine: I wouldn't recommend cutting-and-pasting it! Calculating the RMM is then a simple matter of walking through the object's innards with a `while (my ($elem, $count) = each %$self )` loop and using a `%rmm` hash of `$element => $rmm` pairs.

Hope this helps.
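The object idea described above can be sketched in a few lines. This is not the code from the tarball; the package name `Formula`, its method names, and the rounded masses are all assumptions made for illustration. It shows only the overloading and the `each` walk, with the formula built by hand rather than by a parser.

```perl
package Formula;
use strict;
use warnings;
use overload
    '+' => \&add,
    '*' => \&multiply;

# Rounded relative atomic masses, for illustration only (assumption).
my %rmm = ( H => 1, O => 16, S => 32, Cu => 63.5 );

# Formula->new('H') returns bless { H => 1 }, $class
sub new {
    my ( $class, $symbol ) = @_;
    return bless { $symbol => 1 }, $class;
}

# Formula + Formula: merge the element counts into a new object.
sub add {
    my ( $self, $other ) = @_;
    my %sum = %$self;
    $sum{$_} += $other->{$_} for keys %$other;
    return bless \%sum, ref $self;
}

# Formula * integer: scale every count. overload passes the object first
# even for "4 * $O", so the operand order does not matter here.
sub multiply {
    my ( $self, $n ) = @_;
    my %scaled = map { $_ => $self->{$_} * $n } keys %$self;
    return bless \%scaled, ref $self;
}

# Walk the element counts and total the mass.
sub rmm {
    my ($self) = @_;
    my $total = 0;
    while ( my ( $elem, $count ) = each %$self ) {
        $total += ( $rmm{$elem} // die "Unknown element $elem\n" ) * $count;
    }
    return $total;
}

package main;

my ( $Cu, $S, $O ) = map { Formula->new($_) } qw( Cu S O );
my $CuSO4 = $Cu + $S + 4 * $O;
print $CuSO4->rmm, "\n";    # 63.5 + 32 + 4*16 = 159.5
```

A parser's actions would then return `Formula` objects instead of numbers, and the same object could answer other questions (element counts, percentage composition) without reparsing.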
Re^2: Regular Expressions and atomic weights by ikegami (Patriarch) on Jul 26, 2005 at 00:19 UTC
For fun, a regexp solution (the code, about 3 kB, is collapsed in the original post). It would have been much simpler if `$compound` didn't require an accumulator and wasn't reentrant. (Either is OK. Both together make a mess.) That's the reason behind the whole symtab business.

A simpler solution (also collapsed, about 3 kB) doesn't work. It prints "The weight of Pb(CO3)2 is 384." (instead of 327) because `$rv_group` gets clobbered.
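The collapsed code is not shown here, so the following is not a reconstruction of it. It is a separate minimal sketch of one common way around the "accumulator plus reentrancy" problem: use plain regexes only as a tokenizer and keep one accumulator per nesting level on an explicit stack, so an inner group can never clobber the outer total. The sub name `weight_with_stack` and the rounded weights are assumptions for illustration.

```perl
use strict;
use warnings;

# Rounded illustrative weights (assumption).
my %weights = ( C => 12, O => 16, Pb => 207 );

sub weight_with_stack {
    my ($formula) = @_;
    my @stack = (0);    # one accumulator per open nesting level
    my $last;           # weight of the most recent element or closed group

    pos($formula) = 0;
    while ( pos($formula) < length $formula ) {
        if ( $formula =~ /\G([A-Z][a-z]*)/gc ) {
            $last = $weights{$1} // die "Unknown element '$1'\n";
            $stack[-1] += $last;
        }
        elsif ( $formula =~ /\G(\d+)/gc ) {
            defined $last or die "Count '$1' does not follow an element or ')'\n";
            # The unit was already added once; add the remaining copies.
            $stack[-1] += $last * ( $1 - 1 );
            undef $last;
        }
        elsif ( $formula =~ /\G\(/gc ) {
            push @stack, 0;    # fresh accumulator for the new group
            undef $last;
        }
        elsif ( $formula =~ /\G\)/gc ) {
            @stack > 1 or die "Unbalanced ')'\n";
            $last = pop @stack;    # the finished group becomes the current unit
            $stack[-1] += $last;
        }
        else {
            die "Unexpected character at offset " . pos($formula) . "\n";
        }
    }
    die "Unbalanced '('\n" if @stack > 1;
    return $stack[0];
}

print "The weight of Pb(CO3)2 is ", weight_with_stack("Pb(CO3)2"), ".\n";    # 327
```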
Re^3: Regular Expressions and atomic weights by hokie (Monk) on Jul 26, 2005 at 12:58 UTC
Thanks everyone, I've gained a lot of wisdom about this sort of subject and a solution to my current problem.