arunhorne has asked for the wisdom of the Perl Monks concerning the following question:

OK, here's a weird one I came across.

I have a list of chemical names; something like:

3 L-homoserine 1 L-iditol

As is the way with chemical names, they have lots of synonyms, so I am replacing these with regular expressions. My problem arises because there may be an arbitrary numerator in front of the chemical name (as they came from a formula). This needs to be preserved, as the formulas will be rebuilt.

The synonym for L-iditol is L-Iditol (just a case change) so this is fine being replaced by:

s/(\d+ *)L\-iditol/\1L\-Iditol/

Simple and works... however, the synonym for L-homoserine is in fact 2-Amino-4-hydroxybutyric acid.

So I write the regex:

s/(\d+ *)L\-homoserine/\12\-Amino\-4\-hydroxybutyric\ acid/

When applied to L-homoserine I get a blank line then '-Amino-4-hydroxybutyric acid' (note the missing 2). Note that in the regular expression the replacement string begins \12 . The \1 bit includes the numerator (3 in the above example) but the 2 is actually part of the new string. When combined they don't work like this. I consulted my table of ASCII values and found that \012 is in fact the code for newline...

Has anyone had this problem before? How can I force \12 to behave as I want it rather than producing a new line?

Looking forward to resolving this strange one.

Arun

Re: Regular expressions Containing Octal values?
by jmcnamara (Monsignor) on May 17, 2002 at 09:44 UTC

    Use ${1} instead of \1:

        s/(\d+ *)L\-homoserine/${1}2\-Amino\-4\-hydroxybutyric\ acid/;

    It is best not to use \1 on the right hand side of a s///. See also "Warning on \1 vs $1" in perlre.

    You can omit the escapes to make it a little cleaner:

        s/(\d+ *)L-homoserine/${1}2-Amino-4-hydroxybutyric acid/;

    Also, the match doesn't work for cases with no leading number (you seem to suggest that this is possible). Therefore, something like this might be better:

        s/(\d*\s*)L-homoserine/${1}2-Amino-4-hydroxybutyric acid/;
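To see the difference side by side, here is a minimal sketch. The helper name canonicalise is made up for illustration; the substitution inside it is the last one suggested above. The second half repeats the original \12 form, which Perl appears to read as an octal escape (a newline) because the \1 is followed by another digit.

```perl
use strict;
use warnings;

# Hypothetical helper: swap in the synonym while keeping any leading
# coefficient (optional digits plus optional whitespace).
sub canonicalise {
    my ($name) = @_;
    # The braces in ${1} end the variable name before the literal "2".
    $name =~ s/(\d*\s*)L-homoserine/${1}2-Amino-4-hydroxybutyric acid/;
    return $name;
}

print canonicalise('3 L-homoserine'), "\n";
print canonicalise('L-homoserine'),   "\n";

# For contrast: the original replacement. "\12" is taken as octal 012,
# so the coefficient and the leading "2" are both lost.
my $broken = '3 L-homoserine';
$broken =~ s/(\d+ *)L-homoserine/\12-Amino-4-hydroxybutyric acid/;
print "broken form starts with a newline\n" if $broken =~ /\A\n/;
```

The first two lines printed should keep the coefficient (when there is one) and the leading 2 of the new name.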

    --
    John.

      Thanks John. However when I include ${1} in my regular expressions I now get a lot of "Use of uninitialised value" errors. Any ideas?
        Sorry, that was my fault. I was generating text strings as regular expressions (I have a separate script that executes a file containing a set of regexes against another file) and forgot to escape the \${1}. Sorry.
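For anyone generating substitutions as text in the same way, a rough sketch of the escaping issue follows. How the separate script stores and runs its rules is not shown in this thread, so the variable names and the string eval here are assumptions made purely for illustration.

```perl
use strict;
use warnings;

# Assumed inputs: one synonym pair to turn into a substitution rule.
my ($from, $to) = ('L-homoserine', '2-Amino-4-hydroxybutyric acid');

# In a double-quoted string, an unescaped ${1} would be interpolated by the
# generating script itself (usually undef, hence the warnings).
# Escaping the dollar sign keeps ${1} literal in the generated text.
my $rule = "s/(\\d*\\s*)$from/\${1}$to/;";
print "$rule\n";

# Applying the generated rule to a line (string eval, for illustration only).
my $line = '3 L-homoserine';
{
    local $_ = $line;
    eval $rule;
    die $@ if $@;
    $line = $_;
}
print "$line\n";
```

Building the rule with single quotes and concatenation would avoid the escaping question altogether.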
      No. $1 is the value of the first subexpression from the previous successful match, which will be interpolated during the compilation of this regular expression. Definitely not what you want.

      I give some solutions in my other node in this thread.

      -- Randal L. Schwartz, Perl hacker


      update: And of course, that only makes sense if we're talking about the LHS, but it's the RHS in this example. Sorry.

        I don't get it. Isn't the previous successful match at the LHS of the s///? Isn't that interpolated into this regex? Like this:
        #!/usr/bin/perl -wl
        use strict;

        $_ = "3 L-homoserine";
        /(\w-\w)/;
        print $1;
        s/(\d*\s*)L-homoserine/${1}2-Amino-4-hydroxybutyric acid/;
        print;

        $_ = "L-homoserine";
        s/(\d*\s*)L-homoserine/${1}2-Amino-4-hydroxybutyric acid/;
        print;

        __END__
        Prints:
        L-h
        3 2-Amino-4-hydroxybutyric acid
        2-Amino-4-hydroxybutyric acid

        --
        John.

Re: Regular expressions Containing Octal values?
by Zaxo (Archbishop) on May 17, 2002 at 11:23 UTC

    How about another approach altogether? Try a hash with all the different names as keys and a reference to some canonical name or data table row as value.

    my $foostuff = '2-Amino-4-hydroxybutyric acid';
    my $barstuff = 'ethanol';
    my %substances = (
        'L-homoserine' => \$foostuff,
        'L-Homoserine' => \$foostuff,
        'beer'         => \$barstuff,
    );

    I think that will be much simpler to deal with than a similar number of regexen.
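A minimal sketch of how such a lookup might be used while keeping the leading coefficient. It uses plain strings as values rather than references, and the table contents, the name canonical_term, and the rule that a coefficient is digits followed by whitespace are all assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical synonym table: each known spelling maps to a canonical name.
my %canonical = (
    'L-homoserine' => '2-Amino-4-hydroxybutyric acid',
    'L-Homoserine' => '2-Amino-4-hydroxybutyric acid',
    'L-iditol'     => 'L-Iditol',
);

# Split off an optional coefficient (digits plus whitespace), then look the
# remaining name up; unknown names pass through unchanged.
sub canonical_term {
    my ($term) = @_;
    my ($coeff, $name) = $term =~ /^(\d+\s+)?(.+)$/;
    $coeff = '' unless defined $coeff;
    return $coeff . (exists $canonical{$name} ? $canonical{$name} : $name);
}

print canonical_term($_), "\n" for '3 L-homoserine', '1 L-iditol', 'ethanol';
```

Requiring whitespace after the coefficient means a name that itself begins with a digit, such as 2-Amino-4-hydroxybutyric acid, is not mistaken for a coefficient plus a name.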

    By the way, you don't need to deal with numbers as ASCII values; numeral strings are fine.

    After Compline,
    Zaxo

      One good thing about this hash approach is that, when you want to "rebuild" with the original names, you can just reverse the hash to build the "reverse lookup table."
      %reverse_hash = reverse %hash;
      This won't work if the values in the original hash aren't unique, of course. (I don't know chemistry, so I don't know if that's the case). Plus, you'd probably need to use actual strings as values for the original hash, as well.
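A small sketch of both cases, with made-up table contents: a plain reverse when every value is unique, and a list-per-value inversion when several synonyms share one canonical name.

```perl
use strict;
use warnings;

# Unique values: reverse swaps keys and values cleanly.
my %canonical = (
    'L-homoserine' => '2-Amino-4-hydroxybutyric acid',
    'L-iditol'     => 'L-Iditol',
);
my %original = reverse %canonical;
print "$original{'L-Iditol'}\n";

# Shared values: a plain reverse would keep only one synonym per name,
# so collect every synonym in an array instead.
my %many = (%canonical, 'L-Homoserine' => '2-Amino-4-hydroxybutyric acid');
my %synonyms_of;
push @{ $synonyms_of{ $many{$_} } }, $_ for sort keys %many;

print join(', ', @{ $synonyms_of{'2-Amino-4-hydroxybutyric acid'} }), "\n";
```

With the list form, rebuilding a formula with the original spelling would still need some record of which synonym was actually used.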

      --

      Mephit (See my home node for my rant about Opera and PerlMonks, and my earliest nodes.

•Re: Regular expressions Containing Octal values?
by merlyn (Sage) on May 17, 2002 at 15:27 UTC
    You'll need to distinguish the \1 from the following 2. A few ideas come to mind:
    • Use [2] instead of 2 (a simple character class)
    • Use (?:2) instead of 2 (a non-capturing paren)
    • Use (?:\1) instead of \1
    • Use /x mode, and insert a space
    • Insert an empty non-capturing paren between the two: (?:) {grin}
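For the pattern side, the ideas in this list can be tried against a small test string. This is only a sketch: the input '332' (a doubled digit followed by a literal 2) and the labels are invented for illustration.

```perl
use strict;
use warnings;

my $text = '332';    # a digit, the same digit again, then a literal 2

# Each pattern separates the backreference \1 from the following "2".
my @patterns = (
    [ 'character class [2]'  => qr/(\d)\1[2]/   ],
    [ 'non-capturing (?:2)'  => qr/(\d)\1(?:2)/ ],
    [ 'wrapped (?:\1)'       => qr/(\d)(?:\1)2/ ],
    [ '/x mode with a space' => qr/(\d)\1 2/x   ],
    [ 'empty group (?:)'     => qr/(\d)\1(?:)2/ ],
);

my $matched = 0;
for my $p (@patterns) {
    my ($label, $re) = @$p;
    my $ok = $text =~ $re;
    $matched++ if $ok;
    printf "%-22s %s\n", $label, $ok ? 'matches' : 'no match';
}

# Without a separator, \12 is not "\1 then 2": with only one capture group
# it is read as octal 012 (newline), so this does not match '332'.
my $ambiguous = qr/(\d)\12/;
printf "%-22s %s\n", 'plain \12', $text =~ $ambiguous ? 'matches' : 'no match';
```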

    -- Randal L. Schwartz, Perl hacker


    update: And these are all great solutions if we're talking about the regex part of the substitution, but the original questioner was asking about the replacement part. Gah.

    In that case, just use ${1} instead. Shouldn't have been using \1 in the first place. {grin}