arunhorne has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys,

I have this set of chemical names something like this:

12 chem1
1/2 chem2
(N+1) chem3
2M chem4
N chem5

What I want to do is write a function to process each entry in turn and extract just its prefix, i.e. 12 or 1/2 or (N+1) etc. The prefixes are of course described by regular expressions (as it may be any number in front of a name, not just 12!), i.e.:

\d+
\d+\/\d+
\(N\+1\)
\d*[MN]

I just can't seem to write such a function, can anyone help?

Thanks, Arun

Replies are listed 'Best First'.
Re: Regular Expression to find Word Prefixes
by Corion (Patriarch) on May 19, 2002 at 15:54 UTC

    Spontaneously, I came up with a cheating idea :
    Instead of capturing the number, you could try to throw away the name of the chemical :

    #!/usr/bin/perl -w
    use strict;

    while (<DATA>) {
        chomp;
        if (/^(.*?)\s[A-Za-z]+$/) {
            print "Found $1 in $_\n";
        } else {
            warn "Don't know how to handle $_\n";
        };
    };

    This regular expression is of course very crude, as it will accept anything before the first whitespace as the number and discard everything after that. If your data doesn't allow this "easy-way-out" solution, you could try to gather all of your different regular expressions into one big regular expression by putting them into non-capturing parentheses and concatenating them with alternation:

    if (m!^((?:\d+)|(?:\d+/\d+)|(?:\(N\+1\))|\d*[MN])\s!) { print "Found $1 in $_"; };

    If the above big RE works for you, you can start optimizing it by collecting common parts together. For example, the first part \d+ is just a sub-part of \d+/\d+, and the two can be folded into one using the optional operator ?, like:

    m!\d+(/\d+)?!     # Here, we use capturing parentheses for legibility
    m!\d+(?:/\d+)?!   # Noncapturing parentheses for programmer-ease
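A minimal sketch of the folded pattern applied to the sample entries from the question. The helper name prefix_of and the variable $prefix_re are made up for this illustration, and the pattern assumes the prefix is always followed by whitespace:

```perl
use strict;
use warnings;

# Folded pattern: \d+(?:/\d+)? covers both "12" and "1/2".
# $prefix_re and prefix_of are hypothetical names for this sketch.
my $prefix_re = qr{^(\d+(?:/\d+)?|\(N\+1\)|\d*[MN])\s};

# Returns the prefix of an entry, or "" if no known prefix is found.
sub prefix_of {
    my ($entry) = @_;
    return $entry =~ $prefix_re ? $1 : '';
}

for my $entry ('12 chem1', '1/2 chem2', '(N+1) chem3', '2M chem4', 'N chem5') {
    printf "%-12s => '%s'\n", $entry, prefix_of($entry);
}
```

Anchoring with ^ and requiring \s after the group keeps a bare \d+ from accepting only the "2" of "2M": the match backtracks into the \d*[MN] alternative instead.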

    If you want more information, see the perlre manpage. But if you think you really want to use regular expressions as a frequent tool, there is no way around Jeffrey Friedl's Mastering Regular Expressions (though it's rumored that there will be a new version in the near future...)

    perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider ($c = $d->accept())->get_request(); $c->send_response( new #in the HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
      if you think you really want to use regular expressions as a frequent tool, there is no way around Jeffrey Friedl's Mastering Regular Expressions
      Spending some time in japhy's book on the subject might also prove useful in the meantime. :)
      You have moved into a dark place.
      It is pitch black. You are likely to be eaten by a grue.
Re: Regular Expression to find Word Prefixes
by Zaxo (Archbishop) on May 19, 2002 at 16:15 UTC

    Does (split " ")[0]; work for you?

    After Compline,
    Zaxo
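A small sketch of the split suggestion, assuming neither the prefix nor the name contains whitespace. The helper name first_word is hypothetical:

```perl
use strict;
use warnings;

# first_word is a hypothetical helper name for this illustration.
sub first_word {
    my ($entry) = @_;
    # split with the string " " skips leading whitespace and splits on
    # runs of whitespace, so element 0 is the first word of the entry.
    my $first = (split " ", $entry)[0];
    return $first;
}

print first_word($_), "\n" for '12 chem1', '  1/2 chem2', "(N+1)\tchem3";
```

The string " " is a special case for split: unlike the pattern / /, it ignores leading whitespace, so an indented entry still yields its prefix rather than an empty first field.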

Re: Regular Expression to find Word Prefixes
by enoch (Chaplain) on May 19, 2002 at 17:35 UTC
    Well, if the prefixes never contain spaces, Zaxo's fix works perfectly.

    Though, if you cannot make that assumption, but you can make the assumption that the actual chemical names never have a space in them, you can use japhy's sexeger techniques. That is, reverse the string and apply the regex there.

    #!/usr/bin/perl -w
    use strict;

    my @data = ('12 chem1', '1/2 chem2', '(N+1) chem3', '2M chem4', 'N chem5');
    my $chem;
    foreach (@data) {
        $chem = reverse $_;
        $chem =~ s/^\S+//;
        $chem = reverse $chem;
        print $chem . "\n";
    }
    Jeremy

      $chem = reverse $_;
      $chem =~ s/^\S+//;
      $chem = reverse $chem;
      print $chem . "\n";

      How about

      ($chem) = $chem =~ /(.*)\s/

      - Yes, I reinvent wheels.
      - Spam: Visit eurotraQ.
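The two approaches can be compared side by side. This is a sketch under the same assumption as above (the chemical name itself contains no whitespace); the names prefix_by_reverse and prefix_by_greedy are hypothetical, and the entry '1/2 mg chem2' is an invented example of a prefix that contains a space:

```perl
use strict;
use warnings;

# Reverse the string, strip the leading run of non-whitespace (the
# reversed name), then reverse back. Hypothetical helper name.
sub prefix_by_reverse {
    my ($entry) = @_;
    my $rev = reverse $entry;
    $rev =~ s/^\S+//;
    return scalar reverse $rev;
}

# Greedy .* runs up to the last whitespace character in the entry.
# Hypothetical helper name.
sub prefix_by_greedy {
    my ($entry) = @_;
    my ($prefix) = $entry =~ /(.*)\s/;
    return $prefix;
}

for my $entry ('12 chem1', '1/2 mg chem2') {
    printf "reverse: '%s'  greedy: '%s'\n",
        prefix_by_reverse($entry), prefix_by_greedy($entry);
}
```

One difference worth noting: the reverse version keeps the whitespace that separated prefix and name, while the greedy capture drops the final whitespace character.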
      

      I really like that reverse idea, and thanks for pointing out japhy's book, that's really handy too. Arun
Re: Regular Expression to find Word Prefixes
by hotshot (Prior) on May 19, 2002 at 15:59 UTC
    If your text after the prefix is always chemX (where X is a digit) or a lowercase word, or a space is separating the words, you can do:
    /(\S+)\s+(\S+)/


    Thanks.

    Hotshot
The Final Solution I Adopted
by arunhorne (Pilgrim) on May 19, 2002 at 21:43 UTC

    I wrote a subroutine that manifests some of the ideas I was given. Thanks everyone who contributed.

    Here's the code:

    #
    # in  : a name
    # out : the prefix (or "" if none) in array pos 0 and the
    #       name (always present) in array pos 1.
    #
    # Understands the following forms:
    #
    #   2 name
    #   1/2 name or 1\2 name
    #   (N+1) name
    #   2M name or 2N name
    #   M name or N name
    #
    sub get_name_parts($) {
        my ($name) = @_;   # lexical copy, so the caller's $_ is left alone
        if ($name =~ m!^(\d+|\d+[\\/]\d+|\(N\+1\)|\d*[MN])\s(.+)!i) {
            return ($1, $2);
        } else {
            return ("", $name);
        };
    }

    Thanks again and hope this helps people in the future.

    Arun
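A usage sketch for the subroutine, reproduced here (with a lexical variable in place of $_ and without the prototype) so that the snippet runs on its own. The entries are the sample data from this thread:

```perl
use strict;
use warnings;

# The subroutine from the post above, using a lexical variable.
sub get_name_parts {
    my ($name) = @_;
    if ($name =~ m!^(\d+|\d+[\\/]\d+|\(N\+1\)|\d*[MN])\s(.+)!i) {
        return ($1, $2);
    }
    return ("", $name);
}

for my $entry ('12 chem1', '1/2 chem2', '(N+1) chem3', '2M chem4',
               "N'-phosphoguanidinoethyl methyl phosphate",
               "2N N'-phosphoguanidinoethyl methyl phosphate") {
    my ($prefix, $name) = get_name_parts($entry);
    print "prefix='$prefix' name='$name'\n";
}
```

Because the prefix group must be followed by whitespace, the leading N of "N'-phosphoguanidinoethyl" is not taken as a prefix. One caveat, stated as an assumption about the data: with the /i flag, a name that starts with a standalone "m" or "n" word would be split as if that word were a prefix.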

Re: Regular Expression to find Word Prefixes
by vladdrak (Monk) on May 19, 2002 at 23:15 UTC
    Well, if you can assume that the data will always follow the pattern you presented, then you can split on a space:
    my @words=( "12 chem1", "1/2 chem2", "(N+1) chem3", "2M chem4", "N chem5" );

    # given an array of words, returns an
    # array of corresponding prefixes
    sub get_prefixes {
        map { (split)[0] } @_;
    }

    print join(",",get_prefixes(@words));
    -Vlad
Re: Regular Expression to find Word Prefixes
by xgunnerx (Initiate) on May 20, 2002 at 13:23 UTC

    Just use split:

    my $var = '(N+1) chem3'; # Or whatever

    my ($var1, $var2) = split (/\s/,$var);

    print "$var1\n";
    print "$var2\n";

      xgunnerx ... this isn't an option because of the following real life situation:

      Two atomic entries found in my data set:

      N'-phosphoguanidinoethyl methyl phosphate
      2N N'-phosphoguanidinoethyl methyl phosphate

      For the first entry your script will return the following when it should return an empty string as there is no prefix:

      N'-phosphoguanidinoethyl methyl

      For the second entry it will return the following when it should return "2N" as it is the prefix (note that the entire name minus prefix is not returned by your code, as the 'methyl phosphate' bit is missing):

      2N N'-phosphoguanidinoethyl

      I admit that given the original data set that I provided your solution would work, but it does not allow for arbitrary names with spaces.

      Arun
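The failure described above can be reproduced with a short sketch. The wrapper name naive_split is hypothetical; its body is the split call from the suggestion:

```perl
use strict;
use warnings;

# naive_split is a hypothetical wrapper around the split-based suggestion.
sub naive_split {
    my ($entry) = @_;
    # Only the first two whitespace-separated fields are kept.
    my ($first, $second) = split /\s/, $entry;
    return ($first, $second);
}

for my $entry ("N'-phosphoguanidinoethyl methyl phosphate",
               "2N N'-phosphoguanidinoethyl methyl phosphate") {
    my ($first, $second) = naive_split($entry);
    print "first='$first' second='$second'\n";
}
```

In the first entry the start of the name is reported as the prefix, and in both entries the rest of the name is cut off, which is why a pattern that recognises the prefix forms explicitly is needed for names containing spaces.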