Regular Expression to find Word Prefixes

arunhorne has asked for the wisdom of the Perl Monks concerning the following question:

Re: Regular Expression to find Word Prefixes by Corion (Patriarch) on May 19, 2002 at 15:54 UTC
Spontaneously, I came up with a cheating idea: instead of capturing the number, you could try to throw away the name of the chemical:

```perl
#!/usr/bin/perl -w
use strict;

# Reads its records from a __DATA__ section, which is not shown here.
while (<DATA>) {
    chomp;
    if (/^(.*?)\s[A-Za-z]+$/) {
        print "Found $1 in $_\n";
    } else {
        warn "Don't know how to handle $_\n";
    }
}
```

This regular expression is of course very crude: it accepts anything before the whitespace as the number and only requires that a single run of letters follows it. If your data doesn't allow this "easy-way-out" solution, you could try to gather all of your different regular expressions into one big regular expression by putting them into non-capturing parentheses and joining them with alternation:

```perl
if (m!^((?:\d+)|(?:\d+/\d+)|(?:\(N\+1\))|\d[MN])\s!) {
    print "Found $1 in $_";
};
```

If the above big RE works for you, you can start optimizing it and collecting common parts together. For example, the first part `\d+` is just a sub-part of `\d+/\d+`, and the two can be folded into one using the optional operator `?`:

```perl
m!\d+(/\d+)?!    # Here, we use capturing parentheses for legibility
m!\d+(?:/\d+)?!  # Noncapturing parentheses for programmer-ease
```

The perlre manpage has more details, but if you think you really want to use regular expressions as a frequent tool, there is no way around Jeffrey Friedl's Mastering Regular Expressions (but it's rumored that there will be a new version in the near future...).

```perl
perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The
$d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider
($c = $d->accept())->get_request(); $c->send_response( new #in the
HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
```
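A minimal sketch of the folded alternation, run against the five sample strings that appear elsewhere in this thread. The sub name `find_prefix` is made up for illustration, and the pattern is only one reading of the idea above, not a tested solution for the real data set.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the prefix, or undef when none is recognised.
sub find_prefix {
    my ($line) = @_;
    # \d+(?:/\d+)?  covers "12" and "1/2"
    # \(N\+1\)      covers the literal "(N+1)"
    # \d[MN]        covers "2M" and "2N"
    return $line =~ m!^(\d+(?:/\d+)?|\(N\+1\)|\d[MN])\s! ? $1 : undef;
}

for my $line ('12 chem1', '1/2 chem2', '(N+1) chem3', '2M chem4', 'N chem5') {
    my $prefix = find_prefix($line);
    print defined $prefix
        ? "Found $prefix in $line\n"
        : "Don't know how to handle $line\n";
}
```

Note that a bare `N` prefix is not covered by these alternatives, so the last sample falls through to the "Don't know" branch; widening `\d[MN]` to `\d*[MN]` is one way to accept it.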
Re: Re: Regular Expression to find Word Prefixes by Dog and Pony (Priest) on May 19, 2002 at 17:26 UTC
"if you think you really want to use regular expressions as a frequent tool, there is no way around Jeffrey Friedl's Mastering Regular Expressions"

Spending some time in japhy's book on the subject might also prove useful in the meantime. :)

You have moved into a dark place. It is pitch black. You are likely to be eaten by a grue.
Re: Regular Expression to find Word Prefixes by Zaxo (Archbishop) on May 19, 2002 at 16:15 UTC
Does `(split " ")[0];` work for you?

After Compline, Zaxo
Re: Regular Expression to find Word Prefixes by enoch (Chaplain) on May 19, 2002 at 17:35 UTC
Well, if the prefixes never contain spaces, Zaxo's fix works perfectly. If you cannot make that assumption, but you can assume that the actual chemical names never contain a space, you can use japhy's sexeger techniques. That is, reverse the string and apply the regex there.

```perl
#!/usr/bin/perl -w
use strict;

my @data = ('12 chem1', '1/2 chem2', '(N+1) chem3', '2M chem4', 'N chem5');
my $chem;
foreach (@data) {
    $chem = reverse $_;     # the name is now at the front
    $chem =~ s/^\S+//;      # strip it (the separating space stays)
    $chem = reverse $chem;  # back to the original order
    print $chem . "\n";
}
```

Jeremy
Re: Re: Regular Expression to find Word Prefixes by Juerd (Abbot) on May 19, 2002 at 19:01 UTC
"`$chem = reverse $_; $chem =~ s/^\S+//; $chem = reverse $chem; print $chem . "\n";`"

How about `($chem) = $chem =~ /(.*)\s/`?

- Yes, I reinvent wheels.
- Spam: Visit eurotraQ.
Re: Re: Regular Expression to find Word Prefixes by arunhorne (Pilgrim) on May 19, 2002 at 21:39 UTC
I really like that reverse idea, and thanks for pointing out japhy's book; that's really handy too.

Arun
Re: Regular Expression to find Word Prefixes by hotshot (Prior) on May 19, 2002 at 15:59 UTC
If the text after the prefix is always chemX (where X is a digit) or a lowercase word, and a space separates the words, you can do: `/(\S+)\s+(\S+)/`

Thanks.

Hotshot
The Final Solution I Adopted by arunhorne (Pilgrim) on May 19, 2002 at 21:43 UTC
I wrote a subroutine that puts some of the ideas I was given into practice. Thanks to everyone who contributed. Here's the code:

```perl
#
# in  : a name
# out : the prefix (or "" if none) in array pos 0 and the
#       name (always present) in array pos 1.
#
# Understands the following forms:
#
#   2 name
#   1/2 name or 1\2 name
#   (N+1) name
#   2M name or 2N name
#   M name or N name
#
sub get_name_parts($) {
    my ($name) = @_;    # lexical copy, so the caller's $_ is left alone
    if ($name =~ m!^(\d+|\d+[\\/]\d+|\(N\+1\)|\d*[MN])\s(.+)!i) {
        return ($1, $2);
    } else {
        return ("", $name);
    };
}
```

Thanks again and hope this helps people in the future.

Arun
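A short usage sketch for the subroutine, under the assumption that the pattern is meant to match `(N+1)` literally and to separate the alternatives with `|`. The sub is repeated here (without the prototype) so the example runs on its own; the two long entries are the ones quoted later in the thread, and the short ones come from the comment block above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-alone copy of the subroutine; the pattern is an assumed reading.
sub get_name_parts {
    my ($name) = @_;
    if ($name =~ m!^(\d+|\d+[\\/]\d+|\(N\+1\)|\d*[MN])\s(.+)!i) {
        return ($1, $2);
    }
    return ("", $name);
}

my @entries = (
    '2 name', '1/2 name', '(N+1) name', '2M name', 'N name',
    "N'-phosphoguanidinoethyl methyl phosphate",
    "2N N'-phosphoguanidinoethyl methyl phosphate",
);

for my $entry (@entries) {
    my ($prefix, $rest) = get_name_parts($entry);
    print "prefix=[$prefix] name=[$rest]\n";
}
```

Because the pattern requires whitespace directly after the prefix, the `N'` at the start of an unprefixed name is not mistaken for an `N` prefix. A name whose first word is exactly `M` or `N` would still be read as a prefix, which may or may not matter for the real data.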
Re: Regular Expression to find Word Prefixes by vladdrak (Monk) on May 19, 2002 at 23:15 UTC
Well, if you can assume that the data will always follow the pattern you presented, then you can split on a space:

```perl
my @words = (
    "12 chem1",
    "1/2 chem2",
    "(N+1) chem3",
    "2M chem4",
    "N chem5"
);

# given an array of words, returns an
# array of corresponding prefixes
sub get_prefixes {
    map { (split)[0] } @_;
}

print join(",", get_prefixes(@words));
```

-Vlad
Re: Regular Expression to find Word Prefixes by xgunnerx (Initiate) on May 20, 2002 at 13:23 UTC
Just use split:

```perl
my $var = '(N+1) chem3';  # Or whatever
my ($var1, $var2) = split (/\s/,$var);
print "$var1\n";
print "$var2\n";
```
Re: Re: Regular Expression to find Word Prefixes by arunhorne (Pilgrim) on May 20, 2002 at 14:37 UTC
xgunnerx ... this isn't an option because of the following real-life situation. Two atomic entries found in my data set:

```
N'-phosphoguanidinoethyl methyl phosphate
2N N'-phosphoguanidinoethyl methyl phosphate
```

For the first entry your script will return the following, when it should return an empty string as there is no prefix:

```
N'-phosphoguanidinoethyl methyl
```

For the second entry it will return the following, when it should return "2N" as the prefix (note also that the name minus the prefix is not returned in full by your code, as the 'methyl phosphate' bit is missing):

```
2N N'-phosphoguanidinoethyl
```

I admit that given the original data set that I provided your solution would work, but it does not allow for arbitrary names with spaces.

Arun
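To make the failure concrete, here is a small sketch that runs both entries through a two-field split and through a split with a limit of 2. The helper names `naive_split` and `limited_split` are invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: keeps only the first two whitespace-separated fields.
sub naive_split {
    my ($entry) = @_;
    my ($first, $second) = split /\s/, $entry;
    return ($first, $second);
}

# Hypothetical helper: a limit of 2 keeps the rest of the name intact.
sub limited_split {
    my ($entry) = @_;
    return split ' ', $entry, 2;
}

for my $entry ("N'-phosphoguanidinoethyl methyl phosphate",
               "2N N'-phosphoguanidinoethyl methyl phosphate") {
    my ($first, $second) = naive_split($entry);
    my ($head,  $tail)   = limited_split($entry);
    print "two fields: [$first] [$second]\n";
    print "limit 2   : [$head] [$tail]\n";
}
```

The limit keeps the whole remainder together, but neither form can tell a real prefix from the first word of an unprefixed name, which is why a pattern that knows the allowed prefix shapes is still needed.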