Cefu has asked for the wisdom of the Perl Monks concerning the following question:
Monks,
I apologize in advance for this being a regex question rather than a Perl-specific one.
I'm trying to clean up a list of data entered in a free-text field (who needs validation anyway). For the most part the data consists of one or more numbers (which I want to keep) and sometimes color "words" which might appear before, after or between the numbers. I want to discard some specific colors but not other colors or other text. I'm trying to craft a regex to match the following and remove it:
So my current regex substitution looks something like:
s/\s*\(?\s*re?d?\s*\)?\s*//gi
This seemed to be working flawlessly until my spot checks revealed the following humorous example:
12345 Gray 6789 Red => 12345 Gay 6789
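For anyone who wants to reproduce the slip, here is a minimal sketch that wraps the substitution in a sub. The name strip_red_naive is made up for illustration and is not from the original post. Because every part of the pattern except the leading "r" is optional, a lone "r" inside another word is enough to match.

```perl
use strict;
use warnings;

# Hypothetical helper name; the substitution itself is the one quoted above.
sub strip_red_naive {
    my ($text) = @_;
    # Only the "r" is mandatory, so the "r" in "Gray" matches on its own.
    $text =~ s/\s*\(?\s*re?d?\s*\)?\s*//gi;
    return $text;
}

print strip_red_naive('12345 Gray 6789 Red'), "\n";   # prints "12345 Gay 6789"
```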
To avoid workplace embarrassment I thought it best to make sure that the bit I was removing occurred either just before or just after a number rather than in the middle of other text. So my thought is to modify the regex to something like:
s/(\d?)\s*\(?\s*re?d?\s*\)?\s*(\d?)//gi
The problem is that I can't leave both digits optional (as shown) or I'm still in the same boat. I also can't make either one mandatory or I'm dictating a before-number-only or after-number-only match. What I really want is one or the other (or both) but not neither.
As you might guess from the parens around the digits, I also considered checking what matched in the second part and substituting back in the original if I didn't see a digit. However, I ran into various problems ($1 being undefined, the A?B:C syntax not working inside the regex, etc.)
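One way the "substitute the original back in" idea might be made to work is the /e modifier, which evaluates the replacement side as Perl code so that the ?: operator is available there. This is only a sketch under assumptions: the sub name strip_red_conditional and the extra capture group around the color part are mine, not from the post. Since (\d?) can match the empty string, the digit groups are always defined when the pattern matches, and length() tells whether a digit was actually seen.

```perl
use strict;
use warnings;

# Hypothetical helper; the middle group ($2) is added so the original
# text can be put back when no digit was found on either side.
sub strip_red_conditional {
    my ($text) = @_;
    $text =~ s/(\d?)(\s*\(?\s*re?d?\s*\)?\s*)(\d?)/
        (length $1 or length $3) ? "$1$3" : "$1$2$3"/gie;
    return $text;
}

print strip_red_conditional('12345 Gray 6789 Red'), "\n";  # prints "12345 Gray 6789"
print strip_red_conditional('Red 6789'), "\n";             # prints "6789"
```

Note that, as in the original pattern, the whitespace around the color is consumed along with it whenever a digit is adjacent.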
So, is there some nice way to do this in a single regex? Can I somehow ask that the regex match one or more of two disjointed parts?
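One possible way to express "a digit before, or a digit after, or both" in a single regex is to alternate two copies of the pattern, one guarded by a lookbehind and one by a lookahead. The sketch below is my own illustration, not something taken from the replies, and the name strip_red_near_digit is hypothetical. The lookarounds are zero-width, so the digits themselves are never consumed and nothing needs to be put back.

```perl
use strict;
use warnings;

# Hypothetical helper name. Branch 1 requires a digit just before the
# match, branch 2 requires a digit just after it.
sub strip_red_near_digit {
    my ($text) = @_;
    $text =~ s/(?<=\d)\s*\(?\s*re?d?\s*\)?\s*
              |\s*\(?\s*re?d?\s*\)?\s*(?=\d)//gix;
    return $text;
}

print strip_red_near_digit('12345 Gray 6789 Red'), "\n";  # prints "12345 Gray 6789"
print strip_red_near_digit('Red 6789'), "\n";             # prints "6789"
```

The /x modifier is only there so the two branches can sit on separate lines. The surrounding whitespace is still swallowed, just as in the original substitution.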
Thanks,
Cefu
Update: Found a solution that almost matches my requirements and is actually better for what I needed. I'd downvote myself for not thinking this through first if I could. :)
s/(^|\d)\s*\(?\s*re?d?\s*\)?\s*($|\d)/$1$2/gi
While messing around with getting a conditional to work on the right side of the substitution I noticed that it happily substituted no characters when $1 or $2 are undefined. It does the same, without complaint, when they match the anchors ^ and $. So, rather than shoot for some weird hybrid of optional and mandatory I decided to make them both mandatory but with palatable alternatives (the anchors).
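A small runnable sketch of the anchored version, with the hypothetical sub name strip_red_anchored added for illustration. When a group matches an anchor, it captures the empty string, so putting $1$2 back restores exactly the digits that were consumed and nothing else.

```perl
use strict;
use warnings;

# Hypothetical helper name; the substitution is the one from the update.
sub strip_red_anchored {
    my ($text) = @_;
    # Each side must be a digit or an edge of the string.
    $text =~ s/(^|\d)\s*\(?\s*re?d?\s*\)?\s*($|\d)/$1$2/gi;
    return $text;
}

print strip_red_anchored('12345 Gray 6789 Red'), "\n";  # prints "12345 Gray 6789"
```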
Where this differs from my requirements is that it will not match a Red-like word if there is a digit on one side and more text (rather than the beginning or end of the string) on the other. So, for example, with my new regex:
(Red) 123 Reddish-Orange 456 Orange 789 => 123 Reddish-Orange 456 Orange 789
Whereas if someone had come up with a way to get the behavior I asked for above it would have done this:
(Red) 123 Reddish-Orange 456 Orange 789 => 123 dish-Orange 456 Orange 789
The results of my number-or-edge-of-string before and after suit me better.
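One edge case worth checking with the number-or-edge approach: because the digit groups consume the digits they match, a digit that closes one match cannot also open the next one under /g. A possible variant, sketched here as my own assumption rather than anything from the thread, swaps the capture groups for zero-width lookarounds. The sub names are hypothetical.

```perl
use strict;
use warnings;

# The substitution from the update, wrapped for comparison.
sub strip_red_anchored {
    my ($text) = @_;
    $text =~ s/(^|\d)\s*\(?\s*re?d?\s*\)?\s*($|\d)/$1$2/gi;
    return $text;
}

# Lookaround variant: the same digit-or-edge rule, but the digits are
# only inspected, never consumed, so adjacent matches can share one.
sub strip_red_lookaround {
    my ($text) = @_;
    $text =~ s/(?:^|(?<=\d))\s*\(?\s*re?d?\s*\)?\s*(?=\d|$)//gi;
    return $text;
}

print strip_red_anchored('1 Red 2 Red 3'), "\n";    # prints "12 Red 3"
print strip_red_lookaround('1 Red 2 Red 3'), "\n";  # prints "123"
```

Both versions give the same result on the Reddish-Orange example above. They only differ when a single digit sits between two removable colors.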
Requirements translate into code or code translates into requirements..... I never can seem to remember how that's supposed to go. :)
Replies are listed 'Best First'.

Re: RegEx to match at least one non-adjacent term
by ikegami (Patriarch) on Dec 07, 2007 at 16:12 UTC
by Cefu (Beadle) on Dec 07, 2007 at 17:00 UTC
by ikegami (Patriarch) on Dec 07, 2007 at 17:09 UTC

Re: RegEx to match at least one non-adjacent term
by tuxz0r (Pilgrim) on Dec 07, 2007 at 16:08 UTC
by ikegami (Patriarch) on Dec 07, 2007 at 16:26 UTC

Re: RegEx to match at least one non-adjacent term
by toolic (Bishop) on Dec 07, 2007 at 16:34 UTC
by Cefu (Beadle) on Dec 07, 2007 at 17:12 UTC

Re: RegEx to match at least one non-adjacent term
by CountZero (Bishop) on Dec 07, 2007 at 21:38 UTC

Re: RegEx to match at least one non-adjacent term
by Not_a_Number (Prior) on Dec 07, 2007 at 19:43 UTC