Monks,

I apologize in advance for this being a regex question rather than a Perl-specific one.

I'm trying to clean up a list of data entered in a free-text field (who needs validation anyway). For the most part the data consists of one or more numbers (which I want to keep) and sometimes color "words" which might appear before, after or between the numbers. I want to discard some specific colors but not other colors or other text. I'm trying to craft a regex to match the following and remove it:

  • The word Red and variations thereof (R, r, rd etc.); multiple occurrences; any case
  • Optionally surrounded by various spaces and parentheses; not important if they are matched
    So my current regex substitution looks something like:

       s/\s*\(?\s*re?d?\s*\)?\s*//gi

    This seemed to be working flawlessly until my spot checks revealed the following humorous example:

       12345 Gray 6789 Red => 12345 Gay 6789
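
The failure is easy to reproduce. A minimal sketch, wrapping the substitution above in a helper called strip_red_naive (the helper name is illustrative, not part of the original code):

```perl
use strict;
use warnings;

# Hypothetical helper: applies the first substitution unchanged.
sub strip_red_naive {
    my ($text) = @_;
    # re?d? accepts a lone "r", so it also matches inside words like "Gray".
    $text =~ s/\s*\(?\s*re?d?\s*\)?\s*//gi;
    return $text;
}

print strip_red_naive('12345 Gray 6789 Red'), "\n";   # 12345 Gay 6789
```

Nothing in the pattern ties the match to a number, which is why the "r" in the middle of a word is fair game.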

    To avoid workplace embarrassment I thought it best to make sure that the bit I was removing occurred either just before or just after a number rather than in the middle of other text. So my thought is to modify the regex to something like:

       s/(\d?)\s*\(?\s*re?d?\s*\)?\s*(\d?)//gi

    The problem is that I can't leave both digits optional (as shown) or I'm still in the same boat. I also can't make either one mandatory or I'm dictating a before-number-only or after-number-only match. What I really want is one or the other (or both) but not neither.

    As you might guess from the parens around the digits, I also considered checking what matched in the second part and substituting back in the original if I didn't see a digit. However, I ran into various problems ($1 being undefined, the A?B:C syntax not working inside the regex, etc.)

    So, is there some nice way to do this in a single regex? Can I somehow ask that the regex match one or more of two disjointed parts?

    Thanks,
    Cefu

    Update: Found a solution that almost matches my requirements and is actually better for what I needed. I'd downvote myself for not thinking this through first if I could. :)

       s/(^|\d)\s*\(?\s*re?d?\s*\)?\s*($|\d)/$1$2/gi

    While messing around with getting a conditional to work on the right side of the substitution I noticed that it happily substituted no characters when $1 or $2 are undefined. It does the same, without complaint, when they match the anchors ^ and $. So, rather than shoot for some weird hybrid of optional and mandatory I decided to make them both mandatory but with palatable alternatives (the anchors).

    Where this differs from my requirements is that it will not match a Red-like word if there is a digit on one side and more text (rather than the beginning or end of the string) on the other. So, for example, with my new regex:

       (Red) 123 Reddish-Orange 456 Orange 789 => 123 Reddish-Orange 456 Orange 789
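
The updated substitution can be checked against both examples from this post. A minimal sketch, with strip_red_anchored as a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: applies the updated substitution unchanged.
sub strip_red_anchored {
    my ($text) = @_;
    # Each side must be a digit or an edge of the string; whatever was
    # captured (a digit, or an empty string for ^ and $) is put back.
    $text =~ s/(^|\d)\s*\(?\s*re?d?\s*\)?\s*($|\d)/$1$2/gi;
    return $text;
}

print strip_red_anchored('(Red) 123 Reddish-Orange 456 Orange 789'), "\n";
# 123 Reddish-Orange 456 Orange 789
print strip_red_anchored('12345 Gray 6789 Red'), "\n";
# 12345 Gray 6789
```

One side effect to be aware of: the captured digits are put back with nothing between them, so a Red sitting between two numbers fuses them. For example, '12 Red 34' comes out as '1234'.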

    Whereas if someone had come up with a way to get the behavior I asked for above it would have done this:

       (Red) 123 Reddish-Orange 456 Orange 789 => 123 dish-Orange 456 Orange 789
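
For completeness, the originally requested behavior (a digit on at least one side) could probably be had with lookarounds: one alternative requires a digit before the match, the other a digit after it. This is only a sketch under assumptions, not something from the thread. strip_red_near_digit is a hypothetical name, and replacing each match with a single space and then trimming is a choice made here so that neighboring numbers stay separate.

```perl
use strict;
use warnings;

# Hypothetical helper: remove a Red-like token when a digit sits on
# at least one side of it (ignoring spaces and parentheses).
sub strip_red_near_digit {
    my ($text) = @_;
    $text =~ s/
        (?<=\d) \s* \(? \s* re?d? \s* \)? \s*      # digit before the token
      |
        \s* \(? \s* re?d? \s* \)? \s* (?=\d)       # digit after the token
    / /gix;
    # Assumption: collapse the leftover edge whitespace.
    $text =~ s/^\s+|\s+$//g;
    return $text;
}

print strip_red_near_digit('(Red) 123 Reddish-Orange 456 Orange 789'), "\n";
# 123 dish-Orange 456 Orange 789
print strip_red_near_digit('12345 Gray 6789 Red'), "\n";
# 12345 Gray 6789
```

Because lookarounds do not consume the digits, a Red between two numbers no longer fuses them: '12 Red 34' becomes '12 34'.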

    The results of my number-or-edge-of-string before and after suit me better.

    Requirements translate into code or code translates into requirements..... I never can seem to remember how that's supposed to go. :)


    In reply to RegEx to match at least one non-adjacent term by Cefu
