jrblas has asked for the wisdom of the Perl Monks concerning the following question:

The regular expression...

 /ABCD{3}.E/

...matches: ABCDDDJE ABCDDDLE ABCDDD6E ABCDDD?E

How should I write it to match all strings with two D's among the 4th, 5th and 6th positions, irrespective of where they fall within that segment?

That is, to match: ABC6DDJE ABCD6DJE ABCDD6JE ABCDD6LE ... and so on

Rather than a brute-force approach, I would appreciate a generic way of writing this regular expression, since I want to apply it to cases with multiple repetitions and variable lengths. For instance, a regexp like this:

 /D.{3}[A-Z]{3,15}[0-9]{6,8}[^D]{2}/

Thanks a lot in advance, JR

Replies are listed 'Best First'.
Re: regexp with mismatches
by JavaFan (Canon) on Mar 25, 2012 at 16:33 UTC
    For the first question, use:
    /ABC(?:DD.|D.D|.DD).E/; # Add /s if newline is acceptable near the Ds

    As for the generic case, you don't. Regexes aren't suitable for that. For particular cases, you may be able to get away with listing all possibilities, or by using (?{...}), but in a general case, you'd be using the wrong tool.
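For a small window, "listing all possibilities" does not have to be done by hand. The following is a minimal sketch, not something posted in the thread: `window_alternation` is a hypothetical helper that enumerates every placement of a character within a fixed-length window and joins the placements into an alternation. Because `.` also matches the character itself, the resulting pattern accepts "at least k" occurrences, just like the hand-written regex above.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not from the thread): build an
# alternation listing every way to put $k copies of $char into a window
# of $n characters, with '.' in the remaining positions.
sub window_alternation {
    my ($char, $n, $k) = @_;
    my @alts;
    for my $mask (0 .. 2**$n - 1) {
        my $bits = sprintf "%0${n}b", $mask;      # e.g. "011"
        next unless ($bits =~ tr/1//) == $k;      # keep masks with k ones
        push @alts, join '',
            map { $_ eq '1' ? quotemeta($char) : '.' } split //, $bits;
    }
    return '(?:' . join('|', @alts) . ')';
}

my $window = window_alternation('D', 3, 2);       # (?:.DD|D.D|DD.)
my $re     = qr/ABC${window}.E/;

for my $s (qw(ABC6DDJE ABCD6DJE ABCDD6JE ABCDDDJE ABC66DJE)) {
    printf "%s %s\n", $s, $s =~ $re ? 'matches' : 'does not match';
}
```

The number of masks doubles with every extra position in the window, so this only stays practical for short windows, which is consistent with the advice that the general case calls for a different tool.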

Re: regexp with mismatches
by graff (Chancellor) on Mar 25, 2012 at 23:52 UTC
    I'm not sure where you're trying to go with your regex for "multiple repetitions with variable length" (that is, it's not clear what sets of strings you want to match), but one alternative for the initial case is a two-step test, where the second step uses tr///:
    if ( /ABC(...).E/ and 2 <= ( $1 =~ tr/D// )) {
        # get here when the captured region contains 2 or 3 D's
    }
    I think this sort of approach would scale reasonably well for more complicated cases: just include more captures in the initial regex match for the regions that need to pass a second condition, and add more tests using tr/// on each capture -- it might look like this (if this is the direction you're heading towards):
    if ( /D (.{3}) [A-Z]{3,15} (\d{6,8}) [^D]{2}/x
         and 2 <= ( $1 =~ tr/F//)  # first capture contains at least two F's
         and 3 <= ( $2 =~ tr/1//)  # second capture contains at least three 1's
       ) {
        # get here when all conditions are met
    }
    (Note that tr/// does not affect the contents of capture variables $1, $2, etc.)

    I think this also helps keep the logic more coherent and maintainable as code. Loading too many diverse conditions into a single, exhaustive regex can get cumbersome.
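To make the two-step test concrete, here is a small self-contained sketch that applies it to the sample strings from the question. The wrapper name `has_two_ds` is an assumption introduced for illustration; the regex and the `tr///` count are the ones shown above.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the two-step test: the regex fixes the
# frame (ABC, three characters, one more character, E) and tr/// counts
# the D's inside the captured three-character window.
sub has_two_ds {
    my ($str) = @_;
    my $ok = $str =~ /ABC(...).E/ && ( $1 =~ tr/D// ) >= 2;
    return $ok ? 1 : 0;
}

for my $s (qw(ABC6DDJE ABCD6DJE ABCDD6JE ABCDD6LE ABCDDDJE ABC66DJE)) {
    printf "%-8s %s\n", $s, has_two_ds($s) ? 'ok' : 'rejected';
}
```

One thing to keep in mind with this design: the regex settles on its first match before the count is checked, so if a string could match the frame at more than one offset, only the first offset is tested. Anchoring the pattern or looping with /g would be ways to handle that case.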

Re: regexp with mismatches
by Anonymous Monk on Mar 25, 2012 at 16:31 UTC
    123(?:.DD|D.D|DD.)