mdunnbass has asked for the wisdom of the Perl Monks concerning the following question:

I've read perlre and perlop, and I've read the Q&A on regexes, but there's something that I think I'm missing. If I want to find
$x = 'ABCD',
but I don't care if it's interrupted by anything non-uppercase, how do I specify that?

For instance, I want this to match:
$string1 = 'ABhere is intervening textC D',
and so would this:
$string2 = 'A B C, lots of text and numbers and equals and slashes DEFG'
but this would not:
$string3 = 'ABhere IS intervening TEXTC D'

I am guessing that simply specifying =~ /[A-Z]$x/gx wouldn't do it. That would just specify that the match had to be uppercase, which is redundant, given the string definition and lack of a /i, right?

If anyone could please just point me to what I might have missed, or offer advice, I'd really appreciate it.

Thanks
Matt

Re: RegEx ignoring intervening characters?
by ikegami (Patriarch) on Jan 19, 2007 at 20:53 UTC
    $s =~ /A[^A-Z]*B[^A-Z]*C[^A-Z]*D/;
    or
    # Same thing, but built dynamically.
    my $re = join '[^A-Z]*', split //, 'ABCD';
    $s =~ /$re/;

    or

    my $temp = $s;
    $temp =~ s/[^A-Z]//g;
    $temp =~ /ABCD/;
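
    A minimal sketch of the dynamic variant, checked against the three strings from the question. The helper name build_gap_regex is hypothetical, and the quotemeta call is an added precaution (an assumption, not part of the reply above) in case the search word contains regex metacharacters.

```perl
use strict;
use warnings;

# Hypothetical helper: allow any run of non-uppercase characters
# between consecutive characters of $word.
sub build_gap_regex {
    my ($word) = @_;
    # quotemeta is an added precaution for words containing metacharacters.
    my $re = join '[^A-Z]*', map { quotemeta($_) } split //, $word;
    return qr/$re/;
}

my $re = build_gap_regex('ABCD');

# The three sample strings from the original question.
my @samples = (
    'ABhere is intervening textC D',
    'A B C, lots of text and numbers and equals and slashes DEFG',
    'ABhere IS intervening TEXTC D',
);

for my $s (@samples) {
    # Matching only reads $s; the original text is never modified.
    my $verdict = $s =~ $re ? 'match:   ' : 'no match:';
    print "$verdict $s\n";
}
```

    The first two strings should be reported as matches and the third as a non-match, because "IS" and "TEXT" put uppercase letters between B and C.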
      Re: (my $x = $s) =~ s/[^A-Z]//g;

      If I understand your code right, that'd delete anything matching [^A-Z] first, and then match the pattern second, right?

      I guess I should have explained better, but I don't want to modify anything interspersed within the matching ABCD characters. In fact, I very emphatically want them to remain unmolested. So, while your approach looks like it would work, it's not quite what I was looking for.

      As for $s =~ /A[^A-Z]*B[^A-Z]*C[^A-Z]*D/;

      if the $x pattern I am looking for is from uc(chomp($x = <STDIN>)), would I just need to use split, inserting the [^A-Z]* after every character? would that work?

      Thanks
      Matt

        if the $x pattern I am looking for is from uc(chomp($x = <STDIN>)), would I just need to use split, inserting the [^A-Z]* after every character? would that work?
        Yes, you would do this:
        my $x = <STDIN>;
        chomp $x;
        $x = uc($x);
        my $regex = join('[^A-Z]*', split //, $x);
        (my $x = $s) =~ s/[^A-Z]//g;

        That assigns the value of $s to $x, then performs the substitution on $x. $s is left unmolested.

        $ perl -e 'chomp($orig = <STDIN>); ($tmp = $orig) =~ s/[^A-Z]//g; print "Copy: $tmp\nOriginal: $orig\n";'
        WeDfT
        Copy: WDT
        Original: WeDfT

        So it's a perfectly valid element of your solution.

        Update: Hey, how about I actually show it in action?

        use strict;
        use warnings;

        chomp(my $input = <STDIN>);
        (my $test = $input) =~ s/[^A-Z]//g;

        if ($test =~ /ABCD/) {
            print "'$input' matches!\n";
        }
        else {
            print "'$input' does not match.\n";
        }

        $ perl test.pl
        WaFFe
        'WaFFe' does not match.
        $ perl test.pl
        AeBwaffleCfrenchtoastD
        'AeBwaffleCfrenchtoastD' matches!
Re: RegEx ignoring intervening characters?
by gaal (Parson) on Jan 19, 2007 at 20:55 UTC
    The class of non uppercase English characters is [^A-Z]. So you need

    $str =~ /^A[^A-Z]*B[^A-Z]*C[^A-Z]*D$/;

    You can write this a little more clearly as:

    my $nu = qr/[^A-Z]*/;
    $str =~ /^ A ${nu} B ${nu} C ${nu} D $/x;

    (Use [[:^upper:]] or the Unicode \P{IsUpper} for non-English text. The POSIX class is only valid inside a bracketed character class.)
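
    A small sketch of why the Unicode-aware gap matters, assuming the intervening text may contain uppercase letters outside A-Z. The sample text (a capital Greek Omega between A and B) and the variable names are made up for illustration; the Omega is written as an escape so the example does not depend on the source file's encoding.

```perl
use strict;
use warnings;

my $ascii_gap   = qr/[^A-Z]*/;       # only A-Z count as uppercase
my $unicode_gap = qr/\P{IsUpper}*/;  # any Unicode uppercase character counts

# "A", then a capital Greek Omega (U+03A9), then "B".
my $text = "A \x{3A9} B";

my $ascii_hit   = $text =~ /A${ascii_gap}B/   ? 1 : 0;
my $unicode_hit = $text =~ /A${unicode_gap}B/ ? 1 : 0;

print "ASCII gap:   ", ($ascii_hit   ? "match" : "no match"), "\n";
print "Unicode gap: ", ($unicode_hit ? "match" : "no match"), "\n";
```

    Here the [^A-Z]* gap lets the Omega through and reports a match, while the \P{IsUpper}* gap treats it as an uppercase interruption and does not.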

Re: RegEx ignoring intervening characters?
by imp (Priest) on Jan 19, 2007 at 20:59 UTC
    You could do something like this:
    use strict;
    use warnings;

    my $stuff = qr/[^A-Z]*/;
    my $regex = qr{
        A $stuff
        B $stuff
        C $stuff
        D
    }x;

    while (my $line = <DATA>) {
        if ($line =~ $regex) {
            print "Matched: $line";
        }
    }

    __DATA__
    ABhere is intervening textC D
    A B C, lots of text and numbers and equals and slashes DEFG
    ABhere IS intervening TEXTC D
    And an alternate way of forming the pattern:
    my $stuff = qr/[^A-Z]*/;
    my $regex = join($stuff, split '', 'ABCD');
      Niggle:

      edge case? specs?

      To illustrate, note the use of three distinct regexen, $stuff1, $stuff2, and $stuff3 (in some ways more precise, I think, though I am inviting other views, please), and compare the last line of __DATA__ with the LAST line of output.

      use strict;
      use warnings;

      my $stuff1 = qr/[^A]|[^C-Z]*/;
      my $stuff2 = qr/[^A-B]|[^D-Z]*/;
      my $stuff3 = qr/[^A-C]|[^E-Z]*/;
      my $regex = qr{
          A $stuff1
          B $stuff2
          C $stuff2
          D
      }x;

      while (my $line = <DATA>) {
          chomp($line);
          if ($line =~ $regex) {
              print "Matched: \" $line \"\n";
          }
          else {
              print "**Did NOT match \"$line\"\n";
          }
      }

      __DATA__
      ABhere is intervening textC D
      A B C, lots of text and numbers and equals and slashes DEFG
      ABhere IS intervening TEXTC D
      A BIG foo is B intervening text CD
      A Big Cat interjects itself into text before CD

      OUTPUT:

      Matched: " ABhere is intervening textC D "
      Matched: " A B C, lots of text and numbers and equals and slashes DEFG "
      **Did NOT match "ABhere IS intervening TEXTC D"
      **Did NOT match "A BIG foo is B intervening text CD"
      Matched: " A Big Cat interjects itself into text before CD "

      In the last line of __DATA__ an uppercase "C" precedes another uppercase "C" (the penultimate character), yet the regex does not object (i.e., it says there's a match).

      Update: Someone upvoted this as I was updating it -- to fix mental and typographic glitches; the said updating may have removed what the ++er thought was meritorious. Sorry.
Re: RegEx ignoring intervening characters?
by diotalevi (Canon) on Jan 19, 2007 at 20:53 UTC

    You said you wanted "zero or more characters as long as they aren't uppercase" which in code is [^A-Z]* provided you think A-Z is all your uppercase characters. In POSIX that'd be [^[:upper:]]* and in Unicode it'd be \P{IsUpper}*

    ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

Re: RegEx ignoring intervening characters?
by Cody Pendant (Prior) on Jan 21, 2007 at 02:04 UTC
    Am I crazy or would it be sensible just to do something like
    $str =~ tr/A-Z//cd
    and then examine what's left? I don't know how to do benchmarking but rather than do a complex regex with lots of stars in it, why not turn the problem inside out?


    ($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
    =~y~b-v~a-z~s; print
      It would be sensible, except for the fact that I need to keep the original text intact.
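
      For what it's worth, tr can be applied to a copy so the original stays intact. A minimal sketch, assuming it is acceptable to keep a second, stripped-down string around; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $str = 'A B C, lots of text and numbers and equals and slashes DEFG';

# Copy first, then delete everything that is NOT A-Z from the copy
# (c = complement the search list, d = delete the matched characters).
(my $caps = $str) =~ tr/A-Z//cd;

print "Uppercase only: $caps\n";
print "Original:       $str\n";

# A plain substring test on the copy replaces the starred regex.
print "Contains ABCD\n" if index($caps, 'ABCD') >= 0;
```

      The copy ends up holding only the uppercase letters, while $str is left exactly as it was.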