adamcrussell has asked for the wisdom of the Perl Monks concerning the following question:

Suppose I want to match the occurrence of 4 characters in a string with the condition that all 4 must occur at least once and in any order.
Well, with a regex engine that implements greedy alternation I could have one single regex such as this egrep example:
$ echo 1b22a3d3c | egrep -o 'a|b|c|d' b a d c
But with perl(no greedy alternation) I would need to construct a large-ish if statement
echo 1b22a3d3c | perl -ne'if(/a/&&/b/&&/c/&&/d/){print"Match!\n";}'
If I am curious about in which order the 4 chars actually occurred in I can test each permutation using List::Permutor
use strict; use warnings; use List::Permutor; my $test_string="1b22a3d3c"; my @regex_parts=("a","b","c","d"); my $permutated_regex_parts=new List::Permutor @regex_parts; while (my @set = $permutated_regex_parts->next){ my $pattern=construct_regex(@set); if($test_string =~ /$pattern/){ print("a match! pattern is: $pattern\n"); } } sub construct_regex{ my $regex=".*"; foreach my $s (@_){ $regex.="$s.*"; } return $regex; }
I have toyed with this problem a bit...I am sure I missed several possible insights into better/shorter ways to accomplish this. Any advice?

Replies are listed 'Best First'.
Re: How to replace greedy alternation?
by ikegami (Patriarch) on Jan 28, 2009 at 04:32 UTC
    It can be done with one regexp.
    /^(?=.*a)(?=.*b)(?=.*c).*d/ or die;

    If you want to know the order in which they came,

    my %order; @order{qw( a b c d )} = map length, /^(?=(.*?)a)(?=(.*?)b)(?=(.*?)c)(.*?)d/ or die; my @ordered = sort { $order{$a} <=> $order{$b} } keys %order;

    This is probably slow for long strings.

    Update: Actually, no need for a hash.

    my @order = map length, /^(?=(.*?)a)(?=(.*?)b)(?=(.*?)c)(.*?)d/ or die; my @ordered = ( qw( a b c d ) )[ sort { $order[$a] <=> $order[$b] } 0..$#order ];

    Update: Back to the practical, using multiple regexp and saving @- should be much faster. Especially if scanning for constants.

    my @unordered = qw( a b c d ); my @order; for my $s (@unordered) { push @order, /\Q$s/ ? $-[0] : die; } my @ordered = @unordered[ sort { $order[$a] <=> $order[$b] } 0..$#order ];

      Another way to do it:

      my %order; /(?=.*(a))(?=.*(b))(?=.*(c))/ and @order{ qw/a b c/ } = @-[ 1 .. $#- ] +;
      This code does not work:
      /^(?=.*a)(?=.*b)(?=.*c).*d/
      For example, when used in the code below nothing s returned.
      echo bacd| perl -ne '/^(?=.*a)(?=.*b)(?=.*c).*d/; print "$1 $2 $3\n";'

        That is because there are no capturing parentheses in your regular expression and so $1, $2 and $3 are not affected and thus contain nothing.

Re: How to replace greedy alternation?
by BrowserUk (Patriarch) on Jan 28, 2009 at 04:25 UTC
Re: How to replace greedy alternation?
by kyle (Abbot) on Jan 28, 2009 at 04:27 UTC

    This is what I came up with.

    my $rx = qr/ ([abcd]) # a, b, c, or d -- stored in \1 .* # well, whatever (?!\1) # the next character can't be \1 ([abcd]) # a, b, c, or d -- stored in \2 .* # y'know, stuff (?!\1|\2) # next thing not \1 or \2 ([abcd]) # a, b, c, or d -- stored in \3 .* (?!\1|\2|\3) # next thing not \1, \2, or \3 [abcd] # you get the idea /x;

    I have a feeling there's still an easier way, but that's not too bad. Here is some testing.

    Update: See below, ikegami has the better way.

Re: How to replace greedy alternation?
by repellent (Priest) on Jan 28, 2009 at 05:29 UTC
    Another way to find out the order:
    my $test_string="1b22a3d3c"; my @ordered; push(@ordered, $1) while $test_string =~ /([abcd])/cg; pos($test_string) = 0; # reset m//cg offset

    Update: Used capture as per ikegami's reply.

    Update #2: This may be better:
    my %seen; my @order = grep { !$seen{$_}++ } $test_string =~ /([abcd])/g;

      $& imposes a needless slowdown on all regexp in your process that aren't already slowed down by captures. Use a capture instead.

      Your solution assumes none of the four patterns will match twice.

      TMTOW to find the order:
      my $test = "1b22a3d3cccccccba866.afgg"; (my $ordered = $test) =~ tr/abcd//cd; print $ordered, "\n"; # badcccccccbaa

        Indeed, getting rid of the clutter first, and then remove all but the first instance of each letter:

        sub reduc { my ($s) = @_ ; $s =~ tr/abcd//cd ; 1 while $s =~ s/(.).*?\K\1+//g ; return $s ; } ; print reduc('1b22a7b3d3ccaacccbcccacccddcceqcc11oabepd'), "\n" ; # badc print reduc('1b22w7b3d3ccsqcccbccc2cccddcceqcc11o9bepd'), "\n" ; # bdc
        I imagine a more cunning monk can improve the reduction step -- though the amount of work it does obviously depends on how many repeated letters need to be removed.

Re: How to replace greedy alternation?
by Marshall (Canon) on Jan 28, 2009 at 08:11 UTC
    I'll offer yet another solution....
    There is a requirement that each of the letters much appear at least once. And I interpret "order" to mean the order that they appeared in the string, not in frequency or order of the list that we are looking for...maybe that is wrong.

    The match global will just go through the input string once. The while() loop iterates over a list that the match global expression produced. I'm not sure how "smart" regex is with /o option (compile regex once). Having a join may be slower than I think, but my testing so far with Perl 5.10 says that regex is way smarter (and faster) than it used to be! You can do the join other ways or declare other variables. One main point about regex is that picking from a set [xyz] is a lot faster than /x/|/y/|/z|.

    Map and sort are pretty "heavy weight" critters. A hash table especially with a small number of keys is very efficient. A Perl hash gets created by default with 8 hash "buckets". Don't be afraid to use a hash. This code is verbose, but is straightforward and will run quickly.

    #!/usr/bin/perl -w use strict; my $str = "1b22a3d3cab"; my @letters = qw (a b 2 c d); my @order_found =(); my %seen =(); #UPDATE: don't need the $found variable, oops! #while (my $found = $str =~ m/([join("",@letters)])/go) #should have been: while ($str =~ m/([join("",@letters)])/go) { push (@order_found, $1) unless $seen{$1}; $seen{$1}++; } if (keys %seen != @letters) { print "not all letters found\n"; } else { print @order_found, "\n"; # prints b2adc }
Re: How to replace greedy alternation?
by jink (Scribe) on Jan 28, 2009 at 12:36 UTC
    It's been ages since I last posted here, but here goes.

    Your first statement is false. This egrep doesn't make sure that all characters occur, so it makes sense that it seems to be harder in perl. Egrep tries to match either of the characters you specified, and spits out only the parts that match, because of the -o:
    echo 1b22a3d3 | egrep -o 'a|b|c|d' b a d

    All Camels are equal, but some Camels are more equal than others.
      egrep finds all the cases because it has greedy alternation(the subject of this Seekers...). -o simply has egrep print out the matches it does find. A complete non-perl solution could be something like
      echo 1b22cc333a | egrep -o 'a|b|c' | sort -u | wc -l
      This will simply return 3. If I want 3 chars in any order and this command returns 3 than I had a successful match of all 3 in any order!