Multi_captures needed

rsFalse has asked for the wisdom of the Perl Monks concerning the following question:

Hello.

Yesterday I was solving problem in a contest trying to use regexes. The statement of the problem is here: http://codeforces.com/contest/633/problem/C. There are some performance limits: time and memory.

How would you solve this problem?
Shortly saying, the main problem is to use some words from given dictionary (each word zero or more times) to cover a given long string without overlays. For example, we have dictionary - qw { cb ae cba ed } and string 'cbaed'. We can cover it with two words: 'cba' and 'ed'.
I used embedded code in regex - /(?{ ... })/ and array to save submatches. And seems that if 'multi_captures' would be implemented in regex then solution was easier. Mine solution looks so:

chomp( (undef, $_, undef, @words) = <> );

%restore_case = map { lc, $_ } @words;

$list = join '|', map { scalar reverse lc } @words;

m/^ (?: ($list | $ | (?{ pop @submatches }) (?!) ) 
(?{ push @submatches, $^N }) )+ $/x;

pop @submatches;

print join ' ', map { $restore_case{ scalar reverse } } @submatches
[download]

Or you can see it with some tests here on a platform - http://codeforces.com/contest/633/submission/16382698
Explanation:

m/^ 
    (?: 
        ( $list  # dictionary joined with '|'
            | $  # if end of line then success, so don't pop array
            | (?{ pop @submatches }) (?!)  # pop array and suddenly fa
+il
        ) 
        (?{ push @submatches, $^N })  # after every matched word, push
+ it to array
        # $^N - last captured group
    )+ 
$/x;
[download]

I wanted to discuss: are there other solutions using regexes, and is it possible to increase performance?

I've read nice node - Advanced techniques with regex quantifiers. And tried to use '$^R' instead of array '@submatches' and it memory-limited on the contest platform - http://codeforces.com/contest/633/submission/16383676 . For me it seems that using push/pop is better than using arrays with $^R.
Later I took an example from topic about Advanced techniques:

":aa2bb4cc6dd8" =~ / (:)
                     (?{ [] })   # initialize $^R
                     (?:
                         (\w\w) (\d)
                         
                         # add captures to $^R:
                         (?{ [@{$^R}, [$2, $3]] })
                     )*
                   /x;
[download]

And rewrote it in push/pop array manner:

my @R;
":aa2bb4cc6dd8" =~ / (:)
                     (?:
                         (\w\w) (\d)
                         
                         (?{ push @R, [$2, $3] })
                     )*
                   /x;
[download]

And it seems to work much more faster on bigger inputs. I was measuring on ideone.com (semicolon removed):
This code worked 1.3 s (string x 1e3) (http://ideone.com/zMOyLS):

use Data::Dumper;
sub dd { print Dumper(shift) };

("aa2bb4cc6dd8" x 1e3) =~ /
                     (?{ [] })   # initialize $^R
                     (?:
                         (\w\w) (\d)
                         
                         # add captures to $^R:
                         (?{ [@{$^R}, [$1, $2]] })
                     )*
                   /x;
                   
dd @{ $^R }[ -1 ]
[download]

And this code worked 0.06 s (string x 5e3) (http://ideone.com/n3cKVr):

use Data::Dumper;
sub dd { print Dumper(shift) };

my @R = ();

("aa2bb4cc6dd8" x 5e3) =~ /
                     (?:
                         (\w\w) (\d)
                         
                         # push captures to @R:
                         (?{ push @R, [$1, $2] })
                     )*
                   /x;
                   
dd @R[ -1 ]
[download]

both output:

$VAR1 = [
          'dd',
          '8'
        ];
[download]

Comment on Multi_captures needed Select or Download Code

Replies are listed 'Best First'.
Re: Multi_captures needed by Eily (Monsignor) on Feb 28, 2016 at 12:38 UTC
You can have something that kind of behaves like "multiple captures" by matching in list context and using look ahead assertions to make sure the current match still allows for the rest of the string to match. Not sure about performance, but it's a solution that doesn't require to embed code in the regex. `use Data::Dump qw/pp/; chomp( (undef, $_, undef, @words) = <> ); # If multiple words are identical as lowercase, only the last one will + be kept %restore_case = map { lc reverse, $_ } @words; $list = join '\|', keys %restore_case; # Match any word in the list, only if it can be followed by words in t +he list until the end of the string my @result = map { $restore_case{$_} } /($list)(?=(?:$list)*$)/g; pp @result;` [download] You know that there is a match or the end of the string after each match, so after the first match you know the regex will always succeed at the current position. Adding \G can make that more explicit, and force the overall match to start at position 0.	[reply] [d/l]
Re: Multi_captures needed by AnomalousMonk (Archbishop) on Feb 28, 2016 at 19:05 UTC
I really like Eily's ++approach: it's short and sweet and probably sufficiently fast for short strings and word lists. It has the acknowledged problem of doing a lot of redundant look-aheading and so may be too slow for long strings (for some definition of "long"), but I'd be willing to cross that bridge when I come to it. One issue I don't see addressed stems from the fundamental nature of Perl 5 ordered alternations: the first match in the list of alternations is the final match. There is no consideration of "longest match", etc.; it's first come, first served. Consider `c:\@Work\Perl>perl -wMstrict -le "use Data::Dump qw(pp); ;; sub multi_cap_Eily { ;; my ($word, @words) = @_; ;; my ($dict) = map qr{ $_ }xms, join '\|', @words; ;; return $word =~ m{ \G ($dict) (?= $dict* \z) }xmsg; } ;; pp multi_cap_Eily('abcd', qw(ab cd a b c d abcd)); " ("ab", "cd")` [download] This uses what I would call a "naive" alternation producing the sub-strings `'ab' 'cd'` that are simply the first to match of the list of "words" that happened to be supplied. Is this result correct? Because the word patterns being supplied are simple character groups, it's easy to introduce a notion of shortest or longest matching. The statement `my ($dict) = map qr{ $_ }xms, join '\|', sort @words;` with a default ascending lexical sort produces the sub-strings `("a" .. "d")` and the descending `reverse sort` `my ($dict) = map qr{ $_ }xms, join '\|', reverse sort @words;` produces `"abcd"` I would consider either of those results to be more correct than the naive result if only because they are both independent of the adventitious ordering of the list of input sub-words. (I'm avoiding the whole business of converting back and forth between mixed- and single-case because that just seems tangential to the issue I'm trying to address.) This aspect of ordered alternation matching is often overlooked. Sometimes it can be easily addressed. Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]