in reply to Partitioning a set of strings by regular expressions

Although these modules don't directly solve the problem you're asking about, maybe they are a starting point: Regexp::Trie, Regexp::Assemble, Regex::PreSuf. Note that in your problem statement, taking the "a" strings as an example, the sample contains runs of exactly 1, 2, 5, or 7 consecutive a's, but a+ matches any run of one or more a's, which is more than that. Note how these optimizers take the number of a's into account.

use warnings;
use strict;

my @strs = qw/a b c aa bb ccc aaaaa bbb cccccc aaaaaaa bbbb ccccccc/;

use Regexp::Trie;
my $rt = Regexp::Trie->new;
$rt->add($_) for @strs;
print $rt->regexp, "\n";
# (?^:(?:a(?:a(?:aaa(?:aa)?)?)?|b(?:b(?:bb?)?)?|c(?:cc(?:cccc?)?)?))

use Regexp::Assemble;
my $ra = Regexp::Assemble->new;
$ra->add($_) for @strs;
print $ra->re, "\n";
# (?^:(?:a(?:a(?:(?:aa)?aaa)?)?|c(?:cc(?:c?ccc)?)?|b(?:b(?:b?b)?)?))

use Regex::PreSuf;
my $re = presuf(@strs);
print $re, "\n";
# (?:aa(?:aaa(?:aa)?)?|bb(?:bb|b)?|ccc(?:cccc?)?|[abc])
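To make the difference between "exactly these strings" and a+ concrete, here is a minimal sketch using only core Perl. The variable names ($exact_re, $loose_re) and the test strings are made up for illustration; the exact pattern is a plain alternation, not the output of any of the modules above.

```perl
use warnings;
use strict;

# The "a" strings from the sample set.
my @as = qw/a aa aaaaa aaaaaaa/;

# Exact pattern: matches only the sample strings themselves.
my $alternation = join '|', map { quotemeta($_) } @as;
my $exact_re    = qr/\A(?:$alternation)\z/;

# Loose pattern: matches any run of one or more a's.
my $loose_re = qr/\Aa+\z/;

# 'aaa' and 'aaaaaaaa' are not in the sample set, but a+ accepts them.
for my $s (qw/a aaa aaaaaaa aaaaaaaa/) {
    printf "%-9s exact=%-3s loose=%s\n",
        $s,
        ($s =~ $exact_re ? 'yes' : 'no'),
        ($s =~ $loose_re ? 'yes' : 'no');
}
```

Strings such as aaa are rejected by the exact pattern but accepted by a+, which is the gap between what the optimizers produce and what the problem statement describes.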

Re^2: Partitioning a set of strings by regular expressions
by Locutus (Beadle) on May 11, 2020 at 12:58 UTC

    Thanks for your suggestions! Will take a closer look at each of them.

    In fact, I'd definitely not want to take the exact number of character repetitions into account. My example simplifies the sort of strings I really have to deal with, but think of S as a sample of input strings already received and of R as the "types" of input strings we can (most likely) ever expect. So it would be completely fine to categorize into a's, b's, and c's.
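One way to picture that looser categorization is the following minimal sketch in core Perl. It assumes that every run of a repeated character may be treated as "one or more of that character"; the helper name shape_of is hypothetical and is not part of any of the modules mentioned in this thread.

```perl
use warnings;
use strict;

my @strs = qw/a b c aa bb ccc aaaaa bbb cccccc aaaaaaa bbbb ccccccc/;

# Hypothetical helper: collapse each run of a repeated character
# into "<char>+", e.g. "aaaaa" becomes "a+" and "aab" becomes "a+b+".
sub shape_of {
    my ($s) = @_;
    (my $shape = $s) =~ s/(.)\1*/quotemeta($1) . '+'/gse;
    return $shape;
}

# Group the sample strings by their collapsed shape.
my %groups;
push @{ $groups{ shape_of($_) } }, $_ for @strs;

for my $shape (sort keys %groups) {
    print "$shape: @{ $groups{$shape} }\n";
}
# a+: a aa aaaaa aaaaaaa
# b+: b bb bbb bbbb
# c+: c ccc cccccc ccccccc
```

Each key of %groups can then serve as a candidate pattern for one "type" of input; whether collapsing runs is the right generalization for the real data is an assumption that would need checking.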