Re: Partitioning a set of strings by regular expressions

Although these modules don't directly solve the problem you're asking about, maybe they are a starting point: Regexp::Trie, Regexp::Assemble, Regex::PreSuf. Note that in your problem statement, taking the strings of a's as an example, you say you're looking for exactly 1, 2, 5, or 7 consecutive a's, but a+ matches any run of one or more a's. The optimizers below, by contrast, take the exact number of a's into account.

use warnings;
use strict;

my @strs = qw/a b c aa bb ccc aaaaa bbb cccccc aaaaaaa bbbb ccccccc/;

# Regexp::Trie: add each literal string, then emit one combined regex
use Regexp::Trie;
my $rt = Regexp::Trie->new;
$rt->add($_) for @strs;
print $rt->regexp, "\n";
# (?^:(?:a(?:a(?:aaa(?:aa)?)?)?|b(?:b(?:bb?)?)?|c(?:cc(?:cccc?)?)?))

# Regexp::Assemble: same idea, with a different factoring of the alternatives
use Regexp::Assemble;
my $ra = Regexp::Assemble->new;
$ra->add($_) for @strs;
print $ra->re, "\n";
# (?^:(?:a(?:a(?:(?:aa)?aaa)?)?|c(?:cc(?:c?ccc)?)?|b(?:b(?:b?b)?)?))

# Regex::PreSuf: presuf() returns the pattern as a plain string
use Regex::PreSuf;
my $re = presuf(@strs);
print $re, "\n";
# (?:aa(?:aaa(?:aa)?)?|bb(?:bb|b)?|ccc(?:cccc?)?|[abc])
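To make the difference between a+ and the generated patterns concrete, here is a small sketch that needs no CPAN modules. It hard-codes the a branch of the Regexp::Trie output shown above and anchors it at both ends; the variable names ($exact, $general, @exact_lengths, @general_lengths) are made up for this illustration.

```perl
use warnings;
use strict;

# The "a" branch of the Regexp::Trie output, anchored at both ends.
my $exact   = qr/^a(?:a(?:aaa(?:aa)?)?)?$/;
# The looser pattern from the problem statement.
my $general = qr/^a+$/;

# Try runs of 1 to 8 a's against both patterns.
my @exact_lengths   = grep { ('a' x $_) =~ $exact }   1 .. 8;
my @general_lengths = grep { ('a' x $_) =~ $general } 1 .. 8;

print "exact:   @exact_lengths\n";
print "general: @general_lengths\n";
```

The first list should contain only the lengths that occur in the sample (1, 2, 5, 7), while the second contains every length tried.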


Replies are listed 'Best First'.
Re^2: Partitioning a set of strings by regular expressions
by Locutus (Beadle) on May 11, 2020 at 12:58 UTC

Thanks for your suggestions! I will take a closer look at each of them.

In fact, I definitely don't want to take the exact number of character repetitions into account. My example simplifies the sort of strings I really have to deal with, but think of S as a sample of input strings already received and of R as the "types" of input strings we can (most likely) ever expect. So it would be completely fine to categorize into a's, b's, and c's.
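
For the simplified example only, the coarser categorization described here could be sketched as follows. This is not a general solution to the real input strings, which are not shown in the thread; category_for is a hypothetical helper, and the '(other)' bucket is an assumption for strings that are not a run of a single character.

```perl
use warnings;
use strict;

my @strs = qw/a b c aa bb ccc aaaaa bbb cccccc aaaaaaa bbbb ccccccc/;

# Hypothetical helper: map a string made of one repeated character to
# the pattern "<char>+"; anything else goes into a catch-all bucket.
sub category_for {
    my ($s) = @_;
    return $s =~ /^(.)\1*\z/s ? quotemeta($1) . '+' : '(other)';
}

# Partition the sample: pattern => list of strings in that category.
my %partition;
push @{ $partition{ category_for($_) } }, $_ for @strs;

for my $pattern (sort keys %partition) {
    printf "%s: %s\n", $pattern, join(' ', @{ $partition{$pattern} });
}
```

With the sample above this yields the three categories a+, b+, and c+, each listing its strings in input order.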
