Locutus has asked for the wisdom of the Perl Monks concerning the following question:

Dear Fellow Monks,

Let S be a set of given strings containing no newline character. I would like to "categorize" the strings in S by constructing a set R of n regular expressions r1, ..., rn of the form qr/^...$/ such that S1, ..., Sn - where Si denotes the subset of S consisting of the strings matched by ri (i=1, ..., n) - is a partition of S.

Needless to say, I'm not interested in trivial partitions like those induced by R = { qr/^.*$/ } or R = { qr/^s$/ for s in S }. The set R should rather be minimal in a sense I'm still not sure how to define, but the following example should make it clear: for S = { a, b, c, aa, bb, ccc, aaaaa, bbb, cccccc, aaaaaaa, bbbb, ccccccc } I would like to construct R = { qr/^a+$/, qr/^b+$/, qr/^c+$/ }.
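
To be concrete about what I mean by "partition", here is a quick sketch (with R written by hand for the example above) of the property I want R to have, namely that every string in S is matched by exactly one of the expressions:

use strict;
use warnings;

my @S = qw/a b c aa bb ccc aaaaa bbb cccccc aaaaaaa bbbb ccccccc/;
my @R = ( qr/^a+$/, qr/^b+$/, qr/^c+$/ );

for my $s (@S) {
    # count how many expressions in R match this string
    my @hits = grep { $s =~ $_ } @R;
    die "'$s' is matched by " . scalar(@hits) . " expressions, not exactly one\n"
        unless @hits == 1;
}
print "R induces a partition of S\n";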

Do you know of a CPAN module that solves this problem? Or of any research papers that deal with it? I might have been searching for the wrong terms but Google hasn't been able to point me to something useful, yet.

Re: Partitioning a set of strings by regular expressions
by haukex (Archbishop) on May 11, 2020 at 12:46 UTC

    Although these modules don't directly solve the problem you're asking about, maybe they are a starting point: Regexp::Trie, Regexp::Assemble, Regex::PreSuf. Note that in your problem statement, using the a strings as an example, you're saying that you're looking for either 1, 2, 5, or 7 consecutive a's, but a+ actually matches more than that. Note how these optimizers take the number of a's into account.

    use warnings;
    use strict;

    my @strs = qw/a b c aa bb ccc aaaaa bbb cccccc aaaaaaa bbbb ccccccc/;

    use Regexp::Trie;
    my $rt = Regexp::Trie->new;
    $rt->add($_) for @strs;
    print $rt->regexp, "\n";
    # (?^:(?:a(?:a(?:aaa(?:aa)?)?)?|b(?:b(?:bb?)?)?|c(?:cc(?:cccc?)?)?))

    use Regexp::Assemble;
    my $ra = Regexp::Assemble->new;
    $ra->add($_) for @strs;
    print $ra->re, "\n";
    # (?^:(?:a(?:a(?:(?:aa)?aaa)?)?|c(?:cc(?:c?ccc)?)?|b(?:b(?:b?b)?)?))

    use Regex::PreSuf;
    my $re = presuf(@strs);
    print $re, "\n";
    # (?:aa(?:aaa(?:aa)?)?|bb(?:bb|b)?|ccc(?:cccc?)?|[abc])

      Thanks for your suggestions! Will take a closer look at each of them.

      In fact, I definitely would not want to take the exact number of character repetitions into account. My example simplifies the sort of strings I really have to deal with, but think of S as a sample of input strings already received and of R as the "types" of input strings we can (most likely) ever expect. So it would be completely fine to categorize into a's, b's, and c's.
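
      For the toy example, one naive way to get exactly that kind of categorization would be to collapse runs of the same character into char+ and take the distinct results as R - this obviously won't survive contact with the real data, it's just to illustrate the level of generalization I'm after:

      use strict;
      use warnings;

      my @S = qw/a b c aa bb ccc aaaaa bbb cccccc aaaaaaa bbbb ccccccc/;

      # squash each run of a repeated character into "char+" and group by the result
      my %classes;
      for my $s (@S) {
          ( my $p = $s ) =~ s/(.)\1*/\Q$1\E+/g;
          push @{ $classes{$p} }, $s;
      }

      for my $p ( sort keys %classes ) {
          printf "qr/^%s\$/ matches: %s\n", $p, join ', ', @{ $classes{$p} };
      }
      # qr/^a+$/ matches: a, aa, aaaaa, aaaaaaa
      # qr/^b+$/ matches: b, bb, bbb, bbbb
      # qr/^c+$/ matches: c, ccc, cccccc, ccccccc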

Re: Partitioning a set of strings by regular expressions
by Athanasius (Archbishop) on May 11, 2020 at 13:02 UTC

    Hello Locutus,

    I assume your sample data are (over-)simplified for the purposes of illustration? The problem is, I find it hard to extrapolate from the example given to a more realistic scenario. Let S = { treasure, tree, train }. Do you want to construct the partition S = { treasure, tree } ∪ { train } by distinguishing words beginning with tre from the others? Or should S form a single class of words beginning with tr? I think a more complex example would go a long way towards clarifying what you are trying to accomplish.
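
    To put it in code, both of the following would satisfy your problem statement for that S, and it is not obvious which one a "minimal" R should prefer:

    # distinguish words beginning with "tre" from words beginning with "tra"
    my @R1 = ( qr/^tre.*$/, qr/^tra.*$/ );   # { treasure, tree } and { train }

    # lump every word beginning with "tr" together
    my @R2 = ( qr/^tr.*$/ );                 # { treasure, tree, train }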

    A preliminary search of the CPAN threw up the String::REPartition module. While this isn’t what you’re looking for, it might give you some useful ideas.

    Hope that helps,

    Athanasius <°(((>< contra mundum

      Hello, Athanasius!

      String::REPartition looks very promising indeed. I wonder why I've missed it so far, but I knew something like that had to be on CPAN :)

      Regarding the real data I have to deal with: they are citations in different styles. The final goal is to extract the volume number, the issue number, the publication year (or even the full date), the start page, the end page, etc., independent of how creatively these pieces of information have been wrapped in context (literally).

        They are citations in different styles. The final goal is to extract the volume number, the issue number, the publication year (or even the full date), the start page, the end page, etc., independent of how creatively these pieces of information have been wrapped in context (literally).

        The XY problem is revealed: you are actually trying to extract meaningful information by parsing tag soup.

        This will probably work best by using a set of manually-curated patterns and a "reject bin" for input that matches none of them. Beware that there are good reasons for the efforts to standardize citations and some styles may be completely ambiguous, even to a human reader.
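
        For example, something along these lines - the patterns, field names, and sample data here are entirely hypothetical, just to show the shape of a curated-patterns-plus-reject-bin approach:

        use strict;
        use warnings;

        # a few hand-curated citation patterns with named captures (hypothetical styles)
        my @patterns = (
            # e.g. "12(3), 45-67, 1999"
            qr/^(?<volume>\d+)\((?<issue>\d+)\),\s*(?<spage>\d+)-(?<epage>\d+),\s*(?<year>\d{4})$/,
            # e.g. "vol. 12, no. 3, pp. 45-67 (1999)"
            qr/^vol\.\s*(?<volume>\d+),\s*no\.\s*(?<issue>\d+),\s*pp\.\s*(?<spage>\d+)-(?<epage>\d+)\s*\((?<year>\d{4})\)$/i,
        );

        my @citations = (
            '12(3), 45-67, 1999',
            'vol. 7, no. 2, pp. 101-115 (2005)',
            'some style nobody anticipated',
        );

        my @rejects;                       # the "reject bin"
        CITATION: for my $citation (@citations) {
            for my $re (@patterns) {
                if ( $citation =~ $re ) {
                    printf "%s => vol %s, issue %s, pages %s-%s, year %s\n",
                        $citation, @+{qw/volume issue spage epage year/};
                    next CITATION;
                }
            }
            push @rejects, $citation;
        }
        print "rejected: $_\n" for @rejects;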

        Heuristics will probably be very helpful to exclude invalid parses: simple rules like start pages must be numbered lower than end pages, publication years must be in the modern era, numbers must be integers, etc. A database of volume/issue tuples that actually exist for various publications could be helpful as well. The few works published prior to the modern era that would be likely to appear in your input are probably best handled as special cases.
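
        A sketch of that kind of sanity checking, applied to fields captured by one of the patterns above (the field names and the year bounds are assumptions, adjust to your data):

        use strict;
        use warnings;

        # reject parses whose extracted fields fail simple plausibility checks
        sub plausible_parse {
            my (%f) = @_;
            return 0 if grep { defined($_) && /\D/ } @f{qw/volume issue spage epage year/};
            return 0 if defined $f{spage} && defined $f{epage} && $f{spage} > $f{epage};
            return 0 if defined $f{year} && ( $f{year} < 1600 || $f{year} > 2100 );
            return 1;
        }

        print plausible_parse( volume => 12, spage => 45, epage => 67, year => 1999 )
            ? "plausible\n" : "reject\n";    # plausible
        print plausible_parse( volume => 12, spage => 90, epage => 67, year => 1999 )
            ? "plausible\n" : "reject\n";    # reject: start page after end page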

        Good luck in your efforts.

Re: Partitioning a set of strings by regular expressions
by perlfan (Parson) on May 12, 2020 at 02:48 UTC
Re: Partitioning a set of strings by regular expressions
by bliako (Abbot) on May 12, 2020 at 15:52 UTC

    The module Biblio::Citation::Parser::Standard uses template matching to match bibliography input. It claims to have collected 400 templates. You can see them and, I guess, add your own.
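
    If I remember the ParaCite/ParaTools interface correctly, usage is roughly as below - treat the method and field names as assumptions and check the module's documentation:

    use strict;
    use warnings;
    use Biblio::Citation::Parser::Standard;

    # parse a free-form citation string into a metadata hash
    my $parser   = Biblio::Citation::Parser::Standard->new;
    my $metadata = $parser->parse(
        'A. Author (1999) Some article title, Journal of Examples, 12(3), 45-67.'
    );

    # field names below follow the OpenURL-style keys I believe the parser returns
    for my $field (qw/aulast aufirst year title volume issue spage epage/) {
        printf "%-8s => %s\n", $field, $metadata->{$field} // '(not found)';
    }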

    Another approach: assuming you can easily extract the title, an author, and/or a DOI/ISBN/etc. (I mean, getting the title is not as difficult as getting the volume), then do a search on these terms at any scientific publisher like ScienceDirect, Springer, et al. and try to get the result back in a standard format like BibTeX. Or perhaps there is a huge database of scientific publications which you can fuzzy-match against your data. I am not talking about doing the search manually, but about creating an automatic tool, i.e. a web scraper.

    There are also tools that claim they can do this sort of thing: AnyStyle.io and Crossref.org. They are open source and may have a web API.
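
    For instance, Crossref's public REST API accepts the raw citation as a bibliographic query and returns structured JSON; a rough sketch (the response fields I pick out below are assumptions, check the API docs):

    use strict;
    use warnings;
    use HTTP::Tiny;
    use JSON::PP qw(decode_json);

    # query Crossref's bibliographic search with the raw citation string
    my $citation = 'A. Author, Some article title, Journal of Examples 12(3), 45-67, 1999';
    my $query    = HTTP::Tiny->new->www_form_urlencode(
        [ rows => 1, 'query.bibliographic' => $citation ]
    );
    my $res = HTTP::Tiny->new->get("https://api.crossref.org/works?$query");
    die "Crossref request failed\n" unless $res->{success};

    # pick out the best match and the fields of interest
    my $item = decode_json( $res->{content} )->{message}{items}[0];
    printf "DOI %s: volume %s, issue %s, pages %s, year %s\n",
        $item->{DOI}    // '?',
        $item->{volume} // '?',
        $item->{issue}  // '?',
        $item->{page}   // '?',
        $item->{issued}{'date-parts'}[0][0] // '?';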

    A combination of all of the above can be successful.

Re: Partitioning a set of strings by regular expressions
by AnomalousMonk (Archbishop) on May 11, 2020 at 20:35 UTC

      Thanks for these pointers, AnomalousMonk!

      If I already had some code to let the monks look at, I'd be happy to follow your advice. By giving a very simplified example of the data to work on, I tried to be both short and self-contained without "hiding" my post from other seekers with a completely different instance of the same general problem. That my original post triggered one or two follow-up questions (which I think I have answered promptly) might rather be due to the fact that this is a communication platform than due to a poorly stated problem description?

        That my original post triggered one or two follow-up questions (which I think I have answered promptly) might rather be due to the fact that this is a communication platform than due to a poorly stated problem description?

        This is indeed a communication platform, and discussions are very welcome to help find the best solution. As for the problem statement, while not bad and stated quite formally, IMHO it oversimplifies a little too much: the test cases you named are so simple (only repeating letters) that they could be solved with very simple solutions that wouldn't have addressed your actual problem - so it definitely wouldn't hurt to mention that you're actually trying to parse citations, lest it become an XY Problem. (Note that Short, Self-Contained, Correct Example generally refers to the code, not necessarily the problem statement.)

        IMHO in this case it sounds like developing heuristics with some validation, like what jcb described, while not as nice and "pure" a solution, will probably get you up and running fastest.