Locutus has asked for the wisdom of the Perl Monks concerning the following question:

Dear Fellow Monks,

Let S be a set of given strings containing no newline character. I would like to "categorize" the strings in S by constructing a set R of n regular expressions r1, ..., rn of the form qr/^...$/ such that S1, ..., Sn - where Si denotes the subset of S consisting of the strings matched by ri (i=1, ..., n) - is a partition of S.

Needless to say, I'm not interested in trivial partitions like those induced by R = { qr/^.*$/ } or R = { qr/^s$/ for s in S }. The set R should rather be minimal in a sense I'm still not sure how to define, but the following example should make it clear: for S = { a, b, c, aa, bb, ccc, aaaaa, bbb, cccccc, aaaaaaa, bbbb, ccccccc } I would like to construct R = { qr/^a+$/, qr/^b+$/, qr/^c+$/ }.
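
To be concrete about what I mean by "partition", here is a quick sketch (with R written by hand for the example above) of the property I want R to have, namely that every string in S is matched by exactly one of the expressions:

use strict;
use warnings;

my @S = qw/a b c aa bb ccc aaaaa bbb cccccc aaaaaaa bbbb ccccccc/;
my @R = ( qr/^a+$/, qr/^b+$/, qr/^c+$/ );

for my $s (@S) {
    # count how many expressions in R match this string
    my @hits = grep { $s =~ $_ } @R;
    die "'$s' is matched by " . scalar(@hits) . " expressions, not exactly one\n"
        unless @hits == 1;
}
print "R induces a partition of S\n";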

Do you know of a CPAN module that solves this problem? Or of any research papers that deal with it? I might have been searching for the wrong terms but Google hasn't been able to point me to something useful, yet.

Re: Partitioning a set of strings by regular expressions
by haukex (Archbishop) on May 11, 2020 at 12:46 UTC

    Although these modules don't directly solve the problem you're asking about, maybe they are a starting point: Regexp::Trie, Regexp::Assemble, Regex::PreSuf. Note that in your problem statement, using the a strings as an example, you're saying that you're looking for either 1, 2, 5, or 7 consecutive a's, but a+ actually matches more than that. Note how these optimizers take the number of a's into account.

    use warnings;
    use strict;

    my @strs = qw/a b c aa bb ccc aaaaa bbb cccccc aaaaaaa bbbb ccccccc/;

    use Regexp::Trie;
    my $rt = Regexp::Trie->new;
    $rt->add($_) for @strs;
    print $rt->regexp, "\n";
    # (?^:(?:a(?:a(?:aaa(?:aa)?)?)?|b(?:b(?:bb?)?)?|c(?:cc(?:cccc?)?)?))

    use Regexp::Assemble;
    my $ra = Regexp::Assemble->new;
    $ra->add($_) for @strs;
    print $ra->re, "\n";
    # (?^:(?:a(?:a(?:(?:aa)?aaa)?)?|c(?:cc(?:c?ccc)?)?|b(?:b(?:b?b)?)?))

    use Regex::PreSuf;
    my $re = presuf(@strs);
    print $re, "\n";
    # (?:aa(?:aaa(?:aa)?)?|bb(?:bb|b)?|ccc(?:cccc?)?|[abc])

      Thanks for your suggestions! Will take a closer look at each of them.

      In fact, I definitely would not want to take the exact number of character repetitions into account. My example simplifies the sort of strings I really have to deal with, but think of S as a sample of input strings already received and of R as the "types" of input strings we can (most likely) ever expect. So it would be completely fine to categorize into a's, b's, and c's.
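
      For the toy example, one naive way to get exactly that kind of categorization would be to collapse runs of the same character into char+ and take the distinct results as R - this obviously won't survive contact with the real data, it's just to illustrate the level of generalization I'm after:

      use strict;
      use warnings;

      my @S = qw/a b c aa bb ccc aaaaa bbb cccccc aaaaaaa bbbb ccccccc/;

      # squash each run of a repeated character into "char+" and group by the result
      my %classes;
      for my $s (@S) {
          ( my $p = $s ) =~ s/(.)\1*/\Q$1\E+/g;
          push @{ $classes{$p} }, $s;
      }

      for my $p ( sort keys %classes ) {
          printf "qr/^%s\$/ matches: %s\n", $p, join ', ', @{ $classes{$p} };
      }
      # qr/^a+$/ matches: a, aa, aaaaa, aaaaaaa
      # qr/^b+$/ matches: b, bb, bbb, bbbb
      # qr/^c+$/ matches: c, ccc, cccccc, ccccccc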

Re: Partitioning a set of strings by regular expressions
by Athanasius (Archbishop) on May 11, 2020 at 13:02 UTC

    Hello Locutus,

    I assume your sample data are (over-)simplified for the purposes of illustration? The problem is, I find it hard to extrapolate from the example given to a more realistic scenario. Let S = { treasure, tree, train }. Do you want to construct the partition S = { treasure, tree } ∪ { train } by distinguishing words beginning with tre from the others? Or should S form a single class of words beginning with tr? I think a more complex example would go a long way towards clarifying what you are trying to accomplish.
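
    To put it in code, both of the following would satisfy your problem statement for that S, and it is not obvious which one a "minimal" R should prefer:

    # distinguish words beginning with "tre" from words beginning with "tra"
    my @R1 = ( qr/^tre.*$/, qr/^tra.*$/ );   # { treasure, tree } and { train }

    # lump every word beginning with "tr" together
    my @R2 = ( qr/^tr.*$/ );                 # { treasure, tree, train }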

    A preliminary search of the CPAN threw up the String::REPartition module. While this isn’t what you’re looking for, it might give you some useful ideas.

    Hope that helps,

    Athanasius <°(((>< contra mundum

      Hello, Athanasius!

      String::REPartition looks very promising indeed. I wonder why I've missed it so far, but I knew something like that had to be on CPAN :)

      Regarding the real data I have to deal with: they are citations in different styles. The final goal is to extract the volume number, the issue number, the publication year (or even the full date), the start page, the end page, etc., independent of how creatively these pieces of information have been wrapped in context (literally).

        They are citations in different styles. The final goal is to extract the volume number, the issue number, the publication year (or even the full date), the start page, the end page, etc., independent of how creatively these pieces of information have been wrapped in context (literally).

        The XY problem is revealed: you are actually trying to extract meaningful information by parsing tag soup.

        This will probably work best by using a set of manually-curated patterns and a "reject bin" for input that matches none of them. Beware that there are good reasons for the efforts to standardize citations and some styles may be completely ambiguous, even to a human reader.
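
        For example, something along these lines - the patterns, field names, and sample data here are entirely hypothetical, just to show the shape of a curated-patterns-plus-reject-bin approach:

        use strict;
        use warnings;

        # a few hand-curated citation patterns with named captures (hypothetical styles)
        my @patterns = (
            # e.g. "12(3), 45-67, 1999"
            qr/^(?<volume>\d+)\((?<issue>\d+)\),\s*(?<spage>\d+)-(?<epage>\d+),\s*(?<year>\d{4})$/,
            # e.g. "vol. 12, no. 3, pp. 45-67 (1999)"
            qr/^vol\.\s*(?<volume>\d+),\s*no\.\s*(?<issue>\d+),\s*pp\.\s*(?<spage>\d+)-(?<epage>\d+)\s*\((?<year>\d{4})\)$/i,
        );

        my @citations = (
            '12(3), 45-67, 1999',
            'vol. 7, no. 2, pp. 101-115 (2005)',
            'some style nobody anticipated',
        );

        my @rejects;                       # the "reject bin"
        CITATION: for my $citation (@citations) {
            for my $re (@patterns) {
                if ( $citation =~ $re ) {
                    printf "%s => vol %s, issue %s, pages %s-%s, year %s\n",
                        $citation, @+{qw/volume issue spage epage year/};
                    next CITATION;
                }
            }
            push @rejects, $citation;
        }
        print "rejected: $_\n" for @rejects;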

        Heuristics will probably be very helpful to exclude invalid parses: simple rules like start pages must be numbered lower than end pages, publication years must be in the modern era, numbers must be integers, etc. A database of volume/issue tuples that actually exist for various publications could be helpful as well. The few works published prior to the modern era that would be likely to appear in your input are probably best handled as special cases.
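
        A sketch of that kind of sanity checking, applied to fields captured by one of the patterns above (the field names and the year bounds are assumptions, adjust to your data):

        use strict;
        use warnings;

        # reject parses whose extracted fields fail simple plausibility checks
        sub plausible_parse {
            my (%f) = @_;
            return 0 if grep { defined($_) && /\D/ } @f{qw/volume issue spage epage year/};
            return 0 if defined $f{spage} && defined $f{epage} && $f{spage} > $f{epage};
            return 0 if defined $f{year} && ( $f{year} < 1600 || $f{year} > 2100 );
            return 1;
        }

        print plausible_parse( volume => 12, spage => 45, epage => 67, year => 1999 )
            ? "plausible\n" : "reject\n";    # plausible
        print plausible_parse( volume => 12, spage => 90, epage => 67, year => 1999 )
            ? "plausible\n" : "reject\n";    # reject: start page after end page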

        Good luck in your efforts.

Re: Partitioning a set of strings by regular expressions
by perlfan (Parson) on May 12, 2020 at 02:48 UTC
Re: Partitioning a set of strings by regular expressions
by bliako (Abbot) on May 12, 2020 at 15:52 UTC

    The module Biblio::Citation::Parser::Standard uses template matching to match bibliography input. It claims to have collected 400 templates. You can see them and, I guess, add your own.
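
    If I remember the ParaCite/ParaTools interface correctly, usage is roughly as below - treat the method and field names as assumptions and check the module's documentation:

    use strict;
    use warnings;
    use Biblio::Citation::Parser::Standard;

    # parse a free-form citation string into a metadata hash
    my $parser   = Biblio::Citation::Parser::Standard->new;
    my $metadata = $parser->parse(
        'A. Author (1999) Some article title, Journal of Examples, 12(3), 45-67.'
    );

    # field names below follow the OpenURL-style keys I believe the parser returns
    for my $field (qw/aulast aufirst year title volume issue spage epage/) {
        printf "%-8s => %s\n", $field, $metadata->{$field} // '(not found)';
    }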

    Another approach: assuming you can easily extract the title, an author, and/or a DOI/ISBN/etc. (I mean, getting the title is not as difficult as getting the volume), then do a search on these terms at any scientific publisher like ScienceDirect, Springer, et al. and try to get the result back in a standard format like BibTeX. Or perhaps there is a huge database of scientific publications which you can fuzzy-match against your data. I am not talking about doing the search manually, but about creating an automatic tool, i.e. a web scraper.

    There are also tools that claim they can do this sort of thing: AnyStyle.io and Crossref.org. They are open source and may have a web API.
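
    For instance, Crossref's public REST API accepts the raw citation as a bibliographic query and returns structured JSON; a rough sketch (the response fields I pick out below are assumptions, check the API docs):

    use strict;
    use warnings;
    use HTTP::Tiny;
    use JSON::PP qw(decode_json);

    # query Crossref's bibliographic search with the raw citation string
    my $citation = 'A. Author, Some article title, Journal of Examples 12(3), 45-67, 1999';
    my $query    = HTTP::Tiny->new->www_form_urlencode(
        [ rows => 1, 'query.bibliographic' => $citation ]
    );
    my $res = HTTP::Tiny->new->get("https://api.crossref.org/works?$query");
    die "Crossref request failed\n" unless $res->{success};

    # pick out the best match and the fields of interest
    my $item = decode_json( $res->{content} )->{message}{items}[0];
    printf "DOI %s: volume %s, issue %s, pages %s, year %s\n",
        $item->{DOI}    // '?',
        $item->{volume} // '?',
        $item->{issue}  // '?',
        $item->{page}   // '?',
        $item->{issued}{'date-parts'}[0][0] // '?';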

    A combination of all of the above can be successful.

Re: Partitioning a set of strings by regular expressions
by AnomalousMonk (Archbishop) on May 11, 2020 at 20:35 UTC

      Thanks for these pointers, AnomalousMonk!

      If I already had some code to let the monks look at, I'd be happy to follow your advice. By giving a very simplified example of the data to work on, I tried to be both short and self-contained without "hiding" my post from other seekers with a completely different instance of the same general problem. That my original post triggered one or two follow-up questions (which I think I have answered promptly) might rather be due to the fact that this is a communication platform than due to a poorly stated problem description?

        That my original post triggered one or two follow-up questions (which I think I have answered promptly) might rather be due to the fact that this is a communication platform than due to a poorly stated problem description?

        This is indeed a communication platform, and discussions are very welcome to help find the best solution. As for the problem statement, while not bad and stated quite formally, IMHO it oversimplifies a little too much: the test cases you named are so simple (only repeating letters) that they could be solved with very simple solutions that wouldn't have addressed your actual problem - so it definitely wouldn't hurt to mention that you're actually trying to parse citations, lest it become an XY Problem. (Note that Short, Self-Contained, Correct Example generally refers to the code, not necessarily the problem statement.)

        IMHO in this case it sounds like developing heuristics with some validation, like what jcb described, while not as nice and "pure" a solution, will probably get you up and running fastest.