Re^3: Partitioning a set of strings by regular expressions

They are citations in different styles. The final goal is to extract the volume number, the issue number, the publication year (or even date), the start page, the end page etc. - independent of how creatively these pieces of information have been wrapped in context (literally).

The XY problem is revealed: you are actually trying to extract meaningful information by parsing tag soup.

This will probably work best by using a set of manually-curated patterns and a "reject bin" for input that matches none of them. Beware that there are good reasons for the efforts to standardize citations and some styles may be completely ambiguous, even to a human reader.
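To make the pattern-set-plus-reject-bin idea concrete, here is a minimal sketch in Python. The two patterns, their names, and the sample strings are made up for illustration and are not taken from your data; a real list would come from inspecting the actual citations and would grow as the reject bin is reviewed.

```python
import re

# Hypothetical patterns for two invented citation styles. The names and
# regexes are assumptions for illustration only.
PATTERNS = [
    ("vol_issue_year_pages",
     re.compile(r"^(?P<volume>\d+)\((?P<issue>\d+)\)\s*\((?P<year>\d{4})\),?\s*"
                r"(?P<start>\d+)-(?P<end>\d+)$")),
    ("labelled",
     re.compile(r"^Vol\.\s*(?P<volume>\d+),\s*No\.\s*(?P<issue>\d+)\s*"
                r"\((?P<year>\d{4})\),\s*pp?\.\s*(?P<start>\d+)-(?P<end>\d+)$")),
]

def partition(citations):
    """Split citations into (pattern name, fields) pairs and a reject bin."""
    parsed, rejects = [], []
    for text in citations:
        for name, pattern in PATTERNS:
            match = pattern.match(text.strip())
            if match:
                parsed.append((name, match.groupdict()))
                break  # first matching pattern wins
        else:
            # No pattern matched: keep the raw string for manual review.
            rejects.append(text)
    return parsed, rejects

sample = [
    "12(3) (1998), 45-67",
    "Vol. 7, No. 2 (2004), pp. 101-110",
    "Bd. 4, S. 12ff.",
]
parsed, rejects = partition(sample)
```

The first matching pattern wins here, so pattern order matters; with overlapping styles you might instead collect every match and let later checks decide.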

Heuristics will probably be very helpful to exclude invalid parses: simple rules like start pages must be numbered lower than end pages, publication years must be in the modern era, numbers must be integers, etc. A database of volume/issue tuples that actually exist for various publications could be helpful as well. The few works published prior to the modern era that would be likely to appear in your input are probably best handled as special cases.
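Those sanity checks could look roughly like the following Python sketch. The field names (`year`, `start`, `end`) and the year bounds are assumptions chosen for illustration; the cutoff for "modern era" is yours to pick. The sketch also lets the start page equal the end page, on the assumption that single-page items exist in the data.

```python
def plausible(fields, min_year=1500, max_year=2100):
    """Return True if an extracted citation passes simple sanity checks.

    `fields` is assumed to be a dict of strings with the hypothetical keys
    'year', 'start' and 'end'. The year bounds are arbitrary placeholders.
    """
    try:
        # int() raises ValueError for non-integer strings such as "4.5".
        year = int(fields["year"])
        start = int(fields["start"])
        end = int(fields["end"])
    except (KeyError, ValueError):
        return False
    # Year must fall in the accepted range; start page must not exceed end page.
    return min_year <= year <= max_year and start <= end

ok = plausible({"year": "1998", "start": "45", "end": "67"})
reversed_pages = plausible({"year": "1998", "start": "67", "end": "45"})
ancient = plausible({"year": "12", "start": "1", "end": "9"})
fractional = plausible({"year": "1998", "start": "4.5", "end": "9"})
```

A parse that fails these checks is a candidate for the reject bin or for trying the next pattern, rather than proof that the citation itself is wrong.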

Good luck in your efforts.

Re^4: Partitioning a set of strings by regular expressions by Locutus (Beadle) on May 12, 2020 at 15:11 UTC
Well, "tag soup" sounds more chaotic than it turned out to be. At the moment I have a text file with more than 2M citations recorded by several different catalogers. If you let your eyes run down that list, you do see that finding patterns, each of which matches quite a large bunch of these citations, should be possible, and tackling the task as you suggested was my first intuition. However, I felt there might be a less tedious way to do it :-) I am well aware of the various efforts to standardize citations, but the core problem seems to be their variety...
Re^5: Partitioning a set of strings by regular expressions by hippo (Archbishop) on May 12, 2020 at 15:44 UTC
However, I felt there might be a less tedious way to do it :-)

Well, there's always the brute force approach.

I am well aware of the various efforts to standardize citations but the core problem seems to be their variety...

ObXkcd