in reply to Re^2: Partitioning a set of strings by regular expressions
in thread Partitioning a set of strings by regular expressions
They are citations in different styles. The final goal is to extract the volume number, the issue number, the publication year (or even date), the start page, the end page etc. - independent of how creatively these pieces of information have been wrapped in context (literally).
The XY problem is revealed: you are actually trying to extract meaningful information by parsing tag soup.
This will probably work best with a set of manually curated patterns and a "reject bin" for input that matches none of them. Beware that citations are standardized for good reasons: some styles may be completely ambiguous, even to a human reader.
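As a rough illustration of the curated-patterns-plus-reject-bin idea, here is a minimal sketch in Python. The two patterns, their labels, and the function name `partition` are all made up for this example; real patterns would have to be curated by hand from the actual input.

```python
import re

# Hypothetical patterns for two invented citation styles. Each uses the same
# named groups so that downstream code does not care which style matched.
PATTERNS = [
    ("vol(issue):pages (year)", re.compile(
        r"(?P<volume>\d+)\((?P<issue>\d+)\):"
        r"(?P<start>\d+)-(?P<end>\d+)\s*\((?P<year>\d{4})\)")),
    ("Vol., No., pp., year", re.compile(
        r"Vol\.\s*(?P<volume>\d+),\s*No\.\s*(?P<issue>\d+),\s*"
        r"pp\.\s*(?P<start>\d+)-(?P<end>\d+),\s*(?P<year>\d{4})")),
]

def partition(citations):
    """Split citations into (style, fields) pairs and a reject bin."""
    parsed, rejects = [], []
    for text in citations:
        for style, pattern in PATTERNS:
            match = pattern.search(text)
            if match:
                fields = {k: int(v) for k, v in match.groupdict().items()}
                parsed.append((style, fields))
                break  # first matching pattern wins
        else:
            # No pattern matched: keep the string for manual inspection.
            rejects.append(text)
    return parsed, rejects
```

Whatever lands in the reject bin tells you which pattern to write next, so the pattern list can grow until the bin is small enough to handle by hand.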
Heuristics will probably be very helpful to exclude invalid parses: simple rules like start pages must be numbered lower than end pages, publication years must be in the modern era, numbers must be integers, etc. A database of volume/issue tuples that actually exist for various publications could be helpful as well. The few works published prior to the modern era that would be likely to appear in your input are probably best handled as special cases.
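The sanity checks above might look something like the following Python sketch. The function name `is_plausible`, the field names, the `known_issues` set, and the cutoff year of 1500 for "the modern era" are all assumptions made for illustration, not anything fixed by the problem.

```python
import re
from datetime import date

def is_plausible(fields, known_issues=None, earliest_year=1500):
    """Return True if the extracted fields pass simple sanity checks.

    fields: dict of raw strings captured by a pattern (hypothetical keys:
            volume, issue, start, end, year).
    known_issues: optional set of (volume, issue) tuples known to exist
                  for the publication (hypothetical lookup data).
    """
    # Numbers must be plain non-negative integers.
    if not all(re.fullmatch(r"[0-9]+", str(v)) for v in fields.values()):
        return False
    n = {k: int(v) for k, v in fields.items()}

    # The start page must not come after the end page. Equality is allowed
    # here on the assumption that single-page items exist.
    if "start" in n and "end" in n and n["start"] > n["end"]:
        return False

    # The publication year must fall in the assumed modern era.
    if "year" in n and not earliest_year <= n["year"] <= date.today().year:
        return False

    # If we know which volume/issue pairs exist, the parse must be one of them.
    if known_issues is not None and "volume" in n and "issue" in n:
        if (n["volume"], n["issue"]) not in known_issues:
            return False

    return True
```

A parse that fails these checks is not necessarily a bad citation; it may just mean the wrong pattern matched, so such strings are good candidates for the reject bin too.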
Good luck in your efforts.