They are citations in different styles. The final goal is to extract the volume number, the issue number, the publication year (or even date), the start page, the end page, etc., regardless of how creatively these pieces of information have been wrapped in context (literally).

The XY problem is revealed: you are actually trying to extract meaningful information by parsing tag soup.

This will probably work best by using a set of manually-curated patterns and a "reject bin" for input that matches none of them. Beware that there are good reasons for the efforts to standardize citations and some styles may be completely ambiguous, even to a human reader.
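A minimal sketch of that idea, in Python for brevity (the same structure carries over to Perl regexes). The two patterns and the name `partition` are made up for illustration; a real pattern set would have to be curated from the actual input.

```python
import re

# Hypothetical patterns for two invented citation styles. Each uses named
# groups so that every style yields the same field names.
PATTERNS = [
    # e.g. "12(3):45-67, 1999"
    ("vol(issue):pages, year", re.compile(
        r"(?P<volume>\d+)\((?P<issue>\d+)\):"
        r"(?P<start>\d+)-(?P<end>\d+),\s*(?P<year>\d{4})")),
    # e.g. "vol. 12, no. 3, pp. 45-67 (1999)"
    ("vol./no./pp. (year)", re.compile(
        r"vol\.\s*(?P<volume>\d+),\s*no\.\s*(?P<issue>\d+),\s*"
        r"pp\.\s*(?P<start>\d+)-(?P<end>\d+)\s*\((?P<year>\d{4})\)",
        re.IGNORECASE)),
]

def partition(citations):
    """Split citations into (style name, fields) pairs and a reject bin."""
    parsed, rejects = [], []
    for text in citations:
        for name, rx in PATTERNS:
            m = rx.search(text)
            if m:
                # Convert every captured field to an integer.
                fields = {k: int(v) for k, v in m.groupdict().items()}
                parsed.append((name, fields))
                break
        else:
            # No pattern matched: keep the raw string for manual review.
            rejects.append(text)
    return parsed, rejects
```

The first matching pattern wins, so more specific patterns should come first in the list. Whatever lands in the reject bin tells you which pattern to write next.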

Heuristics will probably be very helpful for excluding invalid parses: simple rules such as start pages must not be numbered higher than end pages, publication years must be in the modern era, numbers must be integers, etc. A database of volume/issue tuples that actually exist for various publications could be helpful as well. The few works published prior to the modern era that are likely to appear in your input are probably best handled as special cases.
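Such sanity checks could look roughly like the sketch below (Python again, purely illustrative). The function name `rejection_reasons`, the field names, the cutoff of 1800 for "modern era", and the `known_issues` set of (volume, issue) tuples for a single publication are all assumptions. Single-page items (start equal to end) are allowed here.

```python
from datetime import date

def rejection_reasons(fields, known_issues=None, earliest_year=1800):
    """Return the reasons a parsed citation looks wrong (empty if plausible)."""
    reasons = []
    non_integers = [k for k, v in fields.items() if not isinstance(v, int)]
    if non_integers:
        reasons.append("non-integer fields: " + ", ".join(sorted(non_integers)))
        return reasons  # the comparisons below need integers
    if "start" in fields and "end" in fields and fields["start"] > fields["end"]:
        reasons.append("start page after end page")
    # Allow next year's date, since issues are sometimes dated ahead.
    latest_year = date.today().year + 1
    if "year" in fields and not earliest_year <= fields["year"] <= latest_year:
        reasons.append("implausible publication year")
    if known_issues is not None:
        # known_issues is a hypothetical lookup of issues known to exist.
        if (fields.get("volume"), fields.get("issue")) not in known_issues:
            reasons.append("unknown volume/issue")
    return reasons
```

Returning reasons rather than a plain true/false makes it easier to see why a parse was thrown out, and the pre-modern special cases can be checked before this function is ever called.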

Good luck in your efforts.


In reply to Re^3: Partitioning a set of strings by regular expressions by jcb
in thread Partitioning a set of strings by regular expressions by Locutus
