Yeah, parsing free-form stuff is hard :) but it's not exactly free-form, it's some kind of scientific citation style (see https://en.wikipedia.org/wiki/Citation#Sciences.2C_mathematics.2C_engineering.2C_physiology.2C_and_medicine) ... so maybe something along the lines of Biblio::Refbase???
sub breakup {
    my $filename = $_[0];
    my $text     = $_[1];
    my $re = qr{
        # "key: val;" pair, for example "Tsukuba: JIRCAS;"
        (?<apair>
            (?<key>[^:;,\.\s]+) \s* : \s* (?<val>[^:;,\.\s]+) \s* ;
        )
        | (?<ppair>
            (?<key> \p{Uppercase_Letter}\w+ ) : (?<val> \p{Uppercase_Letter}+ )
        )
        | (?<ppage> pp\.\s*\d+-\d+\. )
        | (?<version3> \d+\.\d+\.\d+ (?::\d+)? )
        | (?<version2> \d+\.\d+:\d+ )
        | (?<year> \( \d+ \) )
        ##| (?<name> (?: [^\s,\.]+, )+ )
        | (?<name> (?: \p{Uppercase_Letter}[.\w]+, )+ )
        ## | (?<title> (?:[^\s;.,\(\)]+\s*)+[.;,] )
        | (?<title> (?:[^\s;.,\(\)]+\s*){3,} [.;,?!]? )
        | (?<comma>,)
        | (?<notcommanotspace>[^\s,]+)
        | (?<space>\s+)
        | (?<other>.)
    }msx;
    my @parts;
    while( $text =~ m{$re}g ){
        my $it = { %+ };          # copy the named captures of this match
        next if $it->{space};     # whitespace tokens are dropped
        push @parts, $it;
    }
    return \@parts ;
}
__END__
[
  { year => "(1990)" },
  { title => "Methods of the Association of Official Analytical Chemists." },
  { notcommanotspace => "15." },
  { name => "Ed," },
  { comma => "," },
  { title => "Association of Official Analytical Chemists Washington;" },
  { name => "Dumet," },
  { name => "D.," },
  { name => "Benson," },
  { name => "E.E.," },
  { title => "The use of physical and biochemical studies to elucidate and reduce cryopreservation-induced damage in hydrated/desiccated plant germplasm " },
  { year => "(2000)" },
  { title => "Cryopreservation of Tropical Plant Germplasm: Current Research Progress and Application," },
  { ppage => "pp. 43-56." },
  { comma => "," },
  { notcommanotspace => "F." },
  { title => "Engelmann and H." },
  { title => "Takagi " },
  { notcommanotspace => "(eds.)" },
  { apair => "Tsukuba: JIRCAS;", key => "Tsukuba", val => "JIRCAS" },
  { apair => "Rome: IPGRI;", key => "Rome", val => "IPGRI" },
  { name => "Ferreira," },
  { name => "D.F.," },
  { title => "An\x{2B69}ses estat\x{ED34}icas por meio do SISVAR para Windows vers\x{4BE0}4." },
  { version2 => "0.1:225" },
  { year => "(2000)" },
  { title => "Reuni\x{4BE0}Anual da Regi\x{4BE0}Brasileira da Sociedade Internacional de Biometria," },
  { comma => "," },
  { name => "S\x{4BE0}Carlos," },
  { title => "SP: UFSCAR" },
]
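The same idea in miniature, in case the big regex hides it: one alternation of named groups, matched repeatedly with /g, where the single key left in %+ tells you which alternative fired. This is only a rough sketch. The name tokenize, the reduced set of token types, and the sample string are made up for illustration and are not part of breakup.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the post): a cut-down tokenizer using the
# same named-capture alternation plus /g loop as breakup.
sub tokenize {
    my ($text) = @_;
    my $re = qr{
          (?<year>  \( \d{4} \) )
        | (?<pages> pp\.\s*\d+-\d+ )
        | (?<word>  [^\s()]+ )
        | (?<space> \s+ )
        | (?<other> . )
    }msx;
    my @tokens;
    while ( $text =~ m{$re}g ) {
        my %hit = %+;                  # copy, because the next match resets %+
        next if exists $hit{space};    # drop whitespace tokens
        my ($type) = keys %hit;        # exactly one alternative matched
        push @tokens, [ $type, $hit{$type} ];
    }
    return \@tokens;
}

my $tokens = tokenize('Benson (2000) pp. 43-56');
print join( ' ', map { "$_->[0]=$_->[1]" } @$tokens ), "\n";
```

Order matters in the alternation: the specific patterns (year, pages) have to come before the catch-all word, otherwise the catch-all swallows everything first.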

In reply to Re^3: Regex for splitting a string on a semicolon (conditionally) by Anonymous Monk
in thread Regex for splitting a string on a semicolon (conditionally) by grouse
