Yeah, parsing free-form stuff is hard :) But it's not exactly free-form; it's some kind of citation format (see https://en.wikipedia.org/wiki/Citation#Sciences.2C_mathematics.2C_engineering.2C_physiology.2C_and_medicine), so maybe something like Biblio::Refbase could help???

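# Tokenize one citation string into a list of { token_type => matched_text } hashes.
sub breakup {
    my ( $filename, $text ) = @_;    # $filename is not used in this sub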
    my $re = qr{
(?<apair>
    (?<key>[^:;,\.\s]+)
    \s*
    :
    \s*
    (?<val>[^:;,\.\s]+)
    \s*
    ;
)
|(?<ppair>
    (?<key> \p{Uppercase_Letter}\w+ )
    :
    (?<val> \p{Uppercase_Letter}+ )
)
| (?<ppage>  pp\.\s*\d+-\d+\. )
| (?<version3>  \d+\.\d+\.\d+ (?::\d+)? )
| (?<version2>  \d+\.\d+:\d+ )
| (?<year>  \( \d+ \) )
##| (?<name>  (?:  [^\s,\.]+, )+   )
| (?<name>  (?:  \p{Uppercase_Letter}[.\w]+, )+   )
## | (?<title>  (?:[^\s;.,\(\)]+\s*)+[.;,] )
| (?<title>  (?:[^\s;.,\(\)]+\s*){3,} [.;,?!]? )
| (?<comma>,)
| (?<notcommanotspace>[^\s,]+)
| (?<space>\s+)
| (?<other>.)

    }msx;
    my @parts;
    while( $text =~ m{$re}g ){
        my $it   = { %+ };
        next if $it->{space};
        push @parts, $it;
    }
    return \@parts ;
}

__END__
[
  { year => "(1990)" },
  {
    title => "Methods of the Association of Official Analytical Chemists.",
  },
  { notcommanotspace => "15." },
  { name => "Ed," },
  { comma => "," },
  {
    title => "Association of Official Analytical Chemists Washington;",
  },
  { name => "Dumet," },
  { name => "D.," },
  { name => "Benson," },
  { name => "E.E.," },
  {
    title => "The use of physical and biochemical studies to elucidate and reduce cryopreservation-induced damage in hydrated/desiccated plant germplasm ",
  },
  { year => "(2000)" },
  {
    title => "Cryopreservation of Tropical Plant Germplasm: Current Research Progress and Application,",
  },
  { ppage => "pp. 43-56." },
  { comma => "," },
  { notcommanotspace => "F." },
  { title => "Engelmann and H." },
  { title => "Takagi " },
  { notcommanotspace => "(eds.)" },
  { apair => "Tsukuba: JIRCAS;", key => "Tsukuba", val => "JIRCAS" },
  { apair => "Rome: IPGRI;", key => "Rome", val => "IPGRI" },
  { name => "Ferreira," },
  { name => "D.F.," },
  {
    title => "An\x{2B69}ses estat\x{ED34}icas por meio do SISVAR para Windows vers\x{4BE0}4.",
  },
  { version2 => "0.1:225" },
  { year => "(2000)" },
  {
    title => "Reuni\x{4BE0}Anual da Regi\x{4BE0}Brasileira da Sociedade Internacional de Biometria,",
  },
  { comma => "," },
  { name => "S\x{4BE0}Carlos," },
  { title => "SP: UFSCAR" },
]
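
In case the technique in breakup() isn't obvious: it's one big alternation of named captures, and the first alternative that matches at the current position wins, so the order of the branches is what decides how a piece of text gets labelled. Here is a cut-down, self-contained sketch of the same idea. The sub name tokenize, the token names, and the sample string are made up for illustration; this is not a complete citation parser, and I anchored it with \G so nothing gets skipped silently.

```perl
use strict;
use warnings;

# Hypothetical, cut-down tokenizer using the same technique as breakup():
# an alternation of named captures, tried left to right at each position.
sub tokenize {
    my ($text) = @_;
    my $re = qr{
          (?<year>  \( \d{4} \) )                      # (2000)
        | (?<pages> pp\. \s* \d+ - \d+ )               # pp. 43-56
        | (?<name>  \p{Uppercase_Letter} [.\w]* , )    # Dumet,  E.E.,
        | (?<word>  [^\s,]+ )                          # anything else without space/comma
        | (?<comma> , )
        | (?<space> \s+ )
    }x;
    my @tokens;
    # \G plus /gc: each match must start exactly where the previous one ended
    while ( $text =~ /\G$re/gc ) {
        my %m = %+;                  # copy, because %+ changes on the next match
        next if exists $m{space};    # whitespace only separates tokens
        push @tokens, \%m;
    }
    return \@tokens;
}

my $tokens = tokenize('Dumet, D., Benson, E.E., The use (2000) pp. 43-56');

# Print each token as type=text, one token per line
for my $tok (@$tokens) {
    my ($type) = keys %$tok;
    print "$type=$tok->{$type}\n";
}
```

Since %+ only holds the named groups that actually captured, each token hash ends up with a single key telling you which branch matched. Moving the word branch above name would turn every author into a plain word, which is why the more specific patterns have to come first.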

In reply to Re^3: Regex for splitting a string on a semicolon (conditionally) by Anonymous Monk
in thread Regex for splitting a string on a semicolon (conditionally) by grouse
