in reply to Regex for splitting a string on a semicolon (conditionally)

This format looks rather messed up. I doubt that your heuristics will be 100% correct, or that Text::CSV can help you here.

Your immediate problem, though, is that split also returns whatever the capturing (...) groups in the separator pattern matched. Disable capturing by using non-capturing groups, (?:...), instead.

see perlretut#Non-capturing-groupings
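A minimal sketch of the difference, using a made-up sample string rather than the real data:

```perl
use strict;
use warnings;

# Made-up sample line; the real input is messier.
my $line = 'Smith, J.; Jones-Brown, A.B.; 1990';

# Capturing group in the separator: split returns the captured text as well.
my @with_captures = split /(;)\s*/, $line;

# Non-capturing group: only the fields come back.
my @fields = split /(?:;)\s*/, $line;

print scalar(@with_captures), " items with (...)\n";
print scalar(@fields),        " items with (?:...)\n";
```

With the capturing version every separator shows up as an extra list element between the fields, which is most likely where the unexpected entries came from.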

update

> followed by a name (these sometimes have hyphens, but I'm not sure I've accounted for that), followed by initials.

I strongly recommend composing the pattern from sub-regexes like $name and $initial, and using the /x flag, for readability and maintainability.

You'd quickly discover some of the bugs I was able to spot. (Hint: initials are capital letters and A-z is not what you want; the first entry starts with a date rather than a name; and so on.)
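Roughly what I have in mind, as a sketch only. The definitions of $name, $initial, $author and $sep are my guesses at the format, and the sample text is invented, so adjust them to the real data:

```perl
use strict;
use warnings;

# Hypothetical building blocks, not taken from the original code.
my $name    = qr/ \p{Lu} [\p{L}'-]+ /x;               # surname, hyphens allowed
my $initial = qr/ \p{Lu} \. /x;                       # one capital initial plus dot
my $author  = qr/ $name , \s* (?: $initial \s* )+ /x; # "Benson, E.E."

# Split on a semicolon only when an author follows it.
# The lookahead does not capture, so split returns nothing extra.
my $sep = qr/ ; \s* (?= $author ) /x;

# Invented sample: the semicolon after "Cryo" is not followed by an author.
my $text    = 'Dumet, D.; Benson, E.E. (2000) Cryo; methods; Ferreira, D.F. (2000)';
my @entries = split $sep, $text;

print "$_\n" for @entries;
```

Each piece can be tested on its own, and a change such as allowing lowercase initials is made in exactly one place.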

Good luck!

Cheers Rolf

PS: Je suis Charlie!

Re^2: Regex for splitting a string on a semicolon (conditionally)
by grouse (Initiate) on Feb 12, 2015 at 09:03 UTC

    The format is definitely not optimal, but this is the only way I can get this data from the source I'm using. I can have 10% loss on the data without being in trouble, though, so I'm trying to make do. The Text::CSV is for filing the sorted references into an output file--sorry to have included this red herring here!

    Thank you so much for the capturing advice! I didn't know that captures were stored into arrays in this context, but after blocking them everything works. I've never considered putting in named sub-regexes--thanks for the tip.

    This is one of several steps in sorting the messy data--other areas deal with references with no authors. Some of the entries don't have capital initials, which is why I had A-z rather than A-Z. I didn't want to post a ton of lines of code (or examples--there are thousands) when my problem was just one regex operation.

      > Some of the entries don't have capital initials, which is why I had A-z rather than A-Z.

      You may already be aware of this, but the important thing to remember about the A-z character class range is that it also includes the [ \ ] ^ _ ` characters, which sit between Z and a in ASCII, whereas A-Za-z does not.
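A quick way to see this for yourself (a small sketch that only looks at the ASCII range):

```perl
use strict;
use warnings;

# Collect the ASCII characters matched by [A-z] but not by [A-Za-z].
my @extra = grep { /[A-z]/ && !/[A-Za-z]/ } map { chr($_) } 0 .. 127;

print "@extra\n";   # prints: [ \ ] ^ _ `
```

If lowercase initials really do occur, [A-Za-z] (or \p{L} for accented names) is the safer class.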


      Give a man a fish:  <%-(-(-(-<

      Yeah, parsing free-form stuff is hard :) but it's not exactly free-form, it's some kind of citation style, see https://en.wikipedia.org/wiki/Citation#Sciences.2C_mathematics.2C_engineering.2C_physiology.2C_and_medicine ... so maybe something like Biblio::Refbase???
      sub breakup {
          my $filename = $_[0];
          my $text     = $_[1];

          # One alternation of named captures; the first branch that matches wins.
          my $re = qr{
              (?<apair> (?<key>[^:;,\.\s]+) \s* : \s* (?<val>[^:;,\.\s]+) \s* ; )
            | (?<ppair> (?<key> \p{Uppercase_Letter}\w+ ) : (?<val> \p{Uppercase_Letter}+ ) )
            | (?<ppage> pp\.\s*\d+-\d+\. )
            | (?<version3> \d+\.\d+\.\d+ (?::\d+)? )
            | (?<version2> \d+\.\d+:\d+ )
            | (?<year> \( \d+ \) )
            ##| (?<name> (?: [^\s,\.]+, )+ )
            | (?<name> (?: \p{Uppercase_Letter}[.\w]+, )+ )
            ## | (?<title> (?:[^\s;.,\(\)]+\s*)+[.;,] )
            | (?<title> (?:[^\s;.,\(\)]+\s*){3,} [.;,?!]? )
            | (?<comma>,)
            | (?<notcommanotspace>[^\s,]+)
            | (?<space>\s+)
            | (?<other>.)
          }msx;

          my @parts;
          while ( $text =~ m{$re}g ) {
              my $it = { %+ };        # copy the named captures of this match
              next if $it->{space};   # skip pure whitespace tokens
              push @parts, $it;
          }
          return \@parts;
      }
      __END__
      [
        { year => "(1990)" },
        { title => "Methods of the Association of Official Analytical Chemists.", },
        { notcommanotspace => "15." },
        { name => "Ed," },
        { comma => "," },
        { title => "Association of Official Analytical Chemists Washington;", },
        { name => "Dumet," },
        { name => "D.," },
        { name => "Benson," },
        { name => "E.E.," },
        { title => "The use of physical and biochemical studies to elucidate and reduce cryopreservation-induced damage in hydrated/desiccated plant germplasm ", },
        { year => "(2000)" },
        { title => "Cryopreservation of Tropical Plant Germplasm: Current Research Progress and Application,", },
        { ppage => "pp. 43-56." },
        { comma => "," },
        { notcommanotspace => "F." },
        { title => "Engelmann and H." },
        { title => "Takagi " },
        { notcommanotspace => "(eds.)" },
        { apair => "Tsukuba: JIRCAS;", key => "Tsukuba", val => "JIRCAS" },
        { apair => "Rome: IPGRI;", key => "Rome", val => "IPGRI" },
        { name => "Ferreira," },
        { name => "D.F.," },
        { title => "An\x{2B69}ses estat\x{ED34}icas por meio do SISVAR para Windows vers\x{4BE0}4.", },
        { version2 => "0.1:225" },
        { year => "(2000)" },
        { title => "Reuni\x{4BE0}Anual da Regi\x{4BE0}Brasileira da Sociedade Internacional de Biometria,", },
        { comma => "," },
        { name => "S\x{4BE0}Carlos," },
        { title => "SP: UFSCAR" },
      ]