Regex for splitting a string on a semicolon (conditionally)

grouse has asked for the wisdom of the Perl Monks concerning the following question:

Hello everyone! I have been working on this code for a while, looked up a bunch of threads here, but the regex is acting in ways I can't seem to find an explanation for.

I have a list of scientific references separated by a semicolon (but also occasionally containing a semicolon), and I want each one to be put into a separate element of an array for further processing.

I want to split on a semicolon, but only if it occurs before a space, followed by a name (these sometimes have hyphens, but I'm not sure I've accounted for that), followed by initials.

Here's a snippet of the problem area:

#!/usr/bin/perl

use strict;
use warnings;
use List::MoreUtils 'first_index';
use Text::CSV;
use utf8;


sub breakup {
    my $filename = $_[0];
    my $text = $_[1];
    if ($text =~ /\;(?=(\s[A-Z]{1}[a-z]+,\s([a-zA-z]\.-?)+\,))/) {
        my @parts = split (/\;(?=(\s[A-Z]{1}[a-z]+,\s([a-zA-z]\.-?)+\,))/,$text);
        foreach (@parts) {
            print $_ . "\n";
        }
    }
}

breakup("scopus0842","(1990) Methods of the Association of Official Analytical Chemists. 15. Ed, , Association of Official Analytical Chemists Washington; Dumet, D., Benson, E.E., The use of physical and biochemical studies to elucidate and reduce cryopreservation-induced damage in hydrated/desiccated plant germplasm (2000) Cryopreservation of Tropical Plant Germplasm: Current Research Progress and Application, pp. 43-56. , F. Engelmann and H. Takagi (eds.) Tsukuba: JIRCAS; Rome: IPGRI; Ferreira, D.F., Análises estatísticas por meio do SISVAR para Windows versão 4.0.1:225 (2000) Reunião Anual da Região Brasileira da Sociedade Internacional de Biometria, , São Carlos, SP: UFSCAR");

What I want the regex to give me:

(1990) Methods of the Association of Official Analytical Chemists. 15. Ed, , Association of Official Analytical Chemists Washington

Dumet, D., Benson, E.E., The use of physical and biochemical studies to elucidate and reduce cryopreservation-induced damage in hydrated/desiccated plant germplasm (2000) Cryopreservation of Tropical Plant Germplasm: Current Research Progress and Application, pp. 43-56. , F. Engelmann and H. Takagi (eds.) Tsukuba: JIRCAS; Rome: IPGRI

Ferreira, D.F., Análises estatísticas por meio do SISVAR para Windows versão 4.0.1:225 (2000) Reunião Anual da Região Brasileira da Sociedade Internacional de Biometria, , São Carlos, SP: UFSCAR

What I actually get

(1990) Methods of the Association of Official Analytical Chemists. 15. Ed, , Association of Official Analytical Chemists Washington
 Dumet, D.,
D.
Dumet, D., Benson, E.E., The use of physical and biochemical studies to elucidate and reduce cryopreservation-induced damage in hydrated/desiccated plant germplasm (2000) Cryopreservation of Tropical Plant Germplasm: Current Research Progress and Application, pp. 43-56. , F. Engelmann and H. Takagi (eds.) Tsukuba: JIRCAS; Rome: IPGRI
 Ferreira, D.F.,
F.
Ferreira, D.F., Análises estatísticas por meio do SISVAR para Windows versão 4.0.1:225 (2000) Reunião Anual da Região Brasileira da Sociedade Internacional de Biometria, , São Carlos, SP: UFSCAR

How can I rewrite the regex so that it doesn't echo the author and their initial? I also tried to substitute the semicolon with "\n" using the same regex...

my $part = $text =~ s/\;(?=(\s[A-Z]{1}[a-z]+,\s([a-zA-z]\.-?)+\,))/\n/g;
my @parts = split("\n",$part);

...but this returns the number of matching semicolons ("2" in this case). This has been driving me absolutely nuts--any help is appreciated!


Replies are listed 'Best First'.
Re: Regex for splitting a string on a semicolon (conditionally) by LanX (Saint) on Feb 12, 2015 at 04:24 UTC
This format looks rather messed up. I doubt that your heuristics will be 100% correct, or that Text::CSV can help you here.

Your problem, though, is that `split` also returns matched `(...)` groupings. Disable capturing groupings with `(?:...)`; see `perlretut#Non-capturing-groupings`.

Update:

> followed by a name (these sometimes have hyphens, but I'm not sure I've accounted for that), followed by initials.

I strongly recommend recomposing with sub-regexes like `$name` and `$initial`, plus the `/x` flag, for readability and maintainability. You'd quickly discover some of the bugs I was able to spot (hint: initials are capital letters and `A-z` is not what you want, the first entry doesn't start with a name but with a date, and so on).

Good luck!

Cheers Rolf

PS: Je suis Charlie!
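To illustrate the point about `split` and capturing groups, here is a minimal sketch. The input string is a shortened, made-up stand-in for the poster's data, and `[a-zA-z]` has been tightened to `[A-Za-z]`; the only other difference between the two patterns is `(...)` versus `(?:...)`.

```perl
use strict;
use warnings;

# Hypothetical, shortened input; not the poster's real data.
my $text = 'First ref; Dumet, D., Second ref; Rome: IPGRI; Ferreira, D.F., Third ref';

# Capturing groups inside the lookahead: split returns their contents
# as extra fields between the real pieces.
my @with_captures = split /;(?=(\s[A-Z][a-z]+,\s([A-Za-z]\.-?)+,))/, $text;

# Same pattern with non-capturing groups: only the pieces come back.
my @parts = split /;(?=(?:\s[A-Z][a-z]+,\s(?:[A-Za-z]\.-?)+,))/, $text;

print scalar(@with_captures), " fields with capturing groups\n";      # 7
print scalar(@parts),         " fields with non-capturing groups\n";  # 3
print "[$_]\n" for @parts;
```

Note that each piece after the first keeps its leading space, because the space sits inside the lookahead and is not consumed by the split.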
Re^2: Regex for splitting a string on a semicolon (conditionally) by grouse (Initiate) on Feb 12, 2015 at 09:03 UTC
The format is definitely not optimal, but this is the only way I can get this data from the source I'm using. I can have 10% loss on the data without being in trouble, though, so I'm trying to make do. The Text::CSV is for filing the sorted references into an output file--sorry to have included this red herring here!

Thank you so much for the capturing advice! I didn't know that captures were stored into arrays in this context, but after blocking them everything works. I've never considered putting in named sub-regexes--thanks for the tip.

This is one of several steps in sorting the messy data--other areas deal with references with no authors. Some of the entries don't have capital initials, which is why I had `A-z` rather than `A-Z`. I didn't want to post a ton of lines of code (or examples--there are thousands) when my problem was just one regex operation.
Re^3: Regex for splitting a string on a semicolon (conditionally) by AnomalousMonk (Archbishop) on Feb 12, 2015 at 18:26 UTC
> Some of the entries don't have capital initials, which is why I had `A-z` rather than `A-Z`.

You may already be aware of this, but the important thing to remember about the `A-z` character class range is that it includes the [ \ ] ^ _ ` characters as well, and `A-Za-z` does not.

Give a man a fish: `<%-(-(-(-<`
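The difference between the two ranges can be checked directly. This small sketch walks the ASCII range and lists what `[A-z]` matches beyond the letters; the variable names are only for illustration.

```perl
use strict;
use warnings;

# Collect every ASCII character matched by each character class.
my @wide   = grep { /[A-z]/ }    map { chr($_) } 0 .. 127;
my @narrow = grep { /[A-Za-z]/ } map { chr($_) } 0 .. 127;

# Anything in @wide that is not a plain letter is an unintended match.
my %is_letter = map { $_ => 1 } @narrow;
my @extras    = grep { !$is_letter{$_} } @wide;

print "Extra characters matched by [A-z]: @extras\n";
```

The six extras are the characters that sit between `Z` and `a` in ASCII.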
Re^3: Regex for splitting a string on a semicolon (conditionally) by Anonymous Monk on Feb 12, 2015 at 09:19 UTC
Yeah, parsing free form stuff is hard :) but it's not exactly freeform, it's some kind of https://en.wikipedia.org/wiki/Citation#Sciences.2C_mathematics.2C_engineering.2C_physiology.2C_and_medicine ... so something something Biblio::Refbase???

sub breakup {
    my $filename = $_[0];
    my $text = $_[1];
    my $re = qr{
        (?<apair> (?<key>[^:;,\.\s]+) \s* : \s* (?<val>[^:;,\.\s]+) \s* ; )
      | (?<ppair> (?<key> \p{Uppercase_Letter}w+ ) : (?<val> \p{Uppercase_Letter}+ ) )
      | (?<ppage> pp\.\s\d+-\d+\. )
      | (?<version3> \d+\.\d+\.\d+ (?::\d+)? )
      | (?<version2> \d+\.\d+:\d+ )
      | (?<year> \( \d+ \) )
      ##| (?<name> (?: [^\s,\.]+, )+ )
      | (?<name> (?: \p{Uppercase_Letter}[.\w]+, )+ )
      ## | (?<title> (?:[^\s;.,]+\s)+[.;,] )
      | (?<title> (?:[^\s;.,]+\s*){3,} [.;,?!]? )
      | (?<comma>,)
      | (?<notcommanotspace>[^\s,]+)
      | (?<space>\s+)
      | (?<other>.)
    }msx;
    my @parts;
    while( $text =~ m{$re}g ){
        my $it = { %+ };
        next if $it->{space};
        push @parts, $it;
    }
    return \@parts ;
}
__END__
[
  { year => "(1990)" },
  {
    title => "Methods of the Association of Official Analytical Chemists.",
  },
  { notcommanotspace => "15." },
  { name => "Ed," },
  { comma => "," },
  {
    title => "Association of Official Analytical Chemists Washington;",
  },
  { name => "Dumet," },
  { name => "D.," },
  { name => "Benson," },
  { name => "E.E.," },
  {
    title => "The use of physical and biochemical studies to elucidate and reduce cryopreservation-induced damage in hydrated/desiccated plant germplasm ",
  },
  { year => "(2000)" },
  {
    title => "Cryopreservation of Tropical Plant Germplasm: Current Research Progress and Application,",
  },
  { ppage => "pp. 43-56." },
  { comma => "," },
  { notcommanotspace => "F." },
  { title => "Engelmann and H." },
  { title => "Takagi " },
  { notcommanotspace => "(eds.)" },
  { apair => "Tsukuba: JIRCAS;", key => "Tsukuba", val => "JIRCAS" },
  { apair => "Rome: IPGRI;", key => "Rome", val => "IPGRI" },
  { name => "Ferreira," },
  { name => "D.F.," },
  {
    title => "An\x{2B69}ses estat\x{ED34}icas por meio do SISVAR para Windows vers\x{4BE0}4.",
  },
  { version2 => "0.1:225" },
  { year => "(2000)" },
  {
    title => "Reuni\x{4BE0}Anual da Regi\x{4BE0}Brasileira da Sociedade Internacional de Biometria,",
  },
  { comma => "," },
  { name => "S\x{4BE0}Carlos," },
  { title => "SP: UFSCAR" },
]
Re: Regex for splitting a string on a semicolon (conditionally) by AnomalousMonk (Archbishop) on Feb 12, 2015 at 05:14 UTC
I strongly second the point made in the reply of LanX that you need to re-factor the unholy `split` regex you are attempting to use into something that is both readable and independently testable. (This assumes that Text::CSV will be of little help, which I also suspect is true.)

Ideally, you will wind up with something like `my @parts = split /;(?=\s$name)/, $text;` (or `qr{ ; (?= \s $name) }xms` as I would rather write the regex). This leaves you with the still substantial problem of defining `my $name = qr{ ... }xms;` which will not be easy, but at least you will have an independently testable pattern at which you can throw "names" galore, as many as you can possibly think of, in order to test its efficacy.

Give a man a fish: `<%-(-(-(-<`
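One possible shape for that refactoring is sketched below. The definitions of `$initial`, `$surname` and `$name` are assumptions made up for illustration (capitalised ASCII surnames, optional hyphenated parts, capital initials); the real data would almost certainly need looser rules, for example lowercase initials or accented letters.

```perl
use strict;
use warnings;

# Hypothetical building blocks, each small enough to test on its own.
my $initial = qr{ [A-Z] \. -? }xms;                        # "D." or "J.-"
my $surname = qr{ [A-Z][a-z]+ (?: - [A-Z][a-z]+ )* }xms;   # "Dumet", "Smith-Jones"
my $name    = qr{ $surname , \s $initial+ , }xms;          # "Dumet, D.,"

# Throw candidate "names" at the pattern before using it in split.
for my $candidate ('Dumet, D.,', 'Smith-Jones, J.-P.,', 'Rome: IPGRI') {
    my $verdict = $candidate =~ m{ \A $name \z }xms ? 'name' : 'not a name';
    print "$candidate => $verdict\n";
}

# Made-up input in the same style as the original post.
my $text  = 'First ref; Smith-Jones, J.-P., Second ref; Rome: IPGRI; Ferreira, D.F., Third ref';
my @parts = split qr{ ; (?= \s $name ) }xms, $text;
print "[$_]\n" for @parts;
```

Because the sub-patterns are compiled with `qr`, each one is interpolated as its own non-capturing group, so `$initial+` repeats the whole initial and `split` returns no extra fields.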
Re: Regex for splitting a string on a semicolon (conditionally) by LanX (Saint) on Feb 12, 2015 at 04:51 UTC
> `my $part = $text =~ s/\;(?=(\s[A-Z]{1}[a-z]+,\s([a-zA-z]\.-?)+\,))/\n/g;`
>
> ...but this returns the number of matching semicolons ("2" in this case). This has been driving me absolutely nuts--any help is appreciated!

`$text =~ s/\;(?=(\s[A-Z]{1}[a-z]+,\s([a-zA-z]\.-?)+\,))/\n/g;` does the substitution in `$text` itself and returns the number of substitutions made, not the modified string. Drop `$part` and do `my @parts = split("\n",$text);`

Cheers Rolf

PS: Je suis Charlie!
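The return value of `s///g` is easy to see on a tiny example. This sketch uses a made-up three-field string and a plain `;` pattern rather than the poster's regex; the `/r` modifier shown as an alternative requires Perl 5.14 or later.

```perl
use strict;
use warnings;

my $text = 'a;b;c';   # hypothetical input

# s///g edits the variable in place and returns the substitution count.
my $copy  = $text;
my $count = ($copy =~ s/;/\n/g);
print "count: $count\n";             # count: 2

# The pieces are in the modified variable, not in the return value.
my @parts = split /\n/, $copy;
print scalar(@parts), " parts\n";    # 3 parts

# With /r the original is left alone and the new string is returned.
my $replaced = $text =~ s/;/\n/gr;
print "unchanged: $text\n";          # unchanged: a;b;c
```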
Re^2: Regex for splitting a string on a semicolon (conditionally) by grouse (Initiate) on Feb 12, 2015 at 09:16 UTC
Thank you so much. I feel somewhat embarrassed for missing that one! Making the change makes everything work as intended, thanks!
Re: Regex for splitting a string on a semicolon (conditionally) by Ea (Chaplain) on Feb 12, 2015 at 10:18 UTC
Having found this out last summer, extracting citations from papers reliably is hard. If you're getting the references direct (no pre-cleaning), then after you solve the semi-colon problem you're likely to find about 1-5% of your citations fail on something else because of how an author has formatted the reference in an intelligible, but machine-confounding, fashion (or mistyped or just missed a bit). You end up with regex after regex, just to catch all the perversities.

The ADS project has likely had to solve this problem. You may be able to find some of their utilities on github.

Out of almost 3000 papers, I had to find references for about 50 by hand using the search interface and some intuition. The most innocent unfindable reference was in this paper where A. Einstein, Sit. Preus. Akad. (1919). would have been identifiable if Albert hadn't gone and published 2 papers in ~~that journal~~ that year.

Update: I found it! Six months and no looming deadline has really improved my German.

Sometimes I can think of 6 impossible LDAP attributes before breakfast.
Re: Regex for splitting a string on a semicolon (conditionally) (use Text::CSV;) by Anonymous Monk on Feb 12, 2015 at 03:46 UTC
> ... use Text::CSV; ... sub breakup { ...

Why aren't you using Text::CSV inside `sub breakup {`? Including "use Text::CSV;" isn't some magic spell that will avoid the suggestion that you actually use it.
Re^2: Regex for splitting a string on a semicolon (conditionally) (use Text::CSV;) by grouse (Initiate) on Feb 12, 2015 at 08:43 UTC
I'm using it in the larger context of my code to output the data (once broken up into individual references, then into author, year, etc columns) into a .csv file. It has nothing to do with this subroutine--I just copied the beginning of the script verbatim. Sorry for the confusion!
Re^3: Regex for splitting a string on a semicolon (conditionally) by Anonymous Monk on Feb 12, 2015 at 08:59 UTC
Ok :) here you go, "Rome: IPGRI" gets separated from its owner, then it's put back

sub breakup {
    my $filename = $_[0];
    my $text = $_[1];
    my @stuff = split /;\s?/, $text;
    my @parts ;
    while( @stuff ){
        if( $stuff[0] =~ m{ ^ \w+ : \s \w+ $ }xms ){
            $parts[-1] .= ";".$stuff[0];
        } else {
            push @parts, $stuff[0];
        }
        shift @stuff;
    }
    return \@parts;
}
Re: Regex for splitting a string on a semicolon (conditionally) by Anonymous Monk on Feb 12, 2015 at 04:47 UTC
Why is your text in Latin-1? Is that Perlmonks or is it really Latin-1? You `use utf8`, do you get warnings about malformed characters?
Re^2: Regex for splitting a string on a semicolon (conditionally) by Anonymous Monk on Feb 12, 2015 at 07:18 UTC
Re^4: Alphabetize in Esperanto ( perlmonks faq doesn't unicode or utf8 or utf-8 it only latin1 or windows-1252 or something like that )
