
Hello everyone! I have been working on this code for a while, looked up a bunch of threads here, but the regex is acting in ways I can't seem to find an explanation for.

I have a list of scientific references separated by a semicolon (but also occasionally containing a semicolon), and I want each one to be put into a separate element of an array for further processing.

I want to split on a semicolon, but only if it occurs before a space, followed by a name (these sometimes have hyphens, but I'm not sure I've accounted for that), followed by initials.

Here's a snippet of the problem area:

#!/usr/bin/perl

use strict;
use warnings;
use List::MoreUtils 'first_index';
use Text::CSV;
use utf8;


sub breakup {
    my $filename = $_[0];
    my $text = $_[1];
    if ($text =~ /\;(?=(\s[A-Z]{1}[a-z]+,\s([a-zA-z]\.-?)+\,))/) {
        my @parts = split (/\;(?=(\s[A-Z]{1}[a-z]+,\s([a-zA-z]\.-?)+\,))/,$text);
        foreach (@parts) {
            print $_ . "\n";
        }
    }
}

breakup("scopus0842","(1990) Methods of the Association of Official Analytical Chemists. 15. Ed, , Association of Official Analytical Chemists Washington; Dumet, D., Benson, E.E., The use of physical and biochemical studies to elucidate and reduce cryopreservation-induced damage in hydrated/desiccated plant germplasm (2000) Cryopreservation of Tropical Plant Germplasm: Current Research Progress and Application, pp. 43-56. , F. Engelmann and H. Takagi (eds.) Tsukuba: JIRCAS; Rome: IPGRI; Ferreira, D.F., Análises estatísticas por meio do SISVAR para Windows versão 4.0.1:225 (2000) Reunião Anual da Região Brasileira da Sociedade Internacional de Biometria, , São Carlos, SP: UFSCAR");

What I want the regex to give me:

(1990) Methods of the Association of Official Analytical Chemists. 15. Ed, , Association of Official Analytical Chemists Washington

Dumet, D., Benson, E.E., The use of physical and biochemical studies to elucidate and reduce cryopreservation-induced damage in hydrated/desiccated plant germplasm (2000) Cryopreservation of Tropical Plant Germplasm: Current Research Progress and Application, pp. 43-56. , F. Engelmann and H. Takagi (eds.) Tsukuba: JIRCAS; Rome: IPGRI

Ferreira, D.F., Análises estatísticas por meio do SISVAR para Windows versão 4.0.1:225 (2000) Reunião Anual da Região Brasileira da Sociedade Internacional de Biometria, , São Carlos, SP: UFSCAR

What I actually get:

(1990) Methods of the Association of Official Analytical Chemists. 15. Ed, , Association of Official Analytical Chemists Washington
 Dumet, D.,
D.
Dumet, D., Benson, E.E., The use of physical and biochemical studies to elucidate and reduce cryopreservation-induced damage in hydrated/desiccated plant germplasm (2000) Cryopreservation of Tropical Plant Germplasm: Current Research Progress and Application, pp. 43-56. , F. Engelmann and H. Takagi (eds.) Tsukuba: JIRCAS; Rome: IPGRI
 Ferreira, D.F.,
F.
Ferreira, D.F., Análises estatísticas por meio do SISVAR para Windows versão 4.0.1:225 (2000) Reunião Anual da Região Brasileira da Sociedade Internacional de Biometria, , São Carlos, SP: UFSCAR

How can I rewrite the regex so that it doesn't echo the author and their initial? I also tried to substitute the semicolon with "\n" using the same regex...

my $part = $text =~ s/\;(?=(\s[A-Z]{1}[a-z]+,\s([a-zA-z]\.-?)+\,))/\n/g;
my @parts = split("\n",$part);

...but this returns the number of matching semicolons ("2" in this case). This has been driving me absolutely nuts--any help is appreciated!
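
For reference, a possible explanation and a minimal sketch of a fix, offered under stated assumptions rather than as a definitive answer. Perl's split returns the text matched by every capturing group in the pattern as extra fields, which would account for the echoed " Dumet, D.," and "D." entries; making the groups non-capturing with (?: ... ) avoids that. Separately, $text =~ s/.../.../g evaluates to the number of substitutions made, not the modified string, so the string has to be copied first and the substitution bound to the copy. The character class [a-zA-z] also looks like a typo for [a-zA-Z]. In the sketch below, the names split_refs, newline_refs and $AUTHOR_AHEAD are hypothetical, and two choices are assumptions: hyphenated surnames such as "Doe-Ray" are allowed, and the whitespace after the semicolon is consumed so the pieces have no leading space.

```perl
use strict;
use warnings;

# Hypothetical pattern for what must follow the semicolon: "Surname, X.Y.,"
# Every group is non-capturing (?: ... ), so split adds no extra fields.
# Assumption: surnames may be hyphenated, e.g. "Doe-Ray".
my $AUTHOR_AHEAD = qr/[A-Z][a-z]+(?:-[A-Z][a-z]+)*,\s(?:[A-Za-z]\.-?)+,/;

# Split on ";" plus following whitespace, but only when an author follows.
sub split_refs {
    my ($text) = @_;
    return split /;\s+(?=$AUTHOR_AHEAD)/, $text;
}

# Same condition, but replace the separator with a newline.
# The parentheses bind s/// to the copy, so $copy holds the new string
# instead of the substitution count.
sub newline_refs {
    my ($text) = @_;
    (my $copy = $text) =~ s/;\s+(?=$AUTHOR_AHEAD)/\n/g;
    return $copy;
}

# Abridged ASCII sample built from the reference list in the post.
my $sample = "Chemists Washington; Dumet, D., Benson, E.E., The use "
           . "(2000) Tsukuba: JIRCAS; Rome: IPGRI; Ferreira, D.F., SISVAR";
print "$_\n" for split_refs($sample);
```

The "; Rome:" semicolon is left alone because "Rome" is followed by a colon rather than a comma and initials. On Perl 5.14 or later, the /r modifier (s/.../\n/gr) is another way to get the modified string back without touching the original.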

In reply to Regex for splitting a string on a semicolon (conditionally) by grouse
