Hello everyone! I have been working on this code for a while, looked up a bunch of threads here, but the regex is acting in ways I can't seem to find an explanation for.

I have a list of scientific references separated by a semicolon (but also occasionally containing a semicolon), and I want each one to be put into a separate element of an array for further processing.

I want to split on a semicolon, but only if it occurs before a space, followed by a name (these sometimes have hyphens, but I'm not sure I've accounted for that), followed by initials.

Here's a snippet of the problem area:

#!/usr/bin/perl
use strict;
use warnings;
use List::MoreUtils 'first_index';
use Text::CSV;
use utf8;

sub breakup {
    my $filename = $_[0];
    my $text = $_[1];
    if ($text =~ /\;(?=(\s[A-Z]{1}[a-z]+,\s([a-zA-z]\.-?)+\,))/) {
        my @parts = split (/\;(?=(\s[A-Z]{1}[a-z]+,\s([a-zA-z]\.-?)+\,))/,$text);
        foreach (@parts) {
            print $_ . "\n";
        }
    }
}

breakup("scopus0842","(1990) Methods of the Association of Official Analytical Chemists. 15. Ed, , Association of Official Analytical Chemists Washington; Dumet, D., Benson, E.E., The use of physical and biochemical studies to elucidate and reduce cryopreservation-induced damage in hydrated/desiccated plant germplasm (2000) Cryopreservation of Tropical Plant Germplasm: Current Research Progress and Application, pp. 43-56. , F. Engelmann and H. Takagi (eds.) Tsukuba: JIRCAS; Rome: IPGRI; Ferreira, D.F., Análises estatísticas por meio do SISVAR para Windows versão 4.0.1:225 (2000) Reunião Anual da Região Brasileira da Sociedade Internacional de Biometria, , São Carlos, SP: UFSCAR");

What I want the regex to give me:

(1990) Methods of the Association of Official Analytical Chemists. 15. Ed, , Association of Official Analytical Chemists Washington
Dumet, D., Benson, E.E., The use of physical and biochemical studies to elucidate and reduce cryopreservation-induced damage in hydrated/desiccated plant germplasm (2000) Cryopreservation of Tropical Plant Germplasm: Current Research Progress and Application, pp. 43-56. , F. Engelmann and H. Takagi (eds.) Tsukuba: JIRCAS; Rome: IPGRI
Ferreira, D.F., Análises estatísticas por meio do SISVAR para Windows versão 4.0.1:225 (2000) Reunião Anual da Região Brasileira da Sociedade Internacional de Biometria, , São Carlos, SP: UFSCAR

What I actually get:

(1990) Methods of the Association of Official Analytical Chemists. 15. Ed, , Association of Official Analytical Chemists Washington
Dumet, D.,
D.
Dumet, D., Benson, E.E., The use of physical and biochemical studies to elucidate and reduce cryopreservation-induced damage in hydrated/desiccated plant germplasm (2000) Cryopreservation of Tropical Plant Germplasm: Current Research Progress and Application, pp. 43-56. , F. Engelmann and H. Takagi (eds.) Tsukuba: JIRCAS; Rome: IPGRI
Ferreira, D.F.,
F.
Ferreira, D.F., Análises estatísticas por meio do SISVAR para Windows versão 4.0.1:225 (2000) Reunião Anual da Região Brasileira da Sociedade Internacional de Biometria, , São Carlos, SP: UFSCAR

How can I rewrite the regex so that it doesn't echo the author and their initial? I also tried to substitute the semicolon with "\n" using the same regex...

my $part = $text =~ s/\;(?=(\s[A-Z]{1}[a-z]+,\s([a-zA-z]\.-?)+\,))/\n/g;
my @parts = split("\n",$part);

...but this returns the number of matching semicolons ("2" in this case). This has been driving me absolutely nuts. Any help is appreciated!
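For anyone comparing notes, here is a minimal sketch of one possible fix, not a definitive one. It relies on two documented Perl behaviors: split also returns whatever the separator pattern captures, which would explain the echoed author and initial, and s/// returns the number of substitutions rather than the modified string. The sketch uses non-capturing (?:...) groups, consumes the space after the semicolon, and copies the string before substituting. The names $author_ahead, split_references, join_with_newlines, and the shortened $sample are hypothetical. The hyphenated-surname part and the change from [a-zA-z] to [A-Za-z] are assumptions about what was intended.

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Separator: a semicolon and the whitespace after it, but only when an
# author such as "Dumet, D.," follows. Every group is non-capturing, so
# split has nothing extra to return.
my $author_ahead = qr{
    ;\s+                      # the separator itself (consumed)
    (?=                       # lookahead, not consumed:
        [A-Z][a-z]+           #   capitalized surname
        (?:-[A-Z][a-z]+)*     #   optional hyphenated parts (assumption)
        ,\s
        (?:[A-Za-z]\.-?)+     #   initials; [A-Za-z] instead of [a-zA-z]
        ,
    )
}x;

# Return one list element per reference.
sub split_references {
    my ($text) = @_;
    return split $author_ahead, $text;
}

# Alternative: replace each separator with a newline. The substitution
# runs on a copy, so the modified string is returned rather than the
# count that s/// itself evaluates to.
sub join_with_newlines {
    my ($text) = @_;
    (my $copy = $text) =~ s/$author_ahead/\n/g;
    return $copy;
}

# Shortened sample built from the references in the post (hypothetical).
my $sample = 'Analytical Chemists Washington; Dumet, D., Benson, E.E., '
           . 'The use of physical and biochemical studies. Tsukuba: JIRCAS; '
           . 'Rome: IPGRI; Ferreira, D.F., Análises estatísticas, '
           . 'São Carlos, SP: UFSCAR';

print "$_\n" for split_references($sample);
```

On Perl 5.14 or later, the /r modifier (s/.../\n/gr) is another way to get the modified string back instead of the count. The semicolon in "JIRCAS; Rome: IPGRI" is left alone because "Rome" is followed by a colon rather than a comma and initials.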


In reply to Regex for splitting a string on a semicolon (conditionally) by grouse
