grouse has asked for the wisdom of the Perl Monks concerning the following question:
Hello everyone! I have been working on this code for a while, looked up a bunch of threads here, but the regex is acting in ways I can't seem to find an explanation for.
I have a list of scientific references separated by a semicolon (but also occasionally containing a semicolon), and I want each one to be put into a separate element of an array for further processing.
I want to split on a semicolon, but only if it occurs before a space, followed by a name (these sometimes have hyphens, but I'm not sure I've accounted for that), followed by initials.
Here's a snippet of the problem area:
#!/usr/bin/perl
use strict;
use warnings;
use List::MoreUtils 'first_index';
use Text::CSV;
use utf8;

sub breakup {
    my $filename = $_[0];
    my $text = $_[1];
    if ($text =~ /\;(?=(\s[A-Z]{1}[a-z]+,\s([a-zA-z]\.-?)+\,))/) {
        my @parts = split (/\;(?=(\s[A-Z]{1}[a-z]+,\s([a-zA-z]\.-?)+\,))/,$text);
        foreach (@parts) {
            print $_ . "\n";
        }
    }
}

breakup("scopus0842","(1990) Methods of the Association of Official Analytical Chemists. 15. Ed, , Association of Official Analytical Chemists Washington; Dumet, D., Benson, E.E., The use of physical and biochemical studies to elucidate and reduce cryopreservation-induced damage in hydrated/desiccated plant germplasm (2000) Cryopreservation of Tropical Plant Germplasm: Current Research Progress and Application, pp. 43-56. , F. Engelmann and H. Takagi (eds.) Tsukuba: JIRCAS; Rome: IPGRI; Ferreira, D.F., Análises estatísticas por meio do SISVAR para Windows versão 4.0.1:225 (2000) Reunião Anual da Região Brasileira da Sociedade Internacional de Biometria, , São Carlos, SP: UFSCAR");
What I want the regex to give me:
(1990) Methods of the Association of Official Analytical Chemists. 15. Ed, , Association of Official Analytical Chemists Washington
Dumet, D., Benson, E.E., The use of physical and biochemical studies to elucidate and reduce cryopreservation-induced damage in hydrated/desiccated plant germplasm (2000) Cryopreservation of Tropical Plant Germplasm: Current Research Progress and Application, pp. 43-56. , F. Engelmann and H. Takagi (eds.) Tsukuba: JIRCAS; Rome: IPGRI
Ferreira, D.F., Análises estatísticas por meio do SISVAR para Windows versão 4.0.1:225 (2000) Reunião Anual da Região Brasileira da Sociedade Internacional de Biometria, , São Carlos, SP: UFSCAR
What I actually get:
(1990) Methods of the Association of Official Analytical Chemists. 15. Ed, , Association of Official Analytical Chemists Washington
Dumet, D.,
D.
Dumet, D., Benson, E.E., The use of physical and biochemical studies to elucidate and reduce cryopreservation-induced damage in hydrated/desiccated plant germplasm (2000) Cryopreservation of Tropical Plant Germplasm: Current Research Progress and Application, pp. 43-56. , F. Engelmann and H. Takagi (eds.) Tsukuba: JIRCAS; Rome: IPGRI
Ferreira, D.F.,
F.
Ferreira, D.F., Análises estatísticas por meio do SISVAR para Windows versão 4.0.1:225 (2000) Reunião Anual da Região Brasileira da Sociedade Internacional de Biometria, , São Carlos, SP: UFSCAR
How can I rewrite the regex so that it doesn't echo the author and their initial? I also tried to substitute the semicolon with "\n" using the same regex...
my $part = $text =~ s/\;(?=(\s[A-Z]{1}[a-z]+,\s([a-zA-z]\.-?)+\,))/\n/g;
my @parts = split("\n",$part);
...but this returns the number of matching semicolons ("2" in this case). This has been driving me absolutely nuts--any help is appreciated!
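For reference, both symptoms match documented Perl behaviour: split returns an extra field for every capturing group in its pattern, and s/// returns the number of substitutions rather than the changed string. The sketch below is one possible way to express the rule described above, not a definitive fix. The helper name split_references, the shortened sample string, and the exact surname and initials shapes in the pattern (including the hyphenated-surname branch) are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Separator: a semicolon plus one whitespace character, but only when what
# follows looks like "Surname, I.,". The surname/initials shapes are assumed.
# (?:...) groups without capturing, so split returns no extra fields.
my $separator = qr/;\s(?=[A-Z][a-z]+(?:-[A-Z][a-z]+)*,\s(?:[A-Za-z]\.-?)+,)/;

# Hypothetical helper: returns one list element per reference.
sub split_references {
    my ($text) = @_;
    return split $separator, $text;
}

# Shortened, ASCII-only stand-in for the reference list in the post.
my $sample = '(1990) Methods, , Washington; Dumet, D., Benson, E.E., '
           . 'The use of physical studies (2000), Tsukuba: JIRCAS; '
           . 'Rome: IPGRI; Ferreira, D.F., SISVAR (2000)';

my @parts = split_references($sample);
print "$_\n" for @parts;

# Substitution variant: s/// returns the substitution count, so the
# replacement is applied to a copy and the copy is split afterwards.
# (On Perl 5.14 or later, the /r flag returns the modified string instead.)
(my $marked = $sample) =~ s/$separator/\n/g;
my @via_subst = split /\n/, $marked;
```

On this sample both variants yield three elements, and the semicolon in "JIRCAS; Rome: IPGRI" is left alone because "Rome" is followed by a colon rather than a comma and initials. The character class is written as [A-Za-z] here; [a-zA-z] also matches the punctuation characters that sit between "Z" and "a" in ASCII.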