grouse has asked for the wisdom of the Perl Monks concerning the following question:

Hello everyone! I have been working on this code for a while, looked up a bunch of threads here, but the regex is acting in ways I can't seem to find an explanation for.

I have a list of scientific references separated by a semicolon (but also occasionally containing a semicolon), and I want each one to be put into a separate element of an array for further processing.

I want to split on a semicolon, but only if it occurs before a space, followed by a name (these sometimes have hyphens, but I'm not sure I've accounted for that), followed by initials.

Here's a snippet of the problem area:

#!/usr/bin/perl
use strict;
use warnings;
use List::MoreUtils 'first_index';
use Text::CSV;
use utf8;

sub breakup {
    my $filename = $_[0];
    my $text = $_[1];
    if ($text =~ /\;(?=(\s[A-Z]{1}[a-z]+,\s([a-zA-z]\.-?)+\,))/) {
        my @parts = split (/\;(?=(\s[A-Z]{1}[a-z]+,\s([a-zA-z]\.-?)+\,))/,$text);
        foreach (@parts) {
            print $_ . "\n";
        }
    }
}

breakup("scopus0842","(1990) Methods of the Association of Official Analytical Chemists. 15. Ed, , Association of Official Analytical Chemists Washington; Dumet, D., Benson, E.E., The use of physical and biochemical studies to elucidate and reduce cryopreservation-induced damage in hydrated/desiccated plant germplasm (2000) Cryopreservation of Tropical Plant Germplasm: Current Research Progress and Application, pp. 43-56. , F. Engelmann and H. Takagi (eds.) Tsukuba: JIRCAS; Rome: IPGRI; Ferreira, D.F., Análises estatísticas por meio do SISVAR para Windows versão 4.0.1:225 (2000) Reunião Anual da Região Brasileira da Sociedade Internacional de Biometria, , São Carlos, SP: UFSCAR");

What I want the regex to give me:

(1990) Methods of the Association of Official Analytical Chemists. 15. Ed, , Association of Official Analytical Chemists Washington
Dumet, D., Benson, E.E., The use of physical and biochemical studies to elucidate and reduce cryopreservation-induced damage in hydrated/desiccated plant germplasm (2000) Cryopreservation of Tropical Plant Germplasm: Current Research Progress and Application, pp. 43-56. , F. Engelmann and H. Takagi (eds.) Tsukuba: JIRCAS; Rome: IPGRI
Ferreira, D.F., Análises estatísticas por meio do SISVAR para Windows versão 4.0.1:225 (2000) Reunião Anual da Região Brasileira da Sociedade Internacional de Biometria, , São Carlos, SP: UFSCAR

What I actually get:

(1990) Methods of the Association of Official Analytical Chemists. 15. Ed, , Association of Official Analytical Chemists Washington
Dumet, D.,
D.
Dumet, D., Benson, E.E., The use of physical and biochemical studies to elucidate and reduce cryopreservation-induced damage in hydrated/desiccated plant germplasm (2000) Cryopreservation of Tropical Plant Germplasm: Current Research Progress and Application, pp. 43-56. , F. Engelmann and H. Takagi (eds.) Tsukuba: JIRCAS; Rome: IPGRI
Ferreira, D.F.,
F.
Ferreira, D.F., Análises estatísticas por meio do SISVAR para Windows versão 4.0.1:225 (2000) Reunião Anual da Região Brasileira da Sociedade Internacional de Biometria, , São Carlos, SP: UFSCAR

How can I rewrite the regex so that it doesn't echo the author and their initial? I also tried to substitute the semicolon with "\n" using the same regex...

my $part = $text =~ s/\;(?=(\s[A-Z]{1}[a-z]+,\s([a-zA-z]\.-?)+\,))/\n/g;
my @parts = split("\n",$part);

...but this returns the number of matching semicolons ("2" in this case). This has been driving me absolutely nuts--any help is appreciated!

Replies are listed 'Best First'.
Re: Regex for splitting a string on a semicolon (conditionally)
by LanX (Saint) on Feb 12, 2015 at 04:24 UTC
    This format looks rather messed up. I doubt that your heuristics will be 100% correct, or that Text::CSV can help you here.

    Your actual problem, though, is that split also returns whatever the capturing (...) groups matched. Disable capturing by using (?:...) groups instead.

    see perlretut#Non-capturing-groupings
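    To make the effect concrete, here is a minimal sketch. The sample string is a shortened, made-up stand-in for the real data, and the pattern uses [a-zA-Z] rather than A-z; otherwise it is the same shape as the regex in the question.

```perl
use strict;
use warnings;

# Shortened, made-up sample in the same shape as the question's data.
my $text = 'First ref, Washington; Dumet, D., Benson, E.E., Some title; Rome: IPGRI; Ferreira, D.F., Other title';

# Capturing groups inside the pattern: split returns the captures as extra fields.
my @with_captures = split /;(?=(\s[A-Z][a-z]+,\s([a-zA-Z]\.-?)+,))/, $text;

# Same pattern with (?:...) groups: only the pieces between separators come back.
my @without_captures = split /;(?=(?:\s[A-Z][a-z]+,\s(?:[a-zA-Z]\.-?)+,))/, $text;

print scalar(@with_captures), " fields with capturing groups\n";
print scalar(@without_captures), " fields with non-capturing groups\n";
print "$_\n" for @without_captures;
```

    With this sample the first split yields 7 fields (3 pieces plus 2 captures for each of the 2 separators) and the second yields the 3 pieces only. The semicolon before "Rome: IPGRI" is left alone in both cases because no "Name, X.," follows it.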

    update

    > followed by a name (these sometimes have hyphens, but I'm not sure I've accounted for that), followed by initials.

    I strongly recommend recomposing the pattern from sub-regexes like $name and $initial, plus the /x flag, for readability and maintainability.

    You'd quickly discover some of the bugs I was able to spot (hint: initials are capital letters and A-z is not what you want; the first entry doesn't start with a name but with a date; and so on).

    Good luck!

    Cheers Rolf

    PS: Je suis Charlie!

      The format is definitely not optimal, but this is the only way I can get this data from the source I'm using. I can have 10% loss on the data without being in trouble, though, so I'm trying to make do. The Text::CSV is for filing the sorted references into an output file--sorry to have included this red herring here!

      Thank you so much for the capturing advice! I didn't know that captures were stored into arrays in this context, but after blocking them everything works. I've never considered putting in named sub-regexes--thanks for the tip.

      This is one of several steps in sorting the messy data--other areas deal with references with no authors. Some of the entries don't have capital initials, which is why I had A-z rather than A-Z. I didn't want to post a ton of lines of code (or examples--there are thousands) when my problem was just one regex operation.

        Some of the entries don't have capital initials, which is why I had A-z rather than A-Z.

        You may already be aware of this, but the important thing to remember about the A-z character class range is that it includes the [ \ ] ^ _ ` characters as well, and A-Za-z does not.
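        A quick way to check this for yourself is a small sketch that lists every ASCII character accepted by [A-z] but rejected by [A-Za-z]:

```perl
use strict;
use warnings;

# Every ASCII character matched by [A-z] but not by [A-Za-z].
my @extra = grep { /^[A-z]$/ && !/^[A-Za-z]$/ } map { chr($_) } 0 .. 127;

print join(' ', @extra), "\n";

# Consequence for the pattern in the question: "_." is accepted as an initial.
print "'_.' counts as an initial\n" if '_.' =~ /^[a-zA-z]\.$/;
```

        The six characters sit between "Z" and "a" in ASCII, which is why a range that spans both cases in one go picks them up.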


        Give a man a fish:  <%-(-(-(-<

        Yeah, parsing free-form stuff is hard :) but it's not exactly free-form, it's some kind of https://en.wikipedia.org/wiki/Citation#Sciences.2C_mathematics.2C_engineering.2C_physiology.2C_and_medicine ... so something something Biblio::Refbase???
        sub breakup {
            my $filename = $_[0];
            my $text = $_[1];
            my $re = qr{
                  (?<apair> (?<key>[^:;,\.\s]+) \s* : \s* (?<val>[^:;,\.\s]+) \s* ; )
                | (?<ppair> (?<key> \p{Uppercase_Letter}\w+ ) : (?<val> \p{Uppercase_Letter}+ ) )
                | (?<ppage> pp\.\s*\d+-\d+\. )
                | (?<version3> \d+\.\d+\.\d+ (?::\d+)? )
                | (?<version2> \d+\.\d+:\d+ )
                | (?<year> \( \d+ \) )
                ##| (?<name> (?: [^\s,\.]+, )+ )
                | (?<name> (?: \p{Uppercase_Letter}[.\w]+, )+ )
                ## | (?<title> (?:[^\s;.,\(\)]+\s*)+[.;,] )
                | (?<title> (?:[^\s;.,\(\)]+\s*){3,} [.;,?!]? )
                | (?<comma>,)
                | (?<notcommanotspace>[^\s,]+)
                | (?<space>\s+)
                | (?<other>.)
            }msx;
            my @parts;
            while( $text =~ m{$re}g ){
                my $it = { %+ };
                next if $it->{space};
                push @parts, $it;
            }
            return \@parts;
        }
        __END__
        [
          { year => "(1990)" },
          { title => "Methods of the Association of Official Analytical Chemists." },
          { notcommanotspace => "15." },
          { name => "Ed," },
          { comma => "," },
          { title => "Association of Official Analytical Chemists Washington;" },
          { name => "Dumet," },
          { name => "D.," },
          { name => "Benson," },
          { name => "E.E.," },
          { title => "The use of physical and biochemical studies to elucidate and reduce cryopreservation-induced damage in hydrated/desiccated plant germplasm " },
          { year => "(2000)" },
          { title => "Cryopreservation of Tropical Plant Germplasm: Current Research Progress and Application," },
          { ppage => "pp. 43-56." },
          { comma => "," },
          { notcommanotspace => "F." },
          { title => "Engelmann and H." },
          { title => "Takagi " },
          { notcommanotspace => "(eds.)" },
          { apair => "Tsukuba: JIRCAS;", key => "Tsukuba", val => "JIRCAS" },
          { apair => "Rome: IPGRI;", key => "Rome", val => "IPGRI" },
          { name => "Ferreira," },
          { name => "D.F.," },
          { title => "An\x{2B69}ses estat\x{ED34}icas por meio do SISVAR para Windows vers\x{4BE0}4." },
          { version2 => "0.1:225" },
          { year => "(2000)" },
          { title => "Reuni\x{4BE0}Anual da Regi\x{4BE0}Brasileira da Sociedade Internacional de Biometria," },
          { comma => "," },
          { name => "S\x{4BE0}Carlos," },
          { title => "SP: UFSCAR" },
        ]
Re: Regex for splitting a string on a semicolon (conditionally)
by AnomalousMonk (Archbishop) on Feb 12, 2015 at 05:14 UTC

    I strongly second the point made in the reply of LanX that you need to re-factor the unholy split regex you are attempting to use into something that is both readable and independently testable. (This assumes that Text::CSV will be of little help, which I also suspect is true.) Ideally, you will wind up with something like
        my @parts = split /;(?=\s$name)/, $text;
    (or qr{ ; (?= \s $name) }xms as I would rather write the regex). This leaves you with the still substantial problem of defining
        my $name = qr{ ... }xms;
    which will not be easy, but at least you will have an independently testable pattern at which you can throw "names" galore, as many as you can possibly think of, in order to test its efficacy.
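    As a rough starting point only, here is one way the pieces might be composed. The names $surname, $initials and split_refs are my own inventions for illustration, and the definitions are guesses that cover little more than the sample in the question (plus hyphenated surnames and initials, and lower-case initials as mentioned in the thread); real data will certainly need more cases. The /r flag used for trimming needs Perl 5.14 or later.

```perl
use strict;
use warnings;

# Assumed building blocks; the real data will need more cases than these.
my $surname  = qr{ \p{Lu} \p{Ll}+ (?: - \p{Lu} \p{Ll}+ )* }xms;   # Dumet, Smith-Jones
my $initials = qr{ (?: \p{L} \. -? )+ }xms;                       # D.  E.E.  J.-P.
my $name     = qr{ $surname , \s $initials , }xms;

# Hypothetical helper: split a reference list on "; " followed by a name.
sub split_refs {
    my ($text) = @_;
    # The lookahead leaves the leading space on each later piece, so trim it.
    return map { s/^\s+//r } split qr{ ; (?= \s $name ) }xms, $text;
}

# $name can be exercised on its own, independently of the split.
for my $candidate ('Dumet, D.,', 'Smith-Jones, J.-P.,', 'Rome: IPGRI') {
    printf "%-22s %s\n", $candidate,
        $candidate =~ /\A$name\z/ ? 'name' : 'not a name';
}
```

    Because every group is non-capturing, split returns only the references themselves, and each sub-pattern can be tested in isolation before it goes anywhere near the full list.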


    Give a man a fish:  <%-(-(-(-<

Re: Regex for splitting a string on a semicolon (conditionally)
by LanX (Saint) on Feb 12, 2015 at 04:51 UTC
    > my $part = $text =~ s/\;(?=(\s[A-Z]{1}[a-z]+,\s([a-zA-z]\.-?)+\,))/\n/g;

    > ...but this returns the number of matching semicolons ("2" in this case). This has been driving me absolutely nuts--any help is appreciated!

    $text =~ s/\;(?=(\s[A-Z]{1}[a-z]+,\s([a-zA-z]\.-?)+\,))/\n/g;

    performs the substitution on $text in place, but the expression itself evaluates to the number of substitutions made, not to the modified string.

    So drop $part and do my @parts = split("\n",$text);
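    A small sketch of the difference, using a made-up string and a simplified pattern. If you would rather keep the original string untouched, the /r flag (Perl 5.14 or later) makes s/// return the modified copy instead of the count.

```perl
use strict;
use warnings;

my $text = 'one; two; three';

# Without /r the substitution edits $copy in place and evaluates to the count.
my $copy  = $text;
my $count = ( $copy =~ s/;\s*/\n/g );
my @parts = split /\n/, $copy;

# With /r the original is left untouched and the new string is returned.
my @parts_r = split /\n/, $text =~ s/;\s*/\n/gr;

print "count: $count\n";
print "parts: @parts\n";
print "parts via /r: @parts_r\n";
```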

    Cheers Rolf

    PS: Je suis Charlie!

      Thank you so much. I feel somewhat embarrassed for missing that one! Making the change makes everything work as intended, thanks!
Re: Regex for splitting a string on a semicolon (conditionally)
by Ea (Chaplain) on Feb 12, 2015 at 10:18 UTC
    As I found out last summer, extracting citations from papers reliably is hard. If you're getting the references directly (no pre-cleaning), then after you solve the semicolon problem you're likely to find that about 1-5% of your citations fail on something else, because an author has formatted the reference in an intelligible but machine-confounding fashion (or mistyped it, or just missed a bit). You end up with regex after regex, just to catch all the perversities. The ADS project has likely had to solve this problem. You may be able to find some of their utilities on GitHub.

    Out of almost 3000 papers, I had to find references for about 50 by hand using the search interface and some intuition. The most innocent unfindable reference was in this paper where A. Einstein, Sit. Preus. Akad. (1919). would have been identifiable if Albert hadn't gone and published 2 papers in that journal that year.

    Update: I found it! Six months and no looming deadline has really improved my German.

    Sometimes I can think of 6 impossible LDAP attributes before breakfast.
Re: Regex for splitting a string on a semicolon (conditionally) (use Text::CSV;)
by Anonymous Monk on Feb 12, 2015 at 03:46 UTC

    ... use Text::CSV; ... sub breakup { ...

    Why aren't you using Text::CSV inside sub breakup {?

    Including "use Text::CSV;" isn't some magic spell that will avoid the suggestion that you actually use it.

      I'm using it in the larger context of my code to output the data (once broken up into individual references, then into author, year, etc. columns) into a .csv file. It has nothing to do with this subroutine--I just copied the beginning of the script verbatim. Sorry for the confusion!
        Ok :) here you go: "Rome: IPGRI" gets separated from its owner, then it's put back
        sub breakup {
            my $filename = $_[0];
            my $text = $_[1];
            my @stuff = split /;\s?/, $text;
            my @parts;
            while( @stuff ){
                if( $stuff[0] =~ m{ ^ \w+ : \s \w+ $ }xms ){
                    $parts[-1] .= ";".$stuff[0];
                } else {
                    push @parts, $stuff[0];
                }
                shift @stuff;
            }
            return \@parts;
        }
Re: Regex for splitting a string on a semicolon (conditionally)
by Anonymous Monk on Feb 12, 2015 at 04:47 UTC
    Why is your text in Latin-1? Is that Perlmonks or is it really Latin-1? You use utf8, do you get warnings about malformed characters?