Re: Regex for splitting a string on a semicolon (conditionally)

This format looks rather messed up. I doubt that your heuristics will be 100% correct, or that Text::CSV can help you here.

Your immediate problem, though, is that split also returns whatever is matched by capturing (...) groups in the pattern. Disable capturing by using (?:...) groups instead.

see perlretut#Non-capturing-groupings
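A minimal sketch of the difference, using a made-up sample string (the variable names and data are only illustrative, not taken from the original question):

```perl
use strict;
use warnings;

# Hypothetical sample data; the real input is not shown in this thread.
my $str = 'Smith, J.; Jones-Lee, K.; Brown, A.';

# Capturing group in the pattern: split also returns what the group matched.
my @with_captures = split /(;)\s*/, $str;

# Non-capturing group: only the fields themselves come back.
my @fields = split /(?:;)\s*/, $str;

print scalar(@with_captures), " vs ", scalar(@fields), "\n";   # prints "5 vs 3"
```

With the capturing version the separators show up as extra list elements between the fields, which is most likely the unexpected output you were seeing.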

update

> followed by a name (these sometimes have hyphens, but I'm not sure I've accounted for that), followed by initials.

I strongly recommend composing the pattern from sub-regexes like $name and $initial, plus the /x flag, for readability and maintainability.

You'd quickly discover some of the bugs I was able to spot (hint: initials are capital letters and A-z is not what you want, the first entry doesn't start with a name but with a date, and so on).
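Roughly what I mean, as a sketch only: the sub-regexes below ($name, $initial, $initials, $author, $sep) and the sample string are my own assumptions about the data, not the patterns from the original post.

```perl
use strict;
use warnings;

# Hypothetical building blocks; adjust each one to the real data.
my $name     = qr/ [A-Z][a-z]+ (?: - [A-Z][a-z]+ )* /x;   # surname, optional hyphenated parts
my $initial  = qr/ [A-Z] \. /x;                           # use [A-Za-z] (not A-z) if case varies
my $initials = qr/ $initial (?: -? $initial )* /x;        # "J." or "J.K." or "J.-K."
my $author   = qr/ $name , \s* $initials /x;

# Split on a semicolon only when something that looks like an author follows.
my $sep = qr/ ; \s* (?= $author ) /x;

# Made-up sample line.
my $str    = 'Smith, J.K.; Jones-Lee, K.; Brown, A., Some title; with a semicolon';
my @fields = split /$sep/, $str;

print "$_\n" for @fields;
```

Each piece can be tested on its own, and the final pattern contains no capturing groups, so split returns only the fields. The semicolon inside the title is left alone because no author-like text follows it.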

Good luck!

Cheers Rolf

PS: Je suis Charlie!

Re^2: Regex for splitting a string on a semicolon (conditionally) by grouse (Initiate) on Feb 12, 2015 at 09:03 UTC
The format is definitely not optimal, but this is the only way I can get this data from the source I'm using. I can have 10% loss on the data without being in trouble, though, so I'm trying to make do. The Text::CSV is for filing the sorted references into an output file--sorry to have included this red herring here!

Thank you so much for the capturing advice! I didn't know that captures were returned as extra list elements in this context, but after making the groups non-capturing everything works. I've never considered putting in named sub-regexes--thanks for the tip.

This is one of several steps in sorting the messy data--other areas deal with references with no authors. Some of the entries don't have capital initials, which is why I had `A-z` rather than `A-Z`. I didn't want to post a ton of lines of code (or examples--there are thousands) when my problem was just one regex operation.
Re^3: Regex for splitting a string on a semicolon (conditionally) by AnomalousMonk (Archbishop) on Feb 12, 2015 at 18:26 UTC
> Some of the entries don't have capital initials, which is why I had `A-z` rather than `A-Z`.

You may already be aware of this, but the important thing to remember about the `A-z` character class range is that it includes the [ \ ] ^ _ ` characters as well, and `A-Za-z` does not.

Give a man a fish: `<%-(-(-(-<`
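A quick way to check the range point above for yourself; this is only an illustrative sketch over the ASCII range, and the variable names are made up:

```perl
use strict;
use warnings;

# Collect every ASCII character matched by each character class.
my @wide   = grep { /[A-z]/ }    map { chr($_) } 0 .. 127;
my @narrow = grep { /[A-Za-z]/ } map { chr($_) } 0 .. 127;

# Characters matched by [A-z] but not by [A-Za-z].
my %is_letter = map { $_ => 1 } @narrow;
my @extra     = grep { !$is_letter{$_} } @wide;

print "extra: @extra\n";   # the six characters that sit between 'Z' and 'a'
```

So `[A-Za-z]` is the class to use when initials may be upper or lower case.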
Re^3: Regex for splitting a string on a semicolon (conditionally) by Anonymous Monk on Feb 12, 2015 at 09:19 UTC
Yeah, parsing free-form stuff is hard :) but it's not exactly free-form, it's some kind of https://en.wikipedia.org/wiki/Citation#Sciences.2C_mathematics.2C_engineering.2C_physiology.2C_and_medicine style ... so maybe something like Biblio::Refbase???

    sub breakup {
        my $filename = $_[0];
        my $text     = $_[1];
        # One alternative per token type; earlier alternatives win.
        my $re = qr{
              (?<apair> (?<key>[^:;,\.\s]+) \s* : \s* (?<val>[^:;,\.\s]+) \s* ; )
            | (?<ppair> (?<key> \p{Uppercase_Letter}\w+ ) : (?<val> \p{Uppercase_Letter}+ ) )
            | (?<ppage> pp\.\s\d+-\d+\. )
            | (?<version3> \d+\.\d+\.\d+ (?::\d+)? )
            | (?<version2> \d+\.\d+:\d+ )
            | (?<year> \( \d+ \) )
            ##| (?<name> (?: [^\s,\.]+, )+ )
            | (?<name> (?: \p{Uppercase_Letter}[.\w]+, )+ )
            ## | (?<title> (?:[^\s;.,]+\s)+[.;,] )
            | (?<title> (?:[^\s;.,]+\s*){3,} [.;,?!]? )
            | (?<comma>,)
            | (?<notcommanotspace>[^\s,]+)
            | (?<space>\s+)
            | (?<other>.)
        }msx;
        my @parts;
        while ( $text =~ m{$re}g ) {
            my $it = { %+ };            # copy the named captures of this match
            next if $it->{space};       # skip pure-whitespace tokens
            push @parts, $it;
        }
        return \@parts;
    }
    __END__
    [
      { year => "(1990)" },
      { title => "Methods of the Association of Official Analytical Chemists." },
      { notcommanotspace => "15." },
      { name => "Ed," },
      { comma => "," },
      { title => "Association of Official Analytical Chemists Washington;" },
      { name => "Dumet," },
      { name => "D.," },
      { name => "Benson," },
      { name => "E.E.," },
      { title => "The use of physical and biochemical studies to elucidate and reduce cryopreservation-induced damage in hydrated/desiccated plant germplasm " },
      { year => "(2000)" },
      { title => "Cryopreservation of Tropical Plant Germplasm: Current Research Progress and Application," },
      { ppage => "pp. 43-56." },
      { comma => "," },
      { notcommanotspace => "F." },
      { title => "Engelmann and H." },
      { title => "Takagi " },
      { notcommanotspace => "(eds.)" },
      { apair => "Tsukuba: JIRCAS;", key => "Tsukuba", val => "JIRCAS" },
      { apair => "Rome: IPGRI;", key => "Rome", val => "IPGRI" },
      { name => "Ferreira," },
      { name => "D.F.," },
      { title => "An\x{2B69}ses estat\x{ED34}icas por meio do SISVAR para Windows vers\x{4BE0}4." },
      { version2 => "0.1:225" },
      { year => "(2000)" },
      { title => "Reuni\x{4BE0}Anual da Regi\x{4BE0}Brasileira da Sociedade Internacional de Biometria," },
      { comma => "," },
      { name => "S\x{4BE0}Carlos," },
      { title => "SP: UFSCAR" },
    ]