split $data, $unquoted_value;

Ovid has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: split $data, $unquoted_value; by blokhead (Monsignor) on Sep 14, 2005 at 20:53 UTC
The fact that your test data contains an apostrophe that does not delimit a quotation makes things harder (the apostrophe in "not if it's quoted"). I don't think there's a way to do the Right Thing with that kind of fuzzy data, but according to your written specification, here's a way to do it. The trick is to think like a lexer. A quotation should be treated as one atomic entity, just like a character. The components that make up a sentence are then entire quotations and individual non-period characters. Broken down this way, the solution is straightforward: `my $data = qq[ this is some text. A period (".") usually terminates a statement. But not if it is quoted. Regardless of whether or not single quotes, '.', are used. And yes, "Mr. Ovid," even lines with a period in the middle of a quote. ]; my $doublequoted = qr/"[^"\\]*(?:\\.[^"\\]*)*"/m; my $singlequoted = qr/'[^'\\]*(?:\\.[^'\\]*)*'/m; my $sentence = qr/ (?: $singlequoted | $doublequoted | [^.] )* \. /xm; my @items = $data =~ /($sentence)/g; print "[$_]\n" for @items;` (delimited quote regexes stolen shamelessly from Abigail-II's Re: regex regexen) The only thing to be careful about is that a quotation match is attempted first in the $sentence definition. This solution does not require the period(s) inside a quotation to be at the end of the quotation, which is a problem I think some of the other solutions suffer from.
Update: to allow for non-terminator "." characters inside floating point numbers (as per Re^2: split $data, $unquoted_value;), here is a rough addition: `my $float = qr/\d+\.\d+/; my $sentence = qr/ (?: $float | $singlequoted | $doublequoted | [^.] )+ \. /xm; $data .= "g = 9.8 m/s.";` You can add other exceptions similarly... blokhead
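The lexer idea above can be sketched as a small runnable script. This is only an illustration under a few assumptions: the `sentences` helper and the Prolog-style sample line are made up for the demo, and the quote patterns are the usual unrolled-loop form rather than anything confirmed beyond this thread.

```perl
use strict;
use warnings;

# Quoted strings with backslash escapes (unrolled-loop form).
# /s lets an escaped newline count as an escaped character.
my $doublequoted = qr/"[^"\\]*(?:\\.[^"\\]*)*"/s;
my $singlequoted = qr/'[^'\\]*(?:\\.[^'\\]*)*'/s;
my $float        = qr/\d+\.\d+/;

# A sentence is a run of atomic tokens followed by a period.
# Order matters: floats and quotations are tried before the
# single-character fallback [^.].
my $sentence = qr/ (?: $float | $singlequoted | $doublequoted | [^.] )+ \. /x;

# Hypothetical helper: return every sentence found in $text,
# terminator included.
sub sentences {
    my ($text) = @_;
    return $text =~ /($sentence)/g;
}

my $data = q{loves("Mr. Poe", perl) :- version(perl, Version), Version >= 5.8. likes('a.b', x).};
print "[$_]\n" for sentences($data);
```

Leading whitespace stays attached to each sentence, so trim it afterwards if it matters. An unbalanced apostrophe (as in "it's") will still start a bogus single-quoted token, which is the fuzziness mentioned above.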
Re: split $data, $unquoted_value; by jch341277 (Sexton) on Sep 14, 2005 at 20:43 UTC
I found Text::Sentence, which appears to work on your example... Update: Changed "this is some text." to "This is some text." because the module apparently uses capitalization to identify sentence boundaries. So it might not work for you...
Re^2: split $data, $unquoted_value; by Ovid (Cardinal) on Sep 14, 2005 at 21:20 UTC
Tempting, but as you suspected, it doesn't quite work. I actually need this to be able to better parse Prolog programs. Cheers, Ovid New address of my CGI Course.
Re: split $data, $unquoted_value; by Codon (Friar) on Sep 14, 2005 at 20:40 UTC
Put this in your `split` and see how it fits: `my $re = qr/(?<!["'])(?<=\.)\s*(?!["'])/;` This will split behind the period, but only if it is not (directly) quoted. Edit: added \s* to strip out the space between sentences. Ivan Heffner Sr. Software Engineer, DAS Lead WhitePages.com, Inc.
Re: split $data, $unquoted_value; by GrandFather (Saint) on Sep 14, 2005 at 20:41 UTC
A regex that catches the common cases is fairly easy. However, handling nasty nested cases with full stops that are not adjacent to a quote heads into CSV territory. Here's the simple case with an example of not handling something a little nastier: `use warnings; use strict; my $str = do {local $/ = ''; <DATA>;}; print join "^", split /(?<!['"])\.(?!['"])/, $str; __DATA__ this is some text. A period (".") usually terminates a statement. But not if it's quoted. Regardless of whether or not single quotes, '.', are used. "Quoted sentences with a . in the middle." could be harder to manage however.` Generates: `this is some text^ A period (".") usually terminates a statement^ But not if it's quoted^ Regardless of whether or not single quotes, '.', are used^ "Quoted sentences with a ^ in the middle." could be harder to manage however` Perl is Huffman encoded by design.
Re: split $data, $unquoted_value; by QM (Parson) on Sep 14, 2005 at 21:45 UTC
Do you need to consider periods in decimals and abbreviations? `Mr. Smith paid Jr. $7.43 for 0.4 lbs. of coffee.` You'll find that the period is overused and highly context sensitive. There are even ambiguities that cannot be determined contextually, but need semantic hints (or the original author to clarify). Perhaps other language designers avoided the period on purpose? -QM -- Quantum Mechanics: The dreams stuff is made of
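To make the point about overloaded periods concrete, here is a small sketch that counts how many pieces two simple splits produce on the sample line. The helper names `naive_split` and `decimal_aware_split` are invented for this illustration, and neither is meant as a real solution.

```perl
use strict;
use warnings;

# Every period is treated as a terminator.
sub naive_split {
    my ($text) = @_;
    return split /\./, $text;
}

# A period only terminates when no digit follows it, which protects
# 7.43 and 0.4 but still breaks on the abbreviations.
sub decimal_aware_split {
    my ($text) = @_;
    return split /\.(?!\d)/, $text;
}

my $text = 'Mr. Smith paid Jr. $7.43 for 0.4 lbs. of coffee.';
my @naive   = naive_split($text);
my @decimal = decimal_aware_split($text);
printf "naive: %d pieces, decimal-aware: %d pieces\n",
    scalar @naive, scalar @decimal;
```

The lookahead removes the breaks inside the numbers, but "Mr.", "Jr." and "lbs." still end a piece each, so one sentence comes back as several fragments either way.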
Re^2: split $data, $unquoted_value; by Ovid (Cardinal) on Sep 14, 2005 at 21:49 UTC
Ah, crud. Yes. I do need to consider decimals (though not abbreviations). `loves("Mr. Poe", perl) :- version(perl, Version), Version >= 5.8.` Cheers, Ovid New address of my CGI Course.
Re: split $data, $unquoted_value; by kvale (Monsignor) on Sep 14, 2005 at 20:44 UTC
CSV and, more generally, xSV documents have quoting problems similar to the above, so you might look at the module Text::xSV for pointers. I don't know if Text::xSV will handle quotes around only a portion of a field, but it is a start on the finite state machine that would handle these sorts of problems. -Mark
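As a rough sketch of the finite state machine idea (not taken from Text::xSV; the `split_unquoted` name and the three states are assumptions made for this example), the splitter below walks the text one character at a time and only treats a period as a terminator while outside double quotes. Single quotes are left out on purpose because stray apostrophes make them ambiguous.

```perl
use strict;
use warnings;

# Split $text after every period that is outside double quotes.
# States: 'plain', 'quoted', and 'escape' (the character right after
# a backslash inside a quotation).
sub split_unquoted {
    my ($text) = @_;
    my $state   = 'plain';
    my $current = '';
    my @records;

    for my $c (split //, $text) {
        $current .= $c;
        if ($state eq 'escape') {
            $state = 'quoted';           # escaped char consumed
        }
        elsif ($state eq 'quoted') {
            if    ($c eq '\\') { $state = 'escape' }
            elsif ($c eq '"')  { $state = 'plain' }
        }
        elsif ($c eq '"') {
            $state = 'quoted';
        }
        elsif ($c eq '.') {
            push @records, $current;     # unquoted period closes the record
            $current = '';
        }
    }
    # Keep trailing text that never reached a terminator.
    push @records, $current if length $current;
    return @records;
}

print "[$_]\n" for split_unquoted('say("a.b \" c."). done.');
```

Adding a state for numbers would cover the decimal case as well, at the cost of a little lookahead.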
Re: split $data, $unquoted_value; by Roy Johnson (Monsignor) on Sep 14, 2005 at 21:28 UTC
You already have a split that does what you're describing in your update. `@data = split /($RE{quoted})|\./;` By putting the "unless" portion in capturing parens, it is captured and included with the results. I think that when a dot is encountered, it will generate an empty data item, so you might want to grep out empty results. `@data = grep length, split /($RE{quoted})|\./;`
Update: Gah, that's not the same thing. You really need to do two passes: split as above, then join anything that isn't separated by an empty string element. Something like: `my $accum = ''; @data = map { if (length) { $accum .= $_; () } else { my $x = $accum; $accum = ''; $x } } split /($RE{quoted})|\./; push @data, $accum;` Caveat: I can't test code today. Caution: Contents may have been coded under pressure.
Re^2: split $data, $unquoted_value; by Ovid (Cardinal) on Sep 14, 2005 at 22:12 UTC
I also need to capture the split value (in this case, just a period). However, you did remind me of that useful feature of split returning items in capturing parens. If my theoretical `Data::Record` module comes to light, I can use that to control the chomp-like behavior. Cheers, Ovid New address of my CGI Course.
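The capturing-parens behaviour of `split` mentioned above can be sketched as follows. The `records` helper and its `$keep_terminator` flag are hypothetical (they are not the API of any existing module), and the sketch ignores quoting entirely to keep the focus on the chomp-like switch.

```perl
use strict;
use warnings;

# split with a capture group returns the separators as well:
# 'a.b.' becomes ('a', '.', 'b', '.').
sub records {
    my ($text, $keep_terminator) = @_;
    my @parts = split /(\.)/, $text;
    my @records;
    while (@parts) {
        my $body = shift @parts;
        my $term = @parts ? shift @parts : '';   # last record may lack one
        push @records, $keep_terminator ? $body . $term : $body;
    }
    return @records;
}

print join('|', records('a.b.', 1)), "\n";   # terminators kept
print join('|', records('a.b.', 0)), "\n";   # terminators dropped
```

Because the separator arrives as its own list element, the caller can decide per record whether to keep it, which is the chomp-like choice.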
Re^2: split $data, $unquoted_value; by duelafn (Parson) on Sep 15, 2005 at 16:15 UTC
Great idea, this works. `#!/usr/bin/perl use strict; use warnings; use Regexp::Common qw/delimited number/; use YAML; my $x = 'Foo "ba . r". Baz. $2.67 per pound. is a "." in a sentence really a "."?'; my @x = split /($RE{delimited}{-delim=>'"'}|$RE{num}{real}|\.)/, $x; my @y = (''); for (@x) { $y[-1] .= $_; push @y, '' if $_ eq '.'; } pop @y if $y[-1] eq ''; print Dump \@y;` Good Day, Dean
Re: split $data, $unquoted_value; by ruoso (Curate) on Sep 14, 2005 at 22:52 UTC
I have the same problem when I want to split CSV files into columns. What I found is that I couldn't do it with a regexp, so I did it the way it would be done without regexps... Basically, I read character by character and push the contexts I enter... For instance.... `"This is a column","Yes, I know",12323,23123.23,"This is a \"column\""` should be split into `This is a column Yes, I know 12323 23123.23 This is a "column"` OK, no more talking... the code speaks for itself... `#!/usr/bin/perl -w use strict; my $origin = '"This is a column","Yes, I know",123123,23123.23,"This is a \"column\""'; my @cols = parse_line($origin); print join("\n", @cols)."\n"; sub parse_line { my $line = shift; my @contexts; my $context = ""; my $column; my @cols; my $string_delim = '"'; my $escape_char = "\\"; my $field_delim = ','; for (my $i = 0; $i < length $line; $i++) { my $c = substr($line, $i, 1); if ($c eq $string_delim) { if ($context eq "string") { $context = pop @contexts; } elsif ($context eq "escape") { $column .= $c; $context = pop @contexts; } else { push @contexts, $context; $context = "string"; } } elsif ($c eq $escape_char) { if ($context eq "escape") { $column .= $c; $context = pop @contexts; } else { push @contexts, $context; $context = "escape"; } } elsif ($c eq $field_delim) { if ($context eq "string") { $column .= $c; } elsif ($context eq "escape") { $column .= $c; $context = pop @contexts; } else { push @cols, $column; undef $column; } } else { $column .= $c; } if ($i == length($line) - 1) { push @cols, $column; undef $column; } } return @cols; }` daniel
Re^2: split $data, $unquoted_value; by motobói (Beadle) on Sep 15, 2005 at 13:33 UTC
Ruoso, wouldn't Text::CSV and related modules (XS, PP) work for you? I suppose you know this module, so what am I missing? `#!/usr/bin/perl -w use Text::CSV_XS; my $csv = Text::CSV_XS->new( { 'escape_char' => '\\' } ); my $line = '"This is a column","Yes, I know",12323,23123.23,"This is a \"column\""'; if ( $csv->parse($line) ) { print join "\n", $csv->fields, "\n"; } else { print "Could not parse line\n", $csv->error_input, "\n"; }` motobói
Re^3: split $data, $unquoted_value; by ruoso (Curate) on Sep 15, 2005 at 17:09 UTC
Well... Good question ;)... Maybe I thought it was easier to write the code than to look for a module... how dumb I am... anyway, it was fun to write it... :) daniel
Re: split $data, $unquoted_value; by demerphq (Chancellor) on Sep 15, 2005 at 10:40 UTC
Here's my go. It assumes that if a quote character is used as a quote it won't be surrounded on both sides by an alpha char. `$_=<<END; this is some text. A period (".") usually terminates a statement. But not if it's quoted. Regardless of whether or not single quotes, '.', are used. END #' my $accept=qr/(?: (?<![a-z]) (?: [''] (?:[^''\\]+|\\.)* [''] | [""] (?:[^""\\]+|\\.)* [""] ) (?![a-z]) | [^\\''"".]+ | [''""] )+/xi; my @parts; while (/($accept|([.][\r\n]))/gxi) { push @parts,$1 unless $2; } print "part1: $_\n" for @parts; my @parts2=split / (?!$accept)([.][\r\n])/xi, $_; print "---\n"; print "part2: $_\n" for @parts; __END__ part1: this is some text part1: A period (".") usually terminates a statement part1: But not if it's quoted part1: Regardless of whether or not single quotes, '.', are used --- part2: this is some text part2: A period (".") usually terminates a statement part2: But not if it's quoted part2: Regardless of whether or not single quotes, '.', are used` Just in case anyone wonders, the `['']` trick is harmless: perl just ignores dupes in a char class, but it keeps some syntax highlighting editors (like mine) from getting confused. --- $world=~s/war/peace/g
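A quick way to convince yourself that the doubled-quote character class behaves like the plain one is a check along these lines. The variable names are only for this sketch.

```perl
use strict;
use warnings;

my $single  = qr/^[']+$/;     # the quote written once in the class
my $doubled = qr/^['']+$/;    # the same class with the quote written twice

# Both patterns should agree on every input, matching or not.
for my $s ("'", "'''", "it's") {
    my $a = ($s =~ $single)  ? 1 : 0;
    my $b = ($s =~ $doubled) ? 1 : 0;
    printf "%-5s %s\n", $s, $a == $b ? 'same result' : 'DIFFERENT';
}
```

The duplicate only exists to keep the quote count even for editors that pair quotes naively.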