Re: split $data, $unquoted_value;
by blokhead (Monsignor) on Sep 14, 2005 at 20:53 UTC
The fact that your test data contains an apostrophe that does not delimit a quotation (the apostrophe in "not if it's quoted") makes things harder. I don't think there's a way to do the Right Thing with that kind of fuzzy data, but going by your written specification, here's a way to do it.
The trick is to think like a lexer. A quotation should be treated as one atomic entity, just like a character. Now the components that make up a sentence are entire quotations and individual non-period characters. Breaking it down this way, it's straightforward to see:
my $data = qq[
this is some text.
A period (".") usually terminates a statement.
But not if it is quoted.
Regardless of whether or not single quotes, '.', are used.
And yes, "Mr. Ovid," even lines with a period in the middle of a quote.
];
my $doublequoted = qr/"[^"\\]*(?:\\.[^"\\]*)*"/m;
my $singlequoted = qr/'[^'\\]*(?:\\.[^'\\]*)*'/m;
my $sentence = qr/ (?: $singlequoted | $doublequoted | [^.] )* \. /xm;
my @items = $data =~ /($sentence)/g;
print "[$_]\n" for @items;
(delimited quote regexes stolen shamelessly from Abigail-II's Re: regex regexen)
The one thing to be careful about is ordering: the quotation alternatives must come before [^.] in the $sentence definition, so that a quotation match is attempted first.
This solution does not require the period(s) inside a quotation to be at the end of the quotation, which is a problem I think some of the other solutions suffer from.
Update: to allow for non-terminator "." characters inside floating point numbers (as per Re^2: split $data, $unquoted_value;), here is a rough addition:
my $float = qr/\d+\.\d+/;
my $sentence = qr/ (?: $float | $singlequoted | $doublequoted | [^.] )+ \. /xm;
$data .= "g = 9.8 m/s.";
You can add other exceptions similarly...
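For instance, here is a minimal sketch of one more exception, for a handful of title abbreviations. The $abbrev pattern and its list of abbreviations are assumptions made up for illustration, not part of the code above; a real list would need to be much longer. The new alternative goes at the front, so it is tried before [^.] can swallow its letters one at a time.

```perl
use strict;
use warnings;

my $doublequoted = qr/"[^"\\]*(?:\\.[^"\\]*)*"/;
my $singlequoted = qr/'[^'\\]*(?:\\.[^'\\]*)*'/;
my $float        = qr/\d+\.\d+/;
# Hypothetical exception: a few abbreviations whose period is not a terminator.
my $abbrev       = qr/\b(?:Mr|Mrs|Dr|Jr)\./;

# Every exception is tried before the catch-all [^.].
my $sentence = qr/ (?: $abbrev | $float | $singlequoted | $doublequoted | [^.] )+ \. /x;

my $text  = 'Mr. Smith paid $7.43 for coffee. Dr. Jones paid "a lot." too.';
my @items = $text =~ /($sentence)/g;
print "[$_]\n" for @items;
```

Each item keeps its terminating period, and every item after the first keeps the whitespace that preceded it, just as in the original version.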
Re: split $data, $unquoted_value;
by jch341277 (Sexton) on Sep 14, 2005 at 20:43 UTC
I found Text::Sentence that appears to work on your example...
When executed generates this
Update: Changed "this is some text." to "This is some text." because the module apparently uses capitalization to identify sentence boundaries. So it might not work for you...
Re: split $data, $unquoted_value;
by Codon (Friar) on Sep 14, 2005 at 20:40 UTC
Put this in your split and see how it fits:
my $re = qr/(?<!["'])(?<=\.)\s*(?!["'])/;
This will split behind the period, but only if it is not (directly) quoted.
Edit: added \s* to strip out space between sentences.
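A minimal usage sketch, assuming the text sits in a single string (the sample strings here are made up from the test data in this thread). Note that both lookbehinds inspect the same preceding character, so in practice it is the trailing lookahead that keeps a period from splitting when a quote follows it. The second split shows the limit of the approach: a period inside a quotation but not right next to the closing quote still splits.

```perl
use strict;
use warnings;

my $re = qr/(?<!["'])(?<=\.)\s*(?!["'])/;

# Periods directly followed by a quote are left alone.
my $data  = q{this is some text. A period (".") usually terminates a statement. But not if it's quoted.};
my @parts = split $re, $data;
print "[$_]\n" for @parts;

# Limitation: the period after "Mr" is inside a quotation, but not
# adjacent to the closing quote, so the split still happens there.
my @broken = split $re, q{And yes, "Mr. Ovid," has a period inside a quote.};
print scalar(@broken), " pieces\n";
```

Because the pattern matches behind the period, each piece keeps its own terminating period.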
Ivan Heffner
Sr. Software Engineer, DAS Lead
WhitePages.com, Inc.
Re: split $data, $unquoted_value;
by GrandFather (Saint) on Sep 14, 2005 at 20:41 UTC
A regex that catches the common cases is fairly easy. However, handling nasty nested cases with full stops that are not adjacent to a quote heads into CSV territory. Here's the simple case, with an example of not handling something a little nastier:
use warnings;
use strict;
my $str = do {local $/ = ''; <DATA>;};
print join "^", split /(?<!['"])\.(?!['"])/, $str;
__DATA__
this is some text.
A period (".") usually terminates a statement.
But not if it's quoted.
Regardless of whether or not single quotes, '.', are used.
"Quoted sentences with a . in the middle." could be harder to manage however.
Generates:
this is some text^
A period (".") usually terminates a statement^
But not if it's quoted^
Regardless of whether or not single quotes, '.', are used^
"Quoted sentences with a ^ in the middle." could be harder to manage however
Perl is Huffman encoded by design.
Re: split $data, $unquoted_value;
by QM (Parson) on Sep 14, 2005 at 21:45 UTC
Do you need to consider periods in decimals and abbreviations?
Mr. Smith paid Jr. $7.43 for 0.4 lbs. of coffee.
You'll find that the period is overused and highly context sensitive. There are even ambiguities that cannot be determined contextually, but need semantic hints (or the original author to clarify).
Perhaps other language designers avoided the period on purpose?
-QM
--
Quantum Mechanics: The dreams stuff is made of
| [reply] [d/l] |
|
|
loves("Mr. Poe", perl) :-
version(perl, Version),
Version >= 5.8.
Re: split $data, $unquoted_value;
by kvale (Monsignor) on Sep 14, 2005 at 20:44 UTC
CSV and, more generally, xSV documents have similar quoting problems to the above, so you might look at the module Text::xSV for pointers. I don't know if Text::xSV will handle quotes around only a portion of a field, but it is a start on the finite state machine that would handle these sorts of problems.
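To make the finite state machine idea concrete, here is a rough sketch. It is not taken from Text::xSV; split_unquoted is a hypothetical helper written for this example. It assumes that only double quotes delimit quotations (apostrophes are too ambiguous, as noted elsewhere in this thread) and that a backslash inside a quotation escapes the next character.

```perl
use strict;
use warnings;

# Hypothetical helper: split $text into sentences at periods that fall
# outside double-quoted spans.
sub split_unquoted {
    my ($text) = @_;
    my @sentences;
    my $current = '';
    my $state   = 'plain';    # one of 'plain', 'quoted', 'escape'

    for my $c (split //, $text) {
        $current .= $c;
        if ($state eq 'escape') {
            $state = 'quoted';            # the escaped character is plain data
        }
        elsif ($state eq 'quoted') {
            if    ($c eq '\\') { $state = 'escape' }
            elsif ($c eq '"')  { $state = 'plain' }
        }
        elsif ($c eq '"') {
            $state = 'quoted';
        }
        elsif ($c eq '.') {
            push @sentences, $current;    # unquoted period ends the sentence
            $current = '';
        }
    }
    # Keep any trailing text that never reached a terminating period.
    push @sentences, $current if $current =~ /\S/;
    return @sentences;
}

my @s = split_unquoted('A period (".") ends it. "Mr. Ovid," he said.');
print "[$_]\n" for @s;
```

The state variable does the work that lookarounds cannot: a period is only a terminator when the machine is in the 'plain' state, no matter how far it is from the nearest quote.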
Re: split $data, $unquoted_value;
by Roy Johnson (Monsignor) on Sep 14, 2005 at 21:28 UTC
You already have a split that does what you're describing in your update.
@data = split /($RE{quoted})|\./;
By putting the "unless" portion in capturing parens, it is captured and included with the results. I think that when a dot is encountered, it will generate an empty data item, so you might want to grep out empty results.
@data = grep length, split /($RE{quoted})|\./;
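A small illustration of that split behaviour, using made-up sample strings: with capturing parens in the pattern, the separators come back interleaved with the fields, and a capture group that did not take part in a match comes back as undef.

```perl
use strict;
use warnings;

# Without captures the separators are dropped.
my @plain    = split /\./,   'one.two.three';
# With captures each separator is returned between the fields.
my @captured = split /(\.)/, 'one.two.three';
print join('|', @captured), "\n";

# When the alternation matches on the side without parens, the capture
# slot is still returned, but as undef.
my @mixed = split /(")|\./, 'a.b"c';
print join('|', map { defined($_) ? $_ : 'undef' } @mixed), "\n";
```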
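Update: Gah, that's not the same thing. You really need to do two passes: split as above, then join anything that isn't separated by an empty element (split returns undef for the capture group when the separator was a plain dot). Something like:
# Assumes "use Regexp::Common;" and that the text to split is in $_.
my $accum = '';
@data = map {
if (defined $_) { $accum .= $_; () }              # text or quoted string: keep accumulating
else { my $x = $accum; $accum = ''; $x }          # undef marks a dot: emit the sentence so far
}
split /($RE{quoted})|\./;
push @data, $accum if length $accum;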
Caveat: I can't test code today.
Caution: Contents may have been coded under pressure.
| [reply] [d/l] [select] |
|
|
I also need to capture the split value (in this case, just a period). However, you did remind me of that useful feature of split returning items in capturing parens. If my theoretical Data::Record module comes to light, I can use that to control the chomp-like behavior.
| [reply] |
|
|
#!/usr/bin/perl
use strict;
use warnings;
use Regexp::Common qw/delimited number/;
use YAML;
my $x = 'Foo "ba . r". Baz. $2.67 per pound. is a "." in a sentence really a "."?';
my @x = split /($RE{delimited}{-delim=>'"'}|$RE{num}{real}|\.)/, $x;
my @y = ('');
for (@x) {
$y[-1] .= $_;
push @y, '' if $_ eq '.';
}
pop @y if $y[-1] eq '';
print Dump \@y;
Re: split $data, $unquoted_value;
by ruoso (Curate) on Sep 14, 2005 at 22:52 UTC
I have the same problem when I want to split CSV files into their columns. I concluded that I couldn't do it with a regexp, so I did it the way it would be done without regexps...
Basically, I read character by character and push the contexts I enter. For instance:
"This is a column","Yes, I know",12323,23123.23,"This is a \"column\""
should be split into
This is a column
Yes, I know
12323
23123.23
This is a "column"
OK, no more talking... the code speaks for itself...
#!/usr/bin/perl -w
use strict;
my $origin = '"This is a column","Yes, I know",123123,23123.23,"This is a \"column\""';
my @cols = parse_line($origin);
print join("\n", @cols)."\n";
# Walk the line one character at a time. @contexts is a stack of the
# contexts we are nested in; $context is the current one.
sub parse_line {
    my $line = shift;
    my @contexts;
    my $context = "";
    my $column;
    my @cols;
    my $string_delim = '"';
    my $escape_char = "\\";
    my $field_delim = ',';
    for (my $i = 0; $i < length $line; $i++) {
        my $c = substr($line, $i, 1);
        if ($c eq $string_delim) {
            if ($context eq "string") {
                # closing quote: return to the enclosing context
                $context = pop @contexts;
            } elsif ($context eq "escape") {
                # escaped quote: keep it as data
                $column .= $c;
                $context = pop @contexts;
            } else {
                push @contexts, $context;
                $context = "string";
            }
        } elsif ($c eq $escape_char) {
            if ($context eq "escape") {
                $column .= $c;
                $context = pop @contexts;
            } else {
                push @contexts, $context;
                $context = "escape";
            }
        } elsif ($c eq $field_delim) {
            if ($context eq "string") {
                $column .= $c;
            } elsif ($context eq "escape") {
                $column .= $c;
                $context = pop @contexts;
            } else {
                push @cols, $column;
                undef $column;
            }
        } else {
            $column .= $c;
            # an escaped ordinary character ends the escape context too
            $context = pop @contexts if $context eq "escape";
        }
        if ($i == length($line) - 1) {
            push @cols, $column;
            undef $column;
        }
    }
    return @cols;
}
| [reply] [d/l] [select] |
|
|
Ruoso, wouldn't Text::CSV and related modules (XS, PP) work for you? I suppose you know this module, so what am I missing?
#!/usr/bin/perl -w
use Text::CSV_XS;
my $csv = Text::CSV_XS->new( { 'escape_char' => '\\' } );
my $line = '"This is a column","Yes, I know",12323,23123.23,"This is a \"column\""';
if ( $csv->parse($line) ) {
print join "\n", $csv->fields, "\n";
}
else {
print "Could not parse line\n", $csv->error_input, "\n";
}
motobói
Re: split $data, $unquoted_value;
by demerphq (Chancellor) on Sep 15, 2005 at 10:40 UTC
Here's my go. It assumes that if a quote is used as a quote it won't be surrounded on both sides by an alpha char.
$_=<<END;
this is some text.
A period (".") usually terminates a statement.
But not if it's quoted.
Regardless of whether or not single quotes, '.', are used.
END
#'
my $accept=qr/(?:
(?<![a-z])
(?:
[''] (?:[^''\\]+|\\.)* ['']
|
[""] (?:[^""\\]+|\\.)* [""]
)
(?![a-z])
|
[^\\''"".]+
|
[''""]
)+/xi;
my @parts;
while (/($accept|([.][\r\n]*))/gxi)
{
push @parts,$1 unless $2;
}
print "part1: $_\n" for @parts;
my @parts2=split / (?!$accept)([.][\r\n]*)/xi, $_;
print "---\n";
print "part2: $_\n" for @parts;
__END__
part1: this is some text
part1: A period (".") usually terminates a statement
part1: But not if it's quoted
part1: Regardless of whether or not single quotes, '.', are used
---
part2: this is some text
part2: A period (".") usually terminates a statement
part2: But not if it's quoted
part2: Regardless of whether or not single quotes, '.', are used
Just in case anyone wonders, the [''] trick is harmless, perl just ignores dupes in a char class, but it keeps some syntax highlighting editors (like mine) from getting confused.
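A quick way to convince yourself, as a small sketch with a made-up sample string: the doubled and undoubled character classes find exactly the same characters.

```perl
use strict;
use warnings;

my $text = q{it's "quoted"};

# One apostrophe and one double quote in the class...
my @single  = $text =~ /['"]/g;
# ...versus the doubled form used above to keep editors happy.
my @doubled = $text =~ /[''""]/g;

print "same matches\n" if "@single" eq "@doubled";
```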
---
$world=~s/war/peace/g