Ovid has asked for the wisdom of the Perl Monks concerning the following question:

A common problem that I face in parsing complex data is needing to split the data on an unquoted value. For example, consider the following text.

this is some text. A period (".") usually terminates a statement. But not if it's quoted. Regardless of whether or not single quotes, '.', are used.

It would be nice to be able to split that into 4 individual records but just splitting on a period won't work. However, this problem is general enough that it would be nice to create a "super split" that will split data into discrete elements, but only if the data you are splitting on matches certain more complex parameters (such as being quoted, in this case).

I haven't seen a module that offers this general functionality but it's possible I missed something. Can anyone offer suggestions? Something for the specific case would be fine, but a general purpose solution would be awesome.

Update: after reading the replies, a different strategy occurs to me. Supplying an "unless" option would be helpful.

use Regexp::Common;
use Data::Record; # doesn't exist

my $record = Data::Record->new(
    split  => qr/\./,
    unless => $RE{quoted},
);
my @data = $record->split($data);

Internally, it would be a bit inefficient in that it would have to read all of the data at once. Then, it would go through the data and find all text that matches the "unless" and "split" regexen and replace that with a unique token that does not match the split token. Then, it could just split the data. It iterates over the resulting records and replaces the tokens with the original text. I believe Filter::Simple used a similar strategy.
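
A minimal sketch of that protect, split, restore idea. The helper name split_unless and the NUL-delimited placeholder are assumptions made for illustration, not the interface of any existing module. It also assumes the text contains no NUL bytes and that the split regex cannot match NUL or digits.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of any CPAN module.
# Assumes $text contains no NUL bytes and that $split_re
# cannot match NUL or digits (the placeholder is "\0<index>\0").
sub split_unless {
    my ($text, $split_re, $unless_re) = @_;
    my @saved;

    # 1. Replace every protected span with a numbered placeholder.
    (my $masked = $text) =~ s{($unless_re)}{
        push @saved, $1;
        "\0" . $#saved . "\0";
    }ge;

    # 2. Split the masked text; delimiters inside protected spans are hidden.
    my @records = split $split_re, $masked;

    # 3. Put the original text back in place of each placeholder.
    s/\0(\d+)\0/$saved[$1]/g for @records;
    return @records;
}

# Quoted-string pattern without capture groups (single or double quotes).
my $quoted = qr/"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*'/;
my $text   = q{this is some text. A period (".") usually terminates a statement. But not if it is quoted. Regardless of whether or not single quotes, '.', are used.};

my @records = split_unless($text, qr/\.\s*/, $quoted);
print "[$_]\n" for @records;
```

The sample text says "it is" rather than "it's" on purpose: a stray apostrophe would be taken as an opening quote by the "unless" pattern. As written, the split drops the period itself; wrapping the split regex in capturing parentheses would return it as a separate element.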

Cheers,
Ovid

New address of my CGI Course.

Re: split $data, $unquoted_value;
by blokhead (Monsignor) on Sep 14, 2005 at 20:53 UTC
    The fact that your test data contains an apostrophe that does not delimit a quotation makes things harder (the apostrophe in "not if it's quoted"). I don't think there's a way to do the Right Thing with that kind of fuzzy data, but according to your written specification, here's a way to do it.

    The trick is to think like a lexer. A quotation should be treated as one atomic entity, just like a character. Now the components that make up a sentence are entire quotations and individual non-period characters. Breaking it down this way, it's straightforward to see:

    my $data = qq[
    this is some text.
    A period (".") usually terminates a statement.
    But not if it is quoted.
    Regardless of whether or not single quotes, '.', are used.
    And yes, "Mr. Ovid," even lines with a period in the middle of a quote.
    ];

    my $doublequoted = qr/"[^"\\]*(?:\\.[^"\\]*)*"/m;
    my $singlequoted = qr/'[^'\\]*(?:\\.[^'\\]*)*'/m;
    my $sentence     = qr/ (?: $singlequoted | $doublequoted | [^.] )* \. /xm;

    my @items = $data =~ /($sentence)/g;
    print "[$_]\n" for @items;

    (delimited quote regexes stolen shamelessly from Abigail-II's Re: regex regexen) The only thing is to be careful that a quotation match is attempted first in the $sentence definition.

    This solution does not require the period(s) inside a quotation to be at the end of the quotation, which is a problem I think some of the other solutions suffer from.

    Update: to allow for non-terminator "." characters inside floating point numbers (as per Re^2: split $data, $unquoted_value;), here is a rough addition:

    my $float    = qr/\d+\.\d+/;
    my $sentence = qr/ (?: $float | $singlequoted | $doublequoted | [^.] )+ \. /xm;
    $data .= "g = 9.8 m/s.";

    You can add other exceptions similarly...
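
One way the "add other exceptions" step could be generalized is to build the record regex from a list of atom patterns. This is only a sketch: split_records is a hypothetical name, and the sample input is the Prolog clause Ovid gives elsewhere in the thread with a second clause added for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): returns the records in $text,
# each ending with $terminator. Every pattern in @atoms is treated as one
# indivisible unit, so a terminator inside an atom does not end the record.
sub split_records {
    my ($text, $terminator, @atoms) = @_;

    # qr// objects stringify with their own flags, so joining them is safe.
    # With no atoms, fall back to a pattern that never matches.
    my $atom = @atoms ? join('|', @atoms) : '(?!)';

    # Atoms are tried first; otherwise take one character that does not
    # start a terminator.
    my $record = qr/ (?: $atom | (?!$terminator) . )* $terminator /xs;

    return $text =~ /\s*($record)/g;
}

my $doublequoted = qr/"[^"\\]*(?:\\.[^"\\]*)*"/;
my $float        = qr/\d+\.\d+/;

my $prolog  = q{loves("Mr. Poe", perl) :- version(perl, Version), Version >= 5.8. foo(bar).};
my @clauses = split_records($prolog, qr/\./, $doublequoted, $float);
print "[$_]\n" for @clauses;
```

Because the atoms come before the single-character fallback in the alternation, a quotation or a float is consumed whole before the terminator is considered, which is the ordering caveat mentioned above.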

    blokhead

Re: split $data, $unquoted_value;
by jch341277 (Sexton) on Sep 14, 2005 at 20:43 UTC

    I found Text::Sentence that appears to work on your example...

    When executed generates this

    Update: Changed "this is some text." to "This is some text." because the module apparently uses capitalization to identify sentence boundaries. So it might not work for you...

      Tempting, but as you suspected, it doesn't quite work. I actually need this to be able to better parse Prolog programs.

      Cheers,
      Ovid


Re: split $data, $unquoted_value;
by Codon (Friar) on Sep 14, 2005 at 20:40 UTC
    Put this in your split and see how it fits:
    my $re = qr/(?<!["'])(?<=\.)\s*(?!["'])/;
    This will split behind the period, but only if it is not (directly) quoted.

    Edit: added \s* to strip out space between sentences.
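
A short usage sketch for the regex above, run against the text from the original question and against one assumed extra sentence where the period sits in the middle of a quotation. Note that the first lookbehind can never fail: any position that directly follows a period cannot also directly follow a quote, so only the lookahead does the filtering.

```perl
use strict;
use warnings;

my $re = qr/(?<!["'])(?<=\.)\s*(?!["'])/;

# The text from the original question: periods directly followed by a
# quote character are skipped, so this yields four records.
my $data = q{this is some text. A period (".") usually terminates a statement. But not if it's quoted. Regardless of whether or not single quotes, '.', are used.};
my @records = split $re, $data;
print "[$_]\n" for @records;

# Assumed extra case: a period inside a quotation but not next to the
# closing quote is still treated as a terminator.
my @broken = split $re, q{He said "Mr. Ovid" left.};
print "[$_]\n" for @broken;
```

Each record keeps its trailing period because the pattern only consumes the whitespace after it.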

    Ivan Heffner
    Sr. Software Engineer, DAS Lead
    WhitePages.com, Inc.
Re: split $data, $unquoted_value;
by GrandFather (Saint) on Sep 14, 2005 at 20:41 UTC

    A regex that catches common cases is fairly easy. However, handling nasty nested cases with full stops that are not adjacent to a quote heads into CSV territory. Here's the simple case with an example of not handling something a little nastier:

    use warnings;
    use strict;

    my $str = do {local $/ = ''; <DATA>;};
    print join "^", split /(?<!['"])\.(?!['"])/, $str;

    __DATA__
    this is some text.
    A period (".") usually terminates a statement.
    But not if it's quoted.
    Regardless of whether or not single quotes, '.', are used.
    "Quoted sentences with a . in the middle." could be harder to manage however.

    Generates:

    this is some text^
    A period (".") usually terminates a statement^
    But not if it's quoted^
    Regardless of whether or not single quotes, '.', are used^
    "Quoted sentences with a ^ in the middle." could be harder to manage however

    Perl is Huffman encoded by design.
Re: split $data, $unquoted_value;
by QM (Parson) on Sep 14, 2005 at 21:45 UTC
    Do you need to consider periods in decimals and abbreviations?
    Mr. Smith paid Jr. $7.43 for 0.4 lbs. of coffee.
    You'll find that the period is overused and highly context sensitive. There are even ambiguities that cannot be determined contextually, but need semantic hints (or the original author to clarify).

    Perhaps other language designers avoided the period on purpose?

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

      Ah, crud. Yes. I do need to consider decimals (though not abbreviations).

      loves("Mr. Poe", perl) :- version(perl, Version), Version >= 5.8.

      Cheers,
      Ovid


Re: split $data, $unquoted_value;
by kvale (Monsignor) on Sep 14, 2005 at 20:44 UTC
    CSV and, more generally, xSV documents have quoting problems similar to the above, so you might look at the module Text::xSV for pointers. I don't know if Text::xSV will handle quotes around only a portion of a field, but it is a start on the finite state machine that would handle these sorts of problems.
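
A minimal sketch of such a finite state machine for the period case. The name fsm_split and its three states (outside quotes, inside quotes, after a backslash) are assumptions for illustration and are not how Text::xSV is implemented. Every quote character opens a quotation here, so a stray apostrophe as in "it's" would confuse it.

```perl
use strict;
use warnings;

# Hypothetical helper: split $text after each $sep character that is not
# inside '...' or "...". The separator is kept at the end of each record.
sub fsm_split {
    my ($text, $sep) = @_;
    my @records;
    my $current = '';
    my $quote   = '';   # '' outside quotes, otherwise the opening quote char
    my $escaped = 0;    # true right after a backslash

    for my $c (split //, $text) {
        if ($escaped) {                      # escaped char is taken literally
            $current .= $c;
            $escaped = 0;
        }
        elsif ($c eq '\\') {                 # start of an escape sequence
            $current .= $c;
            $escaped = 1;
        }
        elsif ($quote ne '') {               # inside a quotation
            $current .= $c;
            $quote = '' if $c eq $quote;     # matching quote closes it
        }
        elsif ($c eq '"' or $c eq "'") {     # opening quote
            $current .= $c;
            $quote = $c;
        }
        elsif ($c eq $sep) {                 # unquoted separator ends a record
            push @records, $current . $c;
            $current = '';
        }
        else {
            $current .= $c;
        }
    }
    push @records, $current if $current =~ /\S/;   # keep unterminated tail
    return @records;
}

my @r = fsm_split(q{say "Mr. Ovid" twice. then '.' once.}, '.');
print "[$_]\n" for @r;
```

Unlike a lookaround regex, the state machine does not care where inside the quotation the period sits.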

    -Mark

Re: split $data, $unquoted_value;
by Roy Johnson (Monsignor) on Sep 14, 2005 at 21:28 UTC
    You already have a split that does what you're describing in your update.
    @data = split /($RE{quoted})|\./;
    By putting the "unless" portion in capturing parens, they are captured and included with the results. I think that when a dot is encountered, it will generate an empty data item, so you might want to grep out empty results.
    @data = grep length, split /($RE{quoted})|\./;
    Update: Gah, that's not the same thing. You really need to do two passes: split as above, then join anything that isn't separated by an empty string element. Something like:
    my $accum;
    @data = map {
        if (length) { $accum .= $_; () }
        else { my $x = $accum; $accum = ''; $x }
    } split /($RE{quoted})|\./;
    push @data, $accum;
    Caveat: I can't test code today.

    Caution: Contents may have been coded under pressure.

      I also need to capture the split value (in this case, just a period). However, you did remind me of that useful feature of split returning items in capturing parens. If my theoretical Data::Record module comes to light, I can use that to control the chomp-like behavior.

      Cheers,
      Ovid


      Great idea, this works.

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Regexp::Common qw/delimited number/;
      use YAML;

      my $x = 'Foo "ba . r". Baz. $2.67 per pound. is a "." in a sentence really a "."?';
      my @x = split /($RE{delimited}{-delim=>'"'}|$RE{num}{real}|\.)/, $x;
      my @y = ('');
      for (@x) {
          $y[-1] .= $_;
          push @y, '' if $_ eq '.';
      }
      pop @y if $y[-1] eq '';
      print Dump \@y;

      Good Day,
          Dean

Re: split $data, $unquoted_value;
by ruoso (Curate) on Sep 14, 2005 at 22:52 UTC

    I have the same problem when I want to split CSV files into their columns. I concluded that I couldn't do it with a regexp, so I did it the way it would be done without regexps...

    Basically, I read character by character and push the contexts I enter... For instance....

    "This is a column","Yes, I know",12323,23123.23,"This is a \"column\""

    should be split into

    This is a column
    Yes, I know
    12323
    23123.23
    This is a "column"

    ok, no more talking... the code talks by itself...

    #!/usr/bin/perl -w
    use strict;

    my $origin = '"This is a column","Yes, I know",123123,23123.23,"This is a \"column\""';
    my @cols = parse_line($origin);
    print join("\n", @cols)."\n";

    sub parse_line {
        my $line = shift;
        my @contexts;
        my $context = "";
        my $column;
        my @cols;
        my $string_delim = '"';
        my $escape_char = "\\";
        my $field_delim = ',';
        for (my $i = 0; $i < length $line; $i++) {
            my $c = substr($line, $i, 1);
            if ($c eq $string_delim) {
                if ($context eq "string") {
                    # pop, not shift: @contexts is a stack of enclosing contexts
                    $context = pop @contexts;
                } elsif ($context eq "escape") {
                    $column .= $c;
                    $context = pop @contexts;
                } else {
                    push @contexts, $context;
                    $context = "string";
                }
            } elsif ($c eq $escape_char) {
                if ($context eq "escape") {
                    $column .= $c;
                    $context = pop @contexts;
                } else {
                    push @contexts, $context;
                    $context = "escape";
                }
            } elsif ($c eq $field_delim) {
                if ($context eq "string") {
                    $column .= $c;
                } elsif ($context eq "escape") {
                    $column .= $c;
                    $context = pop @contexts;
                } else {
                    push @cols, $column;
                    undef $column;
                }
            } else {
                $column .= $c;
            }
            # the last character also closes the final column
            if ($i == length($line) - 1) {
                push @cols, $column;
                undef $column;
            }
        }
        return @cols;
    }

    daniel
      Ruoso, wouldn't Text::CSV and related modules (XS, PP) work for you? I suppose you know this module, so what am I missing?
      #!/usr/bin/perl -w
      use Text::CSV_XS;

      my $csv = Text::CSV_XS->new( { 'escape_char' => '\\' } );
      my $line = '"This is a column","Yes, I know",12323,23123.23,"This is a \"column\""';
      if ( $csv->parse($line) ) {
          print join "\n", $csv->fields, "\n";
      }
      else {
          print "Could not parse line\n", $csv->error_input, "\n";
      }
      motobói

        Well...

        Good question ;)...

        Maybe I thought it was easier to write the code than to look for a module... how dumb I am... anyway, it was fun to write it... :)

        daniel
Re: split $data, $unquoted_value;
by demerphq (Chancellor) on Sep 15, 2005 at 10:40 UTC

    Here's my go. It assumes that if a quote is used as a quote it won't be surrounded on both sides by an alpha char.

    $_=<<END;
    this is some text.
    A period (".") usually terminates a statement.
    But not if it's quoted.
    Regardless of whether or not single quotes, '.', are used.
    END
    #'
    my $accept=qr/(?:
        (?<![a-z])
        (?: [''] (?:[^''\\]+|\\.)* [''] | [""] (?:[^""\\]+|\\.)* [""] )
        (?![a-z])
      | [^\\''"".]+
      | [''""]
    )+/xi;
    my @parts;
    while (/($accept|([.][\r\n]*))/gxi) {
        push @parts,$1 unless $2;
    }
    print "part1: $_\n" for @parts;
    my @parts2=split / (?!$accept)([.][\r\n]*)/xi, $_;
    print "---\n";
    print "part2: $_\n" for @parts;
    __END__
    part1: this is some text
    part1: A period (".") usually terminates a statement
    part1: But not if it's quoted
    part1: Regardless of whether or not single quotes, '.', are used
    ---
    part2: this is some text
    part2: A period (".") usually terminates a statement
    part2: But not if it's quoted
    part2: Regardless of whether or not single quotes, '.', are used

    Just in case anyone wonders, the [''] trick is harmless, perl just ignores dupes in a char class, but it keeps some syntax highlighting editors (like mine) from getting confused.

    ---
    $world=~s/war/peace/g