A common problem I face when parsing complex data is needing to split it on some value, but only where that value is unquoted. For example, consider the following text.

this is some text. A period (".") usually terminates a statement. But not if it's quoted. Regardless of whether or not single quotes, '.', are used.

It would be nice to split that into four individual records, but simply splitting on a period won't work. The problem is general enough that it would be nice to have a "super split" that splits data into discrete elements, except where the text you are splitting on meets some more complex condition (such as being quoted, in this case).

I haven't seen a module that offers this general functionality, but it's possible I missed something. Can anyone offer suggestions? Something for the specific case would be fine, but a general-purpose solution would be awesome.
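To make the failure concrete, here is a minimal sketch of the naive approach applied to the sample text above. The variable names are only for illustration.

```perl
use strict;
use warnings;

# The sample text from above, built with q{} so both quote styles survive.
my $text = q{this is some text. A period (".") usually terminates a statement. }
         . q{But not if it's quoted. Regardless of whether or not single quotes, '.', are used.};

# Naive split: every period is a delimiter, quoted or not.
my @parts = split /\./, $text;

print scalar(@parts), " pieces\n";
print "[$_]\n" for @parts;
```

This yields six pieces rather than four, because the periods inside `(".")` and `'.'` are treated as delimiters too.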

Update: after reading the replies, a different strategy occurs to me. Supplying an "unless" option would be helpful.

    use Regexp::Common;
    use Data::Record; # doesn't exist

    my $record = Data::Record->new(
        split  => qr/\./,
        unless => $RE{quoted},
    );
    my @data = $record->split($data);

Internally, it would be a bit inefficient in that it would have to read all of the data at once. It would then go through the data, find all text that matches the "unless" regex and contains a match for the "split" regex, and replace each such piece with a unique token that does not match the split regex. Then it could just split the data, iterate over the resulting records, and replace the tokens with the original text. I believe Filter::Simple used a similar strategy.
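The token strategy described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `split_unless` is a hypothetical helper (not the nonexistent Data::Record), the `__KEEP` token format is made up, a hand-written quote pattern stands in for `$RE{quoted}` to avoid a module dependency, and for simplicity it protects every match of the "unless" pattern rather than only those that contain the split pattern. The sample text is altered to say "it is" instead of "it's".

```perl
use strict;
use warnings;

# Hypothetical helper: split $data on $split, but never inside text
# that matches $unless. Assumes the whole input is already in memory
# and that the placeholder token cannot match $split.
sub split_unless {
    my ($data, $split, $unless) = @_;

    # Choose a token prefix that does not already occur in the data.
    my $prefix = '__KEEP';
    $prefix .= '_' while index($data, $prefix) >= 0;

    # Swap each protected span for a numbered placeholder.
    my @saved;
    $data =~ s{($unless)}{
        push @saved, $1;
        $prefix . ':' . $#saved . ':';
    }ge;

    # With the protected text hidden, an ordinary split is safe.
    my @records = split $split, $data;

    # Put the original text back into each record.
    s/\Q$prefix\E:(\d+):/$saved[$1]/g for @records;

    return @records;
}

my $text = q{this is some text. A period (".") usually terminates a statement. }
         . q{But not if it is quoted. Regardless of whether or not single quotes, '.', are used.};

my @records = split_unless($text, qr/\.\s*/, qr/"[^"]*"|'[^']*'/);
print "[$_]\n" for @records;
```

As with a plain split, the delimiting periods are dropped from the records. Note also that a simple quote pattern like this one would treat the apostrophe in "it's" as an opening single quote, so the original sample text would need a smarter "unless" pattern.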

Cheers,
Ovid



In reply to split $data, $unquoted_value; by Ovid
