Ovid has asked for the wisdom of the Perl Monks concerning the following question:

Thanks to the responses on split $data, $unquoted_value;, I've released an alpha of Data::Record to the CPAN. Here's a quick example of a CSV parser which assumes that commas and newlines may both appear inside quotes:

    use Data::Record;
    use Regexp::Common;

    # four lines, but there are only three records! (newline in quotes)
    my $text = <<'END_DATA';
    1,2,"programmer, perl",4,5
    1,2,"programmer,
    perl",4,5
    1,2,3,4,5
    END_DATA

    my $data = Data::Record->new({
        split  => "\n",
        unless => $RE{quoted},
        trim   => 1,
        fields => {
            split  => ",",
            unless => $RE{quoted},
        },
    });

    my @records = $data->records($text);
    foreach my $fields (@records) {
        foreach my $field (@$fields) {
            # do something
        }
    }

The way this works is conceptually simple, but creates a subtle problem. I "mask out" all data which matches the "unless" regex, split on the "split" value and restore the masked data. This makes things relatively simple. However, it does require that I read all of the data in at once.
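The mask/split/restore idea can be sketched in a few lines. This is a minimal illustration of the technique, not Data::Record's actual internals; the `mask_split` name and the NUL-based token are my own for the example:

```perl
use strict;
use warnings;

# Hide anything matching the "unless" regex behind a token, split on
# the delimiter, then restore the masked text in each piece.
sub mask_split {
    my ( $text, $split_on, $unless_re ) = @_;
    my $token = "\0MASK\0";    # assumed not to appear in the data
    my @masked;
    $text =~ s/($unless_re)/push @masked, $1; $token . $#masked . $token/ge;
    my @parts = split /$split_on/, $text;
    s/$token(\d+)$token/$masked[$1]/ge for @parts;
    return @parts;
}

my @fields = mask_split( q{1,2,"programmer, perl",4,5}, qr/,/, qr/"[^"]*"/ );
# the quoted comma survives: the third field is '"programmer, perl"'
```

The same trick applied with "\n" as the delimiter splits records without breaking on quoted newlines.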

It's already been requested that I allow streamed data. That would be a huge benefit, but I'm trying to think of the best way to do that and I'm hoping more experienced monks may be able to offer pointers.

Internally, the module tries to find a "token": a string not found in the text. Once it does, it uses that in the data mask. However, with streaming text, I cannot verify that the token is not in the text, so the user will have to supply the token. Also, if I receive only part of a line, I cannot guarantee that a failed "unless" match is not a false negative, since more data read from the stream may yet satisfy the "unless" regex.
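The token search can be sketched as follows. This is a hedged illustration of the idea; the candidate characters and the `find_token` name are mine, not necessarily what Data::Record actually tries:

```perl
use strict;
use warnings;

# Try strings of an unlikely character repeated six times until one is
# found that does not occur anywhere in the text.
sub find_token {
    my ($text) = @_;
    for my $char ( '~', '`', '|', "\x1f" ) {
        my $candidate = $char x 6;
        return $candidate if index( $text, $candidate ) == -1;
    }
    die "no candidate token absent from the text";
}

my $token = find_token('a~b`c');
# a single '~' in the text does not rule out '~~~~~~', so that candidate wins
```

With a stream, `index` can never be run against the full input, which is why the caller would have to supply a token known to be safe.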

I can't think of a clean way to deal with the latter problem. Does anyone have experience with something like this?

Cheers,
Ovid

Update: fixed typo pointed out by japhy and fishbot_v2.

New address of my CGI Course.

Replies are listed 'Best First'.
Re: Munging Streamed Data (one more re)
by tye (Sage) on Sep 19, 2005 at 20:53 UTC

    You can add a regex for what "unfinished" means. In this case, it would be simply /"/. Then you

    /($skip)|($unfinished)|($delim)/gc

    along the string. Matching $unfinished means that you need more from the stream.
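    A sketch of this scanning approach, with illustrative regexes for simple CSV ($skip is the "complete quoted field" pattern, $unfinished a lone opening quote that $skip could not complete, $delim the record separator); the function name and return convention are mine:

```perl
use strict;
use warnings;

sub scan_buffer {
    my ($buf) = @_;
    my $skip       = qr/"(?:""|[^"]+)*"/;    # a complete quoted field
    my $unfinished = qr/"/;                  # an opening quote with no close yet
    my $delim      = qr/\n/;                 # record separator

    my @record_ends;
    pos($buf) = 0;
    while ( $buf =~ m/\G [^"\n]* (?: ($skip) | ($unfinished) | ($delim) ) /gcx ) {
        return ( \@record_ends, 1 ) if defined $2;    # need more from the stream
        push @record_ends, pos($buf) if defined $3;   # a complete record ends here
    }
    return ( \@record_ends, 0 );
}

my ( $ends, $need_more ) = scan_buffer(qq{1,2,"a,b",3\n4,5});
# one complete record (ending at offset 12); "4,5" just lacks its delimiter
```

    Since $skip is tried before $unfinished at each position, a quote only counts as "unfinished" when no complete quoted field can be matched from it.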

    Note that you also have to be careful with your $skip regex. A typical CSV definition for "quoted" is /"(""|[^"]+)*"/ (update: or even /"(""|[^"]+)*"(?!")/) but, for your module, you'd need (update: not /"(""|[^"]+)*"(?!")/ but) /"(""|[^"]+)*"(?=[^"])/ (until you reach the end of the stream), or otherwise to reject matches that hit the end of your current buffer while there is still data in the stream.

    And for some cases, even hitting close to the end of the buffer needs to be disallowed. So you probably need a configurable "max bytes before end of buffer" value that can default reasonably large (like 4kB) and that you'll probably never have to worry about again.

    Update: Note that this approach means that you don't need to remove things that match $skip so you don't need to come up with a replacement "token" that doesn't appear in the data.

    - tye        

Re: Munging Streamed Data
by graff (Chancellor) on Sep 19, 2005 at 20:14 UTC
    I don't think I'm more experienced, but...

    The way you come up with a "token" string amounts to trying different strings of unlikely characters repeated 6 times (e.g. "~~~~~~", "``````", etc). If the underlying assumption is that the module will always be used to munge text data, why not use odd-ball control characters for this function -- e.g. a string like "\x7f\x1f\x7f\x1f" is quite unlikely to show up in any human-readable text file, but it ought to serve your needs just as well as any "visible" character string. Even null bytes might do the trick.

    In any case, it seems like stream mode will complicate things for you rather a lot. The caller would need to pass a file handle, n'est-ce pas? You'd have to be able to figure out when you've read up to a record boundary (without reading the whole file), which at first guess might involve reading fixed-length buffers and parsing to know whether the buffer ends with a partial record, partial field, or even a partial character (if handling utf8 data).
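    The fixed-length-buffer pattern might look like this. It is deliberately naive: it splits records on a bare "\n", so a quoted newline would be mishandled, and recognizing that case is exactly the hard part under discussion. The in-memory filehandle and tiny buffer size are just to make the carry-over visible:

```perl
use strict;
use warnings;

my $data = "1,2,3\n4,5,6\n7,8";
open my $fh, '<', \$data or die $!;    # in-memory filehandle for the demo

my $buffer = '';
my @records;
while ( read( $fh, my $chunk, 4 ) ) {    # tiny buffer to force carries
    $buffer .= $chunk;
    while ( $buffer =~ s/^(.*?)\n//s ) {
        push @records, $1;               # a complete record
    }
}
push @records, $buffer if length $buffer;    # final, delimiter-less record
# @records now holds '1,2,3', '4,5,6', '7,8'
```

    The real work would go where the inner s/// is: deciding whether the buffer's tail is a partial record that must wait for the next read.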

    (Or maybe you could set $/ based on the user-supplied "split" value -- except the latter can be a regex, which won't work for $/; and in any case, you still need to parse to know when a read buffer ends in mid-record because the given $/ instance happened to be within a quoted field.)