Ovid has asked for the wisdom of the Perl Monks concerning the following question:
Thanks to the responses on split $data, $unquoted_value;, I've released an alpha of Data::Record to the CPAN. Here's a quick example of a CSV parser which assumes that commas and newlines may both be in quotes:
    use Data::Record;
    use Regexp::Common;

    # four lines, but there are only three records! (newline in quotes)
    my $text = <<'END_DATA';
    1,2,"programmer, perl",4,5
    1,2,"programmer,
    perl",4,5
    1,2,3,4,5
    END_DATA

    my $data = Data::Record->new({
        split  => "\n",
        unless => $RE{quoted},
        trim   => 1,
        fields => {
            split  => ",",
            unless => $RE{quoted},
        },
    });

    my @records = $data->records($text);
    foreach my $fields (@records) {
        foreach my $field (@$fields) {
            # do something
        }
    }
The way this works is conceptually simple, but creates a subtle problem. I "mask out" all data which matches the "unless" regex, split on the "split" value and restore the masked data. This makes things relatively simple. However, it does require that I read all of the data in at once.
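The mask/split/restore idea can be sketched like this (a minimal illustration only, not Data::Record's actual code; the token here is simply assumed to be absent from the data):

```perl
use strict;
use warnings;

# Sketch of the mask-out technique: replace quoted sections with a
# token before splitting, split on the record separator, then restore
# the masked sections in order.
my $text  = qq{1,2,"a,b",3\n4,"c\nd",5\n};
my $token = "\0MASK\0";    # assumed not to occur in the data

# Mask: stash each quoted section and leave the token in its place.
my @masked;
( my $safe = $text ) =~ s/("[^"]*")/push @masked, $1; $token/ge;

# Split on newlines -- now safe, since quoted newlines are masked.
my @records = split /\n/, $safe;

# Restore: put the stashed sections back, in order.
for my $record (@records) {
    $record =~ s/\Q$token\E/shift @masked/ge;
}
print "$_\n" for @records;
```

Note the second record still contains its embedded newline after restoration, which is exactly why the split had to happen on the masked text.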
It's already been requested that I allow streamed data. That would be a huge benefit, but I'm trying to think of the best way to do that and I'm hoping more experienced monks may be able to offer pointers.
Internally, the module tries to find a "token": a string not found in the text. Once it has one, it uses that in the data mask. With streaming text, however, I cannot verify that the token is absent from the text, so the user will have to supply the token. Also, if I receive only part of a line, a failed "unless" match may be a false negative, since more data read from the stream may yet satisfy the "unless" regex.
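The token hunt can be sketched like this (an illustrative stand-in, not the module's actual implementation):

```perl
use strict;
use warnings;

# Keep growing a candidate string until it no longer occurs in the
# data.  This only works when the complete text is in hand.
sub find_unused_token {
    my ($text) = @_;
    my $token = "\0";
    $token .= "\0" while index( $text, $token ) != -1;
    return $token;
}

my $text  = qq{1,2,"a,b",3\n};
my $token = find_unused_token($text);
# With a stream there is no complete $text to check against, so the
# token would have to come from the caller instead.
```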
I can't really think of any clean way of easily dealing with the latter problem. Does anyone have any experience with something like this?
Replies are listed 'Best First'.
Re: Munging Streamed Data (one more re)
by tye (Sage) on Sep 19, 2005 at 20:53 UTC
Re: Munging Streamed Data
by graff (Chancellor) on Sep 19, 2005 at 20:14 UTC