
Thanks to the responses on split $data, $unquoted_value;, I've released an alpha of Data::Record to the CPAN. Here's a quick example of a CSV parser which assumes that commas and newlines may both be in quotes:

  use Data::Record;
  use Regexp::Common;
  # four lines, but there are only three records! (newline in quotes)
  my $text = <<'END_DATA';
  1,2,"programmer, perl",4,5
  1,2,"programmer,
  perl",4,5
  1,2,3,4,5
  END_DATA
  
  my $data = Data::Record->new({
      split  => "\n",
      unless => $RE{quoted},
      trim   => 1,
      fields => {
          split  => ",",
          unless => $RE{quoted},
      }
  });

  my @records = $data->records($text);
  foreach my $fields (@records) {
      foreach my $field (@$fields) {
          # do something
      }
  }

The way this works is conceptually simple, but it creates a subtle problem. I "mask out" all data which matches the "unless" regex, split on the "split" value, and then restore the masked data. That keeps the code relatively simple, but it requires that I read all of the data in at once.
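To make the mask/split/restore idea concrete, here's a minimal sketch. It is not the module's actual code: split_unless is a hypothetical helper, the '~~' token is simply assumed not to occur in the data, and the quoted regex is a simplified stand-in for $RE{quoted} so the snippet runs without Regexp::Common.

```perl
use strict;
use warnings;

# Hypothetical helper, not Data::Record's internals: split $text on $split,
# ignoring any separator that falls inside a match of $unless.
# $token is assumed not to occur anywhere in $text.
sub split_unless {
    my ($text, $split, $unless, $token) = @_;

    # 1. Mask: replace each protected chunk with token + index + token.
    my @masked;
    $text =~ s{($unless)}{ my $i = push(@masked, $1) - 1; "$token$i$token" }ge;

    # 2. Split: separators inside protected chunks are hidden by the mask.
    my @parts = split /\Q$split\E/, $text;

    # 3. Restore: swap each placeholder back for the chunk it stands for.
    s/\Q$token\E(\d+)\Q$token\E/$masked[$1]/g for @parts;

    return @parts;
}

my $quoted = qr/"[^"]*"/;    # simplified stand-in for $RE{quoted}
my $text   = qq{1,2,"programmer,\nperl",4,5\n1,2,3,4,5\n};

# Records first, then fields within each record.
for my $record (split_unless($text, "\n", $quoted, '~~')) {
    my @fields = split_unless($record, ',', $quoted, '~~');
    print scalar(@fields), " fields\n";
}
```

Note that step 1 runs the regex over the entire string before any splitting happens, which is exactly why the whole input has to be in memory.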

It's already been requested that I allow streamed data. That would be a huge benefit, but I'm trying to think of the best way to do that and I'm hoping more experienced monks may be able to offer pointers.

Internally, the module tries to find a "token": a string that does not occur in the text. Once it has one, it uses that token in the data mask. With streaming text, however, I cannot verify that the token is absent from text I have not read yet, so the user will have to supply the token. Also, if I receive only part of a line, I cannot guarantee that a failed "unless" match is not a false negative, since more data read from the stream may satisfy the "unless" regex.
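Both issues are easy to show in a few lines. This is only a sketch under my own assumptions: find_token is a hypothetical helper (the module may pick its token differently), and the quoted regex is again a simplified stand-in for $RE{quoted}.

```perl
use strict;
use warnings;

# Hypothetical helper: return a string that does not occur in $text by
# growing a run of tildes until it is absent. This only works when the
# whole text is available, which is the problem with a stream.
sub find_token {
    my ($text) = @_;
    my $token = '~';
    $token .= '~' while index($text, $token) >= 0;
    return $token;
}

print find_token("a,b,c"), "\n";       # prints "~"
print find_token("a~b,c~~d"), "\n";    # prints "~~~"

# The false negative: a chunk that ends inside the quotes does not match,
# but the same data matches once the rest of the stream arrives.
my $quoted = qr/"[^"]*"/;              # simplified stand-in for $RE{quoted}
my $chunk  = qq{1,2,"programmer,};
print "partial chunk:  ", ($chunk =~ $quoted ? "match" : "no match"), "\n";

$chunk .= qq{\nperl",4,5\n};
print "with more data: ", ($chunk =~ $quoted ? "match" : "no match"), "\n";
```

The first chunk reports "no match" and the extended one reports "match", so a split made after seeing only the first chunk would break the record in the wrong place.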

I can't really think of any clean way of easily dealing with the latter problem. Does anyone have any experience with something like this?

Cheers,
Ovid

Update: fixed typo pointed out by japhy and fishbot_v2.

New address of my CGI Course.

In reply to Munging Streamed Data by Ovid
