Re4: Comma separated list into a hash

Oooh, close. It is a good regex, but it suffers from the following issues:

1. As you say, it won't handle malformed data. A major part of a parser's job is to detect data that doesn't conform to the specification. Parsing XML would be easier with a regex if you didn't have to handle error conditions ... *grins*
2. If your element is surrounded by "s, then an embedded " is encoded as "".
3. You assume that the element will be surrounded by double-quotes, but single-quotes / apostrophes are also legal.
4. Embedded newlines are also legal, but your regex won't handle them. (Text::CSV doesn't handle them, either, but Text::xSV does.)
5. This is a nit, but you don't handle whitespace at the end of the line. A simple \s* would handle that.
6. You don't handle whitespace between the closing double-quote and the comma.
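
To make a few of those points concrete, here is a rough sketch (not a replacement for a CPAN module) of a hand-rolled field splitter that copes with doubled quotes, whitespace around a quoted field, embedded newlines, and at least some malformed input. The name parse_csv_line is made up for illustration, it only understands double-quoted fields (single quotes are not handled), and it assumes the caller passes in one complete record.

```perl
use strict;
use warnings;

# Hypothetical helper: split one complete CSV record into its fields.
# Assumes double-quote quoting with "" as the embedded-quote escape.
sub parse_csv_line {
    my ($line) = @_;
    my @fields;
    while (1) {
        # Quoted field: optional whitespace, "...", optional whitespace.
        # [^"] also matches a newline, so embedded newlines are accepted.
        if ($line =~ s/\A \s* " ( (?: [^"] | "" )* ) " \s* //x) {
            (my $field = $1) =~ s/""/"/g;          # undo the "" escape
            push @fields, $field;
        }
        else {
            # Bare field: everything up to the next comma or quote.
            $line =~ s/\A ( [^",]* ) //x;
            (my $field = $1) =~ s/\A\s+|\s+\z//g;  # trim both ends
            push @fields, $field;
        }
        last if $line eq '';
        # Anything other than a comma here means the record is malformed.
        $line =~ s/\A,// or die "Malformed CSV near: $line\n";
    }
    return @fields;
}

my @fields = parse_csv_line(qq{one,"two ""2""" ,three  });
print join('|', @fields), "\n";
```

An unterminated quote such as `"oops,field` or stray text after a closing quote such as `"a"b,c` makes it die rather than silently returning garbage, which is the error-detection half of the job.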

------
We are the carpenters and bricklayers of the Information Age.
Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

Replies are listed 'Best First'.

Re5: Comma separated list into a hash
by revdiablo (Prior) on Apr 26, 2004 at 20:24 UTC

suffers from the following issues

Oh yeah, I'm sure there are plenty of problems. I wasn't really attempting to build a general-purpose CSV parser, just demonstrating that it's not too terribly difficult to handle this kind of thing (for some values of "handle") with a regex.

A major part of a parser's job is to detect data that doesn't conform to the specification.

Indeed. That's where a single regex solution generally falls down flat. Perhaps one could make a pre-scanner that looks for problems ahead of time, but for handling arbitrary, user-supplied data, a real parser should be built (or grabbed from CPAN, as it were).
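
As one guess at what such a pre-scanner might look like, the sketch below only counts double quotes. The name quotes_balanced is invented here, and it assumes the "" escaping convention, under which a well-formed record always contains an even number of double quotes. An odd count therefore proves the record is broken; an even count proves nothing, so this is a cheap first filter and not a validator.

```perl
use strict;
use warnings;

# Hypothetical pre-scan: returns true when the record holds an even
# number of double quotes. An odd count means an unterminated quote.
sub quotes_balanced {
    my ($record) = @_;
    my $count = () = $record =~ /"/g;   # count matches via list assignment
    return $count % 2 == 0;
}

for my $record ('a,"b ""c""",d', 'a,"b,c') {
    printf "%-16s %s\n", $record, quotes_balanced($record) ? 'ok' : 'suspect';
}
```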

Just as a side note, I use this same technique to parse Apache logs. It's simply a matter of my @logentry = /("[^"]+"|\[[^\]]+\]|\S+)/g; and the log entry is split up nicely. Notice it handles both quote-delimited and square-bracket-delimited chunks. It looks messy, but it's dead simple. Perhaps one could even use variables to make it more readable:

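# Each kind of chunk gets its own named pattern; /x lets the spacing line up.
my $quoted    = qr/"  [^"]+   "/x;   # "..." chunk, quotes included
my $bracketed = qr/\[ [^\]]+ \]/x;   # [...] chunk, brackets included
my $bare      = qr/   \S+      /x;   # any other run of non-whitespace

while (<LOGFILE>) {
  # In list context, /g returns every chunk found in the current line ($_).
  my @logentry = /($quoted|$bracketed|$bare)/g;
}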
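
For anyone who wants to see what comes out, here is the same idea run against a single line instead of a filehandle. The wrapper name split_log_entry is made up for this example, and the sample is an entry in Apache's Common Log Format; real logs with a different LogFormat may split differently.

```perl
use strict;
use warnings;

my $quoted    = qr/" [^"]+ "/x;
my $bracketed = qr/\[ [^\]]+ \]/x;
my $bare      = qr/ \S+ /x;

# Hypothetical wrapper: return the chunks of one log line as a list.
# Must be called in list context, since it relies on list-context /g.
sub split_log_entry {
    my ($line) = @_;
    return $line =~ /($quoted|$bracketed|$bare)/g;
}

my $sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
           . '"GET /apache_pb.gif HTTP/1.0" 200 2326';

my @logentry = split_log_entry($sample);
print scalar(@logentry), " fields\n";
print "  $_\n" for @logentry;
```

Note that the delimiters stay attached: the timestamp comes back with its square brackets and the request with its double quotes, so strip them afterwards if you only want the contents.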

Hopefully I haven't strayed too far off the point. Not that anyone will probably read this deeply into the thread anyway, but oh well. 8^)
