Your approach grabs words. The issue the OP is attempting to solve is to grab stuff between commas. A better approach would have been to do something like:
my @words = /([^,]+)/g;
But that suffers from the same problem split does when dealing with CSV data: how to handle commas that belong in a data value. This is why regexen and split are poor choices for CSV data. A parser is the appropriate tool.
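
To make the failure concrete, here is a minimal sketch that compares the regex above with a quote-aware parser. The sample line is invented for illustration. Text::ParseWords (a core module) stands in for a parser here only because it ships with Perl; it is not a full CSV parser (it does not, for example, treat "" as an escaped quote), so a dedicated CSV module from CPAN is the better choice for real data.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);   # core module; a stand-in, not a full CSV parser

# Hypothetical sample data with a comma inside a quoted value.
my $line = 'apple,"pears, sliced",cherry';

# The regex splits on every comma, including the quoted one.
my @naive = $line =~ /([^,]+)/g;

# parse_line(delimiter, keep_quotes, line) keeps the quoted value together.
my @parsed = parse_line(',', 0, $line);

print scalar(@naive),  " fields from the regex\n";
print scalar(@parsed), " fields from parse_line\n";
print "second field: $parsed[1]\n";
```

The regex reports one field too many because it cuts the quoted value in two at its embedded comma.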

------
We are the carpenters and bricklayers of the Information Age.

Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

Re: Re: Re: Comma separated list into a hash
by revdiablo (Prior) on Apr 26, 2004 at 17:55 UTC
    how to handle commas that belong in the data value

    If quotes are used to disambiguate, it's fairly easy to parse with a regex:

    Of course, that suffers from other problems, such as going all wonky with unbalanced quotes. But it's a fairly simple way to parse well-formed data, and I thought it might be helpful for some to look at.
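
The regex from this post is not preserved in the text above. As a rough sketch of the kind of pattern being described (an assumption, modeled on the alternation used for Apache logs later in the thread), a quoted chunk can be tried first and a bare run of non-commas second. The sample lines are invented.

```perl
use strict;
use warnings;

# Well-formed input: the quoted alternative is tried before the bare one.
my $good = 'apple,"pears, sliced",cherry';
my @fields = $good =~ /("[^"]+"|[^,]+)/g;
s/^"|"$//g for @fields;            # strip the surrounding quotes
print join(' | ', @fields), "\n";

# Unbalanced quote: the quoted alternative fails, so the bare one
# takes over and the value is silently split at its comma.
my $bad = 'apple,"pears, sliced,cherry';
my @wonky = $bad =~ /("[^"]+"|[^,]+)/g;
print scalar(@wonky), " fields from the unbalanced line\n";
```

Note that the second line produces no error at all, only wrong fields, which is the "going all wonky" case.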

      Oooh, close. It is a good regex, but it suffers from the following issues:
      1. As you say, it won't handle malformed data. A major part of a parser's job is to detect data that doesn't conform to the specification. Parsing XML would be easier with a regex if you didn't have to handle error conditions ... *grins*
      2. If your element is surrounded by "'s, then an embedded " is encoded as "".
      3. You assume that the element will be surrounded by double-quotes, but single-quotes / apostrophes are also legal.
      4. Embedded newlines are also legal, but your regex won't handle them. (Text::CSV doesn't handle them, either, but Text::xSV does.)
      5. This is a nit, but you don't handle whitespace at the end of the line. A simple \s* would handle that.
      6. You don't handle whitespace between the closing double-quote and the comma.

        suffers from the following issues

        Oh yeah, I'm sure there are plenty of problems. I wasn't really attempting to build a general-purpose CSV parser, just demonstrating that it's not too terribly difficult to handle this kind of thing (for some values of "handle") with a regex.

        A major part of a parser's job is to detect data that doesn't conform to the specification.

        Indeed. That's where a single regex solution generally falls down flat. Perhaps one could make a pre-scanner that looks for problems ahead of time, but for handling arbitrary, user-supplied data, a real parser should be built (or grabbed from CPAN, as it were).
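
One possible shape for such a pre-scanner, as a minimal sketch: reject any line with an odd number of double quotes before handing it to the regex. The helper name quotes_balanced, the sample lines, and the quote-aware regex are all assumptions made for illustration. The check catches only this one class of problem and does not replace a real parser.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the line has an even number of double quotes.
sub quotes_balanced {
    my ($line) = @_;
    my $count = () = $line =~ /"/g;   # count the matches via list context
    return $count % 2 == 0;
}

for my $line ('a,"b,c",d', 'a,"b,c,d') {
    if (quotes_balanced($line)) {
        my @fields = $line =~ /("[^"]+"|[^,]+)/g;
        print scalar(@fields), " fields: $line\n";
    }
    else {
        warn "skipping line with unbalanced quotes: $line\n";
    }
}
```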

        Just as a side note, I use this same technique to parse Apache logs. It's simply a matter of my @logentry = /("[^"]+"|\[[^\]]+\]|\S+)/g; and the log entry is split up nicely. Notice it handles both quote-delimited and square-bracket-delimited chunks. It looks messy, but it's dead simple. Perhaps one could even use variables to make it more readable:

        my $quoted    = qr/" [^"]+ "/x;
        my $bracketed = qr/\[ [^\]]+ \]/x;
        my $bare      = qr/ \S+ /x;

        while (<LOGFILE>) {
            my @logentry = /($quoted|$bracketed|$bare)/g;
        }
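
To see what the three patterns produce, here is a small self-contained run against a single line instead of a filehandle. The sample entry is an assumption, written in Apache's common log format. Real logs may contain cases, such as empty quoted strings, that these patterns do not cover.

```perl
use strict;
use warnings;

my $quoted    = qr/" [^"]+ "/x;
my $bracketed = qr/\[ [^\]]+ \]/x;
my $bare      = qr/ \S+ /x;

# Hypothetical log entry in common log format.
my $entry = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          . '"GET /apache_pb.gif HTTP/1.0" 200 2326';

my @logentry = $entry =~ /($quoted|$bracketed|$bare)/g;

# The timestamp and the request each stay in one piece, delimiters included.
print "$_\n" for @logentry;
```

Each qr// object carries its own /x flag, so the outer match needs no /x for the whitespace inside the patterns to be ignored.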

        Hopefully I haven't strayed too far off the point. Not that anyone will probably read this deeply into the thread anyway, but oh well. 8^)