jblakey has asked for the wisdom of the Perl Monks concerning the following question:

Oh great monks,

I seek an answer to this puzzling regex query, for I myself have meditated on it for a number of hours and remain unenlightened.

I wish to break a line of data into its parts. Here are some sample lines.

"one"," one,two",3,"a like <A HREF="foo"> b</A>"
0,0,0,0

Having the ,'s and "'s inside the exterior "'s makes this somewhat tricky for me.
Any comments?
Thanks jason

Replies are listed 'Best First'.
Re: Regex Conundrum
by SciDude (Friar) on Jun 12, 2004 at 19:38 UTC

    There is a standard solution to this problem, mostly from Mastering Regular Expressions:

    # You need to match a double quoted string with the following regex
    #     "[^"\\]*(\\.[^"\\]*)*",?
    #
    # But to get the text between double quotes use some ( )
    #     "([^"\\]*(\\.[^"\\]*)*)",?    # gets text inside quotes as $1
    #
    # but you also have non quoted fields, thus
    #     ([^,]+),?
    # which should match things optionally followed by a comma
    #
    # and then a match for separation commas
    #     ,
    #
    # this must be repeated with m/.../g

    Before attempting this yourself, take a look at Text::ParseWords and the quotewords routine. This should solve your problem. If the module is not available to you then the following untested code from Mastering Regular Expressions should work:

    @fields = ();
    while ($text =~ m/"([^"\\]*(\\.[^"\\]*)*)",?|([^,]+),?|,/g) {
        # $1 is set for a quoted field, $3 for an unquoted one
        push(@fields, defined($1) ? $1 : $3);
    }
    # Account for the special case of an empty last field.
    push(@fields, undef) if $text =~ m/,$/;
    # all data is now in @fields

    Note: untested.
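
    For the Text::ParseWords route mentioned above, here is a minimal sketch. The sample lines are assumptions: they are simplified from the question and deliberately leave out the unescaped inner quotes, so this only illustrates the well-formed case. Note that quotewords also treats single quotes as quoting characters, which may or may not suit your data.

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

# Assumed sample lines: like the question's data, minus the inner quotes.
my @lines = (
    q{"one"," one,two",3,"plain"},
    q{0,0,0,0},
);

# quotewords(delimiter pattern, keep flag, line):
# a keep flag of 0 strips the surrounding quotes from each field.
my @parsed = map { [ quotewords(',', 0, $_) ] } @lines;

for my $fields (@parsed) {
    print join(' | ', map { "[$_]" } @$fields), "\n";
}
```

    Each element of @parsed is an array reference holding the fields of one line, so the embedded comma in the second field stays inside that field.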


    SciDude
    The first dog barks... all other dogs bark at the first dog.
Re: Regex Conundrum
by Nkuvu (Priest) on Jun 12, 2004 at 20:46 UTC

    It seems to me that you're trying to parse comma-delimited entries. Each entry may or may not be quoted. If quoted, the entry may contain commas. Sounds like a good job for Text::xSV or similar.

    Keep in mind that most languages use escape routines to indicate embedded delimiters. Perl, for example, would require that the string in question be something like "a like <A HREF=\"foo\"> b</A>" in order to avoid bareword errors on foo.
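
    As a small illustration of that escaping point, assuming you are writing the string as a literal in Perl source: the backslash form and the q{} form below produce the same string. The variable names are only for illustration.

```perl
use strict;
use warnings;

# Inner double quotes escaped with backslashes inside a double-quoted literal.
my $escaped = "a like <A HREF=\"foo\"> b</A>";

# The same text using the generic single-quote operator, no escapes needed.
my $generic = q{a like <A HREF="foo"> b</A>};

print "identical\n" if $escaped eq $generic;
```

    This only concerns literals in source code. Data read from a file arrives as-is, so the inner quotes there have to be handled by the parser.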

Re: Regex Conundrum
by sweetblood (Prior) on Jun 12, 2004 at 20:28 UTC
    You can try Text::Balanced; however, there are some problems to look out for. One is if your data contain double quotes inside your quoted fields, i.e.:
    2,5,"foobar","tape 2" white", 0122435992020
                        ^
    This double quote used to indicate inches will break most parsing methods.

    Of course if you don't have data like that, you'll be fine.
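
    If you are unsure whether your data has this problem, one rough check (a sketch, not a validator) is to flag lines with an odd number of double quotes before parsing them. The helper name has_unbalanced_quotes is hypothetical, and the second sample line is an assumed "clean" variant of the first.

```perl
use strict;
use warnings;

# Hypothetical helper: true when a line holds an odd number of double quotes.
sub has_unbalanced_quotes {
    my ($line) = @_;
    my $count = () = $line =~ /"/g;    # list assignment counts the matches
    return $count % 2;
}

my $bad  = q{2,5,"foobar","tape 2" white", 0122435992020};
my $good = q{2,5,"foobar","tape 2 white", 0122435992020};    # assumed clean variant

for my $line ($bad, $good) {
    print has_unbalanced_quotes($line) ? "suspect: " : "ok:      ", $line, "\n";
}
```

    An even count does not prove a line is well formed. The HREF="foo" sample from the question has an even number of quotes and would pass this check, so treat it only as a first filter.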

    Best of Luck

    Cheers

    Sweetblood

Re: Regex Conundrum
by davido (Cardinal) on Jun 13, 2004 at 04:46 UTC
    It looks to me like you're trying to parse Text from Comma Separated Values. Hmm...I wonder if anyone's ever done this before?? Hmm... CPAN might be a good place to start looking... Wha? What's this? ...Text::CSV! Eureka, you're not the first person to have such a need! (I'm teasing, of course).

    Here's an example straight from the Text::CSV POD:

    use Text::CSV;
    $csv = Text::CSV->new();         # create a new object
    $status = $csv->parse($line);    # parse a CSV string into fields
    @columns = $csv->fields();       # get the parsed fields

    Don't use a regex to do what can be done more reliably with a tried and proven module. Text::CSV is the right tool for the job. You'll find it even properly handles commas embedded in quoted strings.


    Dave

      Better is Text::xSV. In addition to allowing any character as the separator (instead of hard-coding comma), it actually takes into account embedded newlines, which Text::CSV doesn't do. Plus, it's a pure-perl solution that's almost as fast as Text::CSV_XS. And, tilly wrote it. :-)

      ------
      We are the carpenters and bricklayers of the Information Age.

      Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

      I shouldn't have to say this, but any code, unless otherwise stated, is untested

        Great advice. ...a new favorite module. :)


        Dave

Re: Regex Conundrum
by snax (Hermit) on Jun 12, 2004 at 18:48 UTC
    You're not terribly clear in your goals but if you just want the pieces in an array, use
    my $data = q("one"," one,two",3,"a like <A HREF="foo"> b</A>");
    my @pieces = split(q(,), $data);
    This assumes that you have a comma-delimited set of data, which appears to be true from your sample.
    Update:
    I need to be more careful, don't I? Sorry that I didn't see the embedded comma. As a result, my suggestion is worthless.
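
    To make the failure concrete, here is a short sketch that runs the plain split on the first sample line from the question. The second field contains a comma, so it is cut in two and the line yields five pieces instead of four.

```perl
use strict;
use warnings;

my $data   = q("one"," one,two",3,"a like <A HREF="foo"> b</A>");
my @pieces = split /,/, $data;

# Show how many pieces came back and what each one looks like.
printf "%d pieces\n", scalar @pieces;
print "  [$_]\n" for @pieces;
```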
Re: Regex Conundrum
by injunjoel (Priest) on Jun 13, 2004 at 01:30 UTC
    Greetings all,
    Well here is my home-brew solution so to speak.
    #!/usr/bin/perl -w
    use strict;
    use Dumpvalue;

    my $dumper = new Dumpvalue;
    my @split_elms = map{
        chomp;
        my $line = $_;
        my @data = map{
            my $substr = $_;
            $substr =~ s/,/:innerc:/g;
            $line =~ s/\Q$_\E/$substr/;
        } $line =~ /("[^"]*")/g;
        @data = map{$_ =~ s/:innerc:/,/g; $_} split(/,/,$line);
        \@data;
    } <DATA>;
    $dumper->dumpValue(\@split_elms);
    exit;
    __DATA__
    "one"," one,two",3,"a like <A HREF="foo"> b</A>"
    0,0,0,0

    it outputs
    0  ARRAY(0x81a89dc)
       0  '"one"'
       1  '" one,two"'
       2  3
       3  '"a like <A HREF="foo"> b</A>"'
    1  ARRAY(0x8151afc)
       0  0
       1  0
       2  0
       3  0

    That was fun... I hope that helps a bit. Of course it still fails to catch things like a single " (like for inches) but you get the general idea. Nothing novel.
    -Injunjoel