dsheroh has asked for the wisdom of the Perl Monks concerning the following question:

Yes, I know this is an oft-used wheel. I know it's out there (and probably even on this very site), but my searches have proved futile.

I have a data file that looks something like this:
"72 3267S" "S2079" 1 no
"72 0250" "S3011" 1 no
"72 8351S" "S6101" 1 no
"72 17082S" "S6108" 1 no

Space-delimited, with the possibility of spaces in the data and surrounded by double quotes. (This sample has a space in the first column of every row and none in the second column, but this cannot be relied on in general.)

I figure there's got to be a way to do this other than building a state machine and walking the string a character at a time but split/regexes aren't up to the task. So what's the preferred method?

Replies are listed 'Best First'.
Re: Splitting strings with enclosed delimiters
by insensate (Hermit) on Oct 03, 2002 at 18:31 UTC
    How about this:
    use Text::ParseWords;
    use strict;

    while (<DATA>) {
        my @line = quotewords(" ", 0, $_);
        print join "\n", @line;
    }
    __DATA__
    "72 3267S" "S2079" 1 no
    "72 0250" "S3011" 1 no
    "72 8351S" "S6101" 1 no
    "72 17082S" "S6108" 1 no

    -Jason
      Very slick and just as straightforward as a split (although quotewords isn't as obvious about what it does). Thanks!
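
Since quotewords is less self-explanatory than split, here is a minimal sketch of what its arguments do, assuming the core Text::ParseWords module. The first argument is the delimiter pattern, the second is the "keep" flag, and the rest are the strings to split. The variable names are only illustrative.

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

my $line = '"72 3267S" "S2079" 1 no';

# Arguments: delimiter pattern, keep flag, then the string(s) to split.
my @stripped = quotewords(' ', 0, $line);   # keep = 0: quotes are removed
my @kept     = quotewords(' ', 1, $line);   # keep = 1: quotes are retained

print join('|', @stripped), "\n";
print join('|', @kept), "\n";
```

With keep set to 0 the quoted fields come back without their quotes, which is usually what you want for data like this.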
Re: Splitting strings with enclosed delimiters
by fglock (Vicar) on Oct 03, 2002 at 18:26 UTC
Re: Splitting strings with enclosed delimiters
by Corion (Patriarch) on Oct 03, 2002 at 18:29 UTC

    I'd still use a regex to collect the contents of the lines :

    use strict;

    my $line;
    while ($line = <DATA>) {
        chomp $line;
        print $line, "\n";
        while ($line =~ s/(?:^([^" ]+)|^"([^"]+)")(?: |$)//) {
            # Test definedness rather than truth so a field of "0" is not lost.
            print ">>>", (defined $1 ? $1 : $2), "\n";
            print $line, "\n";
        }
    }
    __DATA__
    "72 3267S" "S2079" 1 no
    "72 0250" "S3011" 1 no
    "72 8351S" "S6101" 1 no
    "72 17082S" "S6108" 1 no

    The regex nibbles one field at a time from the start of each line: either a run of characters that are neither quotes nor spaces, or a quoted string up to and including its closing quote. In both cases it also removes the separating space that follows. At the end of a line, no trailing space is required.

    perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider ($c = $d->accept())->get_request(); $c->send_response( new #in the HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
      Nibbling regexes are expensive. Change it to a scalar m/\G.../g walk, and you have a winner:
      use strict;

      my $line;
      while ($line = <DATA>) {
          chomp $line;
          print $line, "\n";
          while ($line =~ m/\G(?:([^" ]+)|"([^"]+)")(?: |$)/g) {
              # Test definedness rather than truth so a field of "0" is not lost.
              print ">>>", (defined $1 ? $1 : $2), "\n";
              print $line, "\n";
          }
      }
      __DATA__
      "72 3267S" "S2079" 1 no
      "72 0250" "S3011" 1 no
      "72 8351S" "S6101" 1 no
      "72 17082S" "S6108" 1 no
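
The same \G walk can be wrapped in a sub that returns the fields as a list. This is only a sketch: split_fields is a hypothetical name, the pattern is loosened to "([^"]*)" so that an empty quoted field is accepted, and the /c modifier is added so that pos() survives the final failed match and leftover input can be detected.

```perl
use strict;
use warnings;

# split_fields is a hypothetical helper name, not something from the thread.
sub split_fields {
    my ($line) = @_;
    chomp $line;
    my @fields;
    pos($line) = 0;
    # /g in scalar context resumes at pos(); /c keeps pos() after a failed match.
    while ($line =~ m/\G(?:([^" ]+)|"([^"]*)")(?: |$)/gc) {
        # defined() rather than || so that a field of "0" is kept.
        push @fields, defined $1 ? $1 : $2;
    }
    # Anything the pattern could not consume counts as malformed input.
    die "unparsed input at offset " . pos($line) . "\n"
        if pos($line) < length $line;
    return @fields;
}

print join('|', split_fields(qq{"72 3267S" "S2079" 0 no\n})), "\n";
```

The definedness test matters for data like this: with $1||$2, an unquoted field of 0 would be silently replaced by the undefined $2.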

      -- Randal L. Schwartz, Perl hacker

Re: Splitting strings with enclosed delimiters
by hossman (Prior) on Oct 03, 2002 at 21:43 UTC

    Text::ParseWords::quotewords will work given the few sample lines of input you have provided, but it makes certain assumptions about your data format:

    • that a space can be escaped with a backslash (i.e., "\ ") and is then treated as a "word character".
    • that a double quote can be escaped with a backslash (i.e., \") and is then treated as a "word character".
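
A small sketch of the two assumptions listed above, using only the core Text::ParseWords module. The input strings are made up for illustration and are not from the original data file.

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

# Single-quoted Perl strings, so the backslashes reach quotewords unchanged.
my @escaped_space = quotewords(' ', 0, 'a\ b c');
my @escaped_quote = quotewords(' ', 0, '"say \"hi\"" x');

print scalar(@escaped_space), " fields: ", join('|', @escaped_space), "\n";
print scalar(@escaped_quote), " fields: ", join('|', @escaped_quote), "\n";
```

In both cases the backslash is consumed and the escaped character stays inside the field, so data containing literal backslashes will not come through unchanged.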

    If these assumptions don't match your data, you might want to take a look at Text::CSV_XS. Its constructor lets you specify attributes such as the quote character, the separator character, and the escape character.
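
A minimal sketch of the Text::CSV_XS route, assuming the module is installed from CPAN (it is not in the core distribution). The attribute values are assumptions chosen for the sample data: a space as separator, a double quote as quote character, and a doubled quote as the escape. parse_record is a hypothetical helper name.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Attribute values below are assumptions picked to fit the sample data.
my $csv = Text::CSV_XS->new({
    sep_char    => ' ',   # fields are separated by a single space
    quote_char  => '"',   # fields may be wrapped in double quotes
    escape_char => '"',   # a literal quote inside a field is written as ""
    binary      => 1,
}) or die "Cannot construct Text::CSV_XS: " . Text::CSV_XS->error_diag;

# parse_record is a hypothetical helper name used only for this sketch.
sub parse_record {
    my ($line) = @_;
    chomp $line;
    $csv->parse($line) or die "Cannot parse line: " . $csv->error_diag;
    return $csv->fields;
}

print join('|', parse_record(qq{"72 3267S" "S2079" 1 no\n})), "\n";
```

Unlike quotewords, a backslash has no special meaning here unless you choose it as the escape character.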

Re: Splitting strings with enclosed delimiters
by flounder99 (Friar) on Oct 03, 2002 at 21:29 UTC
    The nice and simple regex /([^" ]+|"[^"]*")/g works fine if you don't mind keeping the quotes. You can always strip them later if you want.
    while (<DATA>) {
        chomp;
        my @array = /([^" ]+|"[^"]*")/g;
        foreach (@array) {
            # strip quotes
            s/^"//;
            s/"$//;
            print "'$_'\n";
        }
        print "\n";
    }
    __DATA__
    "72 3267S" "S2079" 1 no
    "72 0250" "S3011" 1 no
    "72 8351S" "S6101" 1 no
    "72 17082S" "S6108" 1 no

    __OUTPUT__
    '72 3267S'
    'S2079'
    '1'
    'no'

    '72 0250'
    'S3011'
    '1'
    'no'

    '72 8351S'
    'S6101'
    '1'
    'no'

    '72 17082S'
    'S6108'
    '1'
    'no'
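
If you would rather not strip the quotes in a second pass, one possible variant uses the branch-reset group (?|...), which assumes Perl 5.10 or later. Both alternatives then capture into $1, so a list-context /g match returns one unquoted value per field. This is a sketch, not part of the original reply.

```perl
use strict;
use warnings;

my $line = '"72 3267S" "S2079" 1 no';

# (?|...) renumbers the captures so each alternative fills $1.
my @fields = $line =~ /(?|"([^"]*)"|([^" ]+))/g;

print "'$_'\n" for @fields;
```

The trade-off is portability: on a Perl older than 5.10 the pattern will not compile, and the strip-afterwards approach is the safer choice.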

    --

    flounder