Re: Not quite a simple split

I once tried to use logic inside a split(), and it produced buggy results:

@parts = split m{
  "            # if we match a quote
  (?{ ++$x })  # increment quote counter
  (?!)         # and fail
  |
  \s+          # or if we match whitespace
  (?(?{$x&1})  # if $x is odd
    (?!)       # fail
  )            # (otherwise succeed)
}x, q{A B "C D" E F"G H" I};

But it doesn't work right. 0's and 2's end up in the output. It's crappy.

I'd use a regex, not split().

@parts = $string =~ m{
  (?=\S)       # so long as there's something ahead of us:
  [^\s"]*      # non-quotes non-whitespace
  (?:
    " [^"]* "  # a quoted part
    [^\s"]*    # non-quotes non-whitespace
  )*           # zero or more times
}xg;
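
To see what the match-based approach returns, here is a minimal sketch that runs the same pattern on the sample string from the split() attempt. Using that string as `$string` and printing with angle brackets are assumptions made for illustration; the post does not say what `$string` held.

```perl
use strict;
use warnings;

# Assumption: the same sample input as the split() attempt above.
my $string = q{A B "C D" E F"G H" I};

# In list context with /g and no capture groups, the match
# returns every whole match as one list element.
my @parts = $string =~ m{
  (?=\S)       # only start a token at a non-whitespace character
  [^\s"]*      # non-quote, non-whitespace characters
  (?:
    " [^"]* "  # a quoted part (may contain whitespace)
    [^\s"]*    # more non-quote, non-whitespace characters
  )*           # zero or more times
}xg;

print "<$_>\n" for @parts;
# <A>
# <B>
# <"C D">
# <E>
# <F"G H">
# <I>
```

Note that the quotes stay in the tokens; stripping them would be a separate step.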

_____________________________________________________
Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: Re: Not quite a simple split by John M. Dlugosz (Monsignor) on Feb 02, 2004 at 03:23 UTC
Ah, the master speaks. When I saw your post filled with (? syntax, I knew you had addressed the subtleties of the problem.

So, let me understand... the first thing, `(?=\S)`, will fail if the next character is whitespace or there is no next character. I wonder why we need that? Ah, it interacts with the /g to say "no match at this position" to actually skip the spaces! And the spaces naturally don't wind up in the returned array. Beautiful.

I also like the way the non-quote stuff is always first, and the quoted part is an optional part that follows, rather than having two totally different cases. So first it matches everything that's not whitespace or a quote. Then it picks up the quote, the stuff inside it, and the close quote. Then it has `[^\s"]*` again, and the whole thing is in a repeat star. That means it will handle anything with an even number of quotes in it, not just a single pair, and it need not end on the close-quote. That is an interesting generalization, and I rather like it.

I suppose you couldn't pull the non-quote non-whitespace part out of the loop because it must be performed at least once. Ah, but you know it's not a space already, and taking out the 3rd line and changing the 7th line from * to + would work, and further allow things that begin with a quote. Would it not?

--John
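
The role of `(?=\S)` described above can be checked with a short sketch that runs the pattern with and without the lookahead. The sample string and variable names are assumptions chosen for illustration. Since every piece of the pattern is optional, the version without the lookahead can match the empty string between tokens, and those empty matches end up in the list.

```perl
use strict;
use warnings;

# Assumption: a short sample string, not taken from the thread.
my $string = q{A B "C D" E};

# The pattern from the post with the (?=\S) lookahead removed.
my @without = $string =~ m{ [^\s"]* (?: " [^"]* " [^\s"]* )* }xg;

# The pattern as posted, with the lookahead.
my @with = $string =~ m{ (?=\S) [^\s"]* (?: " [^"]* " [^\s"]* )* }xg;

# grep in scalar context returns the number of matching elements.
my $empties_without = grep { $_ eq '' } @without;
my $empties_with    = grep { $_ eq '' } @with;

print "without lookahead: $empties_without empty match(es)\n";
print "with lookahead:    $empties_with empty match(es)\n";
print "<$_>\n" for @with;
```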
Re: Re: Re: Not quite a simple split by japhy (Canon) on Feb 02, 2004 at 14:15 UTC
The regex could be changed to:

@parts = $string =~ m{
  (
    (?:
      [^\s"]+    # one or more non-quote/whitespace
      |
      " [^"]* "  # a quoted part
    )+           # one or more times
  )
}xg;

This probably runs better.
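
As a quick sanity check, the following sketch compares the original pattern and the revised one on a few inputs. The sample strings (other than the one from the first post) and the `$all_agree` flag are assumptions added for illustration; agreement on these inputs does not show the two patterns are equivalent in general, for example on strings with an unbalanced quote.

```perl
use strict;
use warnings;

# Assumption: illustrative inputs; only the first comes from the thread.
my @samples = (
  q{A B "C D" E F"G H" I},
  q{"quoted first" then plain},
  q{  leading and trailing  },
);

my $all_agree = 1;

for my $string (@samples) {
  # Original pattern: lookahead plus optional pieces.
  my @first = $string =~ m{
    (?=\S) [^\s"]* (?: " [^"]* " [^\s"]* )*
  }xg;

  # Revised pattern: one or more pieces, so it can never match empty.
  my @second = $string =~ m{
    ( (?: [^\s"]+ | " [^"]* " )+ )
  }xg;

  my $same = join("\0", @first) eq join("\0", @second);
  $all_agree = 0 unless $same;
  printf "%-28s %s\n", "[$string]", $same ? 'same tokens' : 'DIFFERENT';
}
```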
Re: Re: Re: Re: Not quite a simple split by John M. Dlugosz (Monsignor) on Feb 02, 2004 at 16:25 UTC
Why don't you need the `(?=\S)` in this version? Because it doesn't match an empty string (in the first one, either part was optional which made both parts optional at the same time)?
Re: Re: Re: Re: Re: Not quite a simple split by japhy (Canon) on Feb 02, 2004 at 19:16 UTC
Re^4: Not quite a simple split by Roy Johnson (Monsignor) on Feb 02, 2004 at 19:52 UTC
Your code counts `8"foo"8` as one token, but I think the token is supposed to terminate after the 2nd quote, leaving the second 8 as a separate token.

`m/([^\s"]+|[^\s"]*(?:"[^"]*"))/g`

The PerlMonk `tr///` Advocate
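
The observation about `8"foo"8` can be reproduced with a small sketch that applies japhy's revised pattern to just that string. The variable names are assumptions for illustration. Whether one token is the desired result depends on the original question, which this thread excerpt does not include.

```perl
use strict;
use warnings;

my $string = q{8"foo"8};

# japhy's revised pattern: the + on the group lets unquoted and
# quoted pieces chain together into a single token.
my @tokens = $string =~ m{
  (
    (?:
      [^\s"]+    # one or more non-quote/whitespace
      |
      " [^"]* "  # a quoted part
    )+           # one or more times
  )
}xg;

print scalar(@tokens), " token(s): @tokens\n";
# 1 token(s): 8"foo"8
```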