Not quite a simple split

John M. Dlugosz has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Not quite a simple split by japhy (Canon) on Feb 01, 2004 at 03:19 UTC
I once tried to use logic inside a split(), and it produced buggy results:

    @parts = split m{
      "               # if we match a quote
      (?{ ++$x })     # increment quote counter
      (?!)            # and fail
      |
      \s+             # or if we match whitespace
      (?(?{$x&1})     # if $x is odd
        (?!)          # fail
      )               # (otherwise succeed)
    }x, q{A B "C D" E F"G H" I};

But it doesn't work right. 0's and 2's end up in the output. It's crappy. I'd use a regex, not split().

    @parts = $string =~ m{
      (?=\S)        # so long as there's something ahead of us:
      [^\s"]*       # non-quotes non-whitespace
      (?:
        " [^"]* "   # a quoted part
        [^\s"]*     # non-quotes non-whitespace
      )*            # zero or more times
    }xg;

Jeff `[japhy]` Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
`s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;`
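For reference, a minimal runnable sketch of the regex approach above, laid out one construct per line so the /x comments stay valid. The sample string is borrowed from the split() attempt; the variable names are only for illustration.

```perl
use strict;
use warnings;

# Sample input borrowed from the split() attempt above.
my $string = q{A B "C D" E F"G H" I};

my @parts = $string =~ m{
    (?=\S)          # only start where a non-space character follows
    [^\s"]*         # non-quote, non-whitespace characters
    (?:
        " [^"]* "   # a quoted part
        [^\s"]*     # more non-quote, non-whitespace characters
    )*              # zero or more times
}xg;

print "$_\n" for @parts;
# Prints six lines: A, B, "C D", E, F"G H", I
```

Because the match runs with /g in list context and has no capture groups, each whole match becomes one element of `@parts`, and the whitespace between matches is simply never part of any match.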
Re: Re: Not quite a simple split by John M. Dlugosz (Monsignor) on Feb 02, 2004 at 03:23 UTC
Ah, the master speaks. When I saw your post filled with (? syntax, I knew you had addressed the subtleties of the problem.

So, let me understand... the first thing, `(?=\S)`, will fail if the next character is whitespace or there is no next character. I wonder why we need that? Ah, it interacts with the /g to say "no match at this position" to actually skip the spaces! And the spaces naturally don't wind up in the returned array. Beautiful.

I also like the way the non-quote stuff is always first, and the quoted part is an optional part that follows, rather than having two totally different cases. So first it matches everything that's not whitespace or a quote. Then it picks up the quote, stuff inside it, and close quote. Then it has `[^\s"]*` again, and the whole thing is in a repeat star. That means it will handle anything with an even number of quotes in it, not just a single pair, and it need not end on the close-quote. That is an interesting generalization, and I rather like it.

I suppose you couldn't pull the non-quote non-whitespace part out of the loop because it must be performed at least once. Ah, but you know it's not a space already, and taking out the 3rd line and changing the 7th line from `*` to `+` would work, and further allow things that begin with a quote. Would it not?

--John
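One way to check the change proposed at the end of this post is simply to run it. The sketch below assumes the proposal means dropping the leading `[^\s"]*` and turning the final `)*` into `)+`; that reading of the "3rd line" and "7th line" is an assumption about how the original code was laid out.

```perl
use strict;
use warnings;

my $string = q{A B "C D" E F"G H" I};

# Assumed reading of the proposed change: drop the leading [^\s"]* and
# turn the trailing )* into )+ so the quoted group must match at least once.
my @parts = $string =~ m{
    (?=\S)          # only start where a non-space character follows
    (?:
        " [^"]* "   # a quoted part
        [^\s"]*     # trailing non-quote, non-whitespace characters
    )+              # one or more times
}xg;

print "$_\n" for @parts;
# Prints only "C D" and "G H" (with their quotes): the bare words are lost
```

Under that reading, every token has to begin with a quote, so bare words such as `A` and the `F` prefix are skipped. Allowing either kind of piece first needs an alternation rather than a fixed order.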
Re: Re: Re: Not quite a simple split by japhy (Canon) on Feb 02, 2004 at 14:15 UTC
The regex could be changed to:

    @parts = $string =~ m{
      (
        (?:
          [^\s"]+     # one or more non-quote/whitespace
          |
          " [^"]* "   # a quoted part
        )+            # one or more times
      )
    }xg;

This probably runs better.
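As a quick cross-check, the alternation version can be wrapped in a small sub and run against the sample strings other posters use in this thread. The sub name `quoted_tokens` is hypothetical; the pattern is the one given above.

```perl
use strict;
use warnings;

# Hypothetical wrapper name; the pattern is the alternation version above.
# Call it in list context to get the captured tokens back.
sub quoted_tokens {
    my ($string) = @_;
    return $string =~ m{
        (
            (?:
                [^\s"]+     # one or more non-quote, non-whitespace characters
                |
                " [^"]* "   # a quoted part
            )+              # one or more times
        )
    }xg;
}

for my $sample (
    q{Here we have u"a quoted string" and a .},
    q{The 8"quick brown" fox jumps over the U"lazy" dog},
) {
    print join('|', quoted_tokens($sample)), "\n";
}
# Prints:
#   Here|we|have|u"a quoted string"|and|a|.
#   The|8"quick brown"|fox|jumps|over|the|U"lazy"|dog
```

Since each pass through the group may be either a bare run or a quoted part, a token can start with a quote, end with one, or mix both freely.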
Re: Re: Re: Re: Not quite a simple split by John M. Dlugosz (Monsignor) on Feb 02, 2004 at 16:25 UTC
Re: Re: Re: Re: Re: Not quite a simple split by japhy (Canon) on Feb 02, 2004 at 19:16 UTC
Re^4: Not quite a simple split by Roy Johnson (Monsignor) on Feb 02, 2004 at 19:52 UTC
Re: Not quite a simple split by Zaxo (Archbishop) on Feb 01, 2004 at 02:41 UTC
How about, `my @stuff = grep {defined} split /\s+(?:"([^"]*)")?\s*/;` if you can ignore or filter out some undef elements. Captured chunks from the split regex get into the resulting list.

After Compline, Zaxo
Re: Re: Not quite a simple split by John M. Dlugosz (Monsignor) on Feb 01, 2004 at 23:20 UTC
I don't understand. As I read it, the split will take whitespace and a quoted string as the delimiter. So it will return all the tokens that are not quoted strings. I guess the undef return has to do with matching multiple times in the same gap? I thought split was specifically supposed to not do that.
Re: Re: Re: Not quite a simple split by Zaxo (Archbishop) on Feb 01, 2004 at 23:36 UTC
I think you missed the '?' quantifier after the quoted-string group. It is allowed to be absent, so the split will accept whitespace alone. It also eats trailing whitespace after a quoted section. The captured string between the quotes is the only element of the regex that is passed into the list result of split. If there is no quoted string, that capture is present, but undef. Hence the grep filter.
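The behaviour described here can be seen on a short input. This is only a sketch: the two `*` quantifiers in the pattern are an assumed reconstruction chosen to fit the description above, and the sample string is made up for illustration.

```perl
use strict;
use warnings;

my $string = q{A B "C D" E};

# Assumption: the two * quantifiers below are a reconstruction of the
# pattern in the post, chosen to match the description of its behaviour.
my @raw = split /\s+(?:"([^"]*)")?\s*/, $string;

# @raw holds five elements: 'A', undef, 'B', 'C D', 'E'.
print join('|', map { defined $_ ? $_ : 'undef' } @raw), "\n";

# Filtering out the undef placeholders leaves the tokens.
my @stuff = grep { defined } @raw;
print join('|', @stuff), "\n";
# Prints:
#   A|undef|B|C D|E
#   A|B|C D|E
```

The separator between `A` and `B` is plain whitespace, so the capture group does not take part and split inserts undef for it. The separator between `B` and `E` includes the quoted string, so its captured text is inserted instead. Note that the captured text does not include the quote characters themselves.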
Re: Re: Re: Re: Not quite a simple split by John M. Dlugosz (Monsignor) on Feb 02, 2004 at 05:23 UTC
Re: Not quite a simple split by antirice (Priest) on Feb 01, 2004 at 03:17 UTC
I'd probably go for the regex just because it's the first thing I thought of and I didn't consider your problem for too long (hey, at least I'm honest :). But this should do: `my @tokens = $string =~ /([^ ]+".*?"|[^ ]+)/g;`

Update: Forgot you could have something before the first quote.

antirice
The first rule of Perl club is - use Perl
The ith rule of Perl club is - follow rule i - 1 for i > 1
Re: Not quite a simple split by bart (Canon) on Feb 01, 2004 at 14:45 UTC
There's a FAQ entry (no, not on this site, but there is Perl life outside of it, too, you know :): How can I split a [character] delimited string except when inside [character]? (Comma-separated files)

Anyway, aside from that, my first thought would be along these lines: `@tokens = /(?:".*?"|\S)+/g;` which, with the string `$_ = 'Here we have u"a quoted string" and a .';` produces, with each item of `@tokens` on a separate line:

    Here
    we
    have
    u"a quoted string"
    and
    a
    .

Looks fine to me.
Re: Re: Not quite a simple split by John M. Dlugosz (Monsignor) on Feb 02, 2004 at 03:33 UTC
You and BrowserUK had what I almost came up with the other night. When switching from split to re/g, I was missing the part about it scanning to the next match, skipping stuff that doesn't match (spaces). Hmm, but that means it will silently skip syntax errors, too! I wonder...
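One possible way around the silent skipping, offered only as a sketch and not something proposed in the thread, is to anchor each match with `\G` and use the `/gc` flags so that nothing except whitespace can be passed over. The helper name `strict_tokens` and the error message are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the thread: tokenise from left to right
# and refuse to skip anything other than whitespace.
sub strict_tokens {
    my ($string) = @_;
    my @tokens;
    pos($string) = 0;
    while (pos($string) < length($string)) {
        # Whitespace between tokens is the only thing we may skip.
        next if $string =~ /\G\s+/gc;
        # A token: bare characters and quoted parts, in any mix.
        if ($string =~ /\G((?:[^\s"]+|"[^"]*")+)/gc) {
            push @tokens, $1;
            next;
        }
        # /c keeps pos() where the failed match started.
        die sprintf("cannot tokenise at offset %d\n", pos($string));
    }
    return @tokens;
}

print join('|', strict_tokens(q{A B "C D" E F"G H" I})), "\n";

# An unbalanced quote is reported instead of being skipped.
eval { strict_tokens(q{A "B}); 1 } or print "error: $@";
# Prints:
#   A|B|"C D"|E|F"G H"|I
#   error: cannot tokenise at offset 2
```

The `\G` anchor forces each match to begin exactly where the previous one ended, and `/c` stops a failed match from resetting that position, so any input the token pattern cannot consume surfaces as an error.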
Re: Not quite a simple split by BrowserUk (Patriarch) on Feb 01, 2004 at 12:02 UTC
If I understood the 'spec'?

    $s = 'The 8"quick brown" fox jumps over the U"lazy" dog';
    print join'|', $s =~ m[ \s* ( (?: [8U]"[^"]+" ) | \S+ ) ]gx;

    The|8"quick brown"|fox|jumps|over|the|U"lazy"|dog

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Timing (and a little luck) are everything!
Re: Not quite a simple split by graff (Chancellor) on Feb 02, 2004 at 04:24 UTC
It looks like you got the help you were after, but I couldn't resist a comment on this bit:

"I don't want to have to go to a full-blown fancy parser just to handle this one little case. I think a two-pass system could do it... But that seems in-elegant."

Well, I'd have to ask: What do you really think you want? A quickie solution for 'one little case', or an elegant solution? Granted, these alternatives are not always mutually exclusive, but one of the things that should make a solution "quick" is simplicity, whereas "elegance" is often assigned to things that are more subtle than they are simple. I guess the real question is whether the extra time and effort to create elegance is worthwhile for the given task.
Re: Re: Not quite a simple split by John M. Dlugosz (Monsignor) on Feb 02, 2004 at 05:30 UTC
What I want is:

- Quick to code. One line is terrific!
- Easy to define. Simply stating "separate at whitespace" is very simple. That's what Forth does. I want to have "...except for quoted ones" without shifting to a more complex kind of grammar.
- I'd like to write it as a built-in Perl expression, rather than loading (and learning) a parser module. Perhaps if I had Perl6 patterns I'd say that anything built-in is fine.