solution wanted for break-on-spaces (w/specifics)

perl-diddler has asked for the wisdom of the Perl Monks concerning the following question:

Re: solution wanted for break-on-spaces (w/specifics) by AnomalousMonk (Archbishop) on Oct 24, 2021 at 07:59 UTC
Building on kcott's approach (and his test cases and their underlying assumptions), here's a regex-based solution. I've added a few test cases of my own, but their validity is questionable because I don't fully understand perl-diddler's requirements. No attempt has been made to compare performance.

    Win8 Strawberry 5.8.9.5 (32) Sun 10/24/2021 3:14:25
    C:\@Work\Perl\monks
    >perl
    use strict;
    use warnings;

    use Test::More;
    use Test::NoWarnings;

    sub pp { local $" = '| |'; "|@{$_[0]}|"; }  # for output pretty-printing

    my @tests = (
      q{all '- and "-quotes properly balanced},
      [ q{This is simple.},            [ q{This}, q{is}, q{simple.} ] ],
      [ q{ This is simple. },          [ q{This}, q{is}, q{simple.} ] ],
      [ q{This is "so very simple".},  [ q{This}, q{is}, q{"so very simple".} ] ],
      [ q{This "is so" very simple.},  [ q{This}, q{"is so"}, q{very}, q{simple.} ] ],
      [ q{This 'isn\'t nice.'},        [ q{This}, q{'isn\'t nice.'} ] ],
      [ q{This "isn\"t nice."},        [ q{This}, q{"isn\"t nice."} ] ],
      [ q{This 'isn\\\\'t nice.'},     [ q{This}, q{'isn\\\\'t}, q{nice.'} ] ],
      [ q{This "isn\\\\"t nice."},     [ q{This}, q{"isn\\\\"t}, q{nice."} ] ],
      [ q{This 'is not unnice.'},      [ q{This}, q{'is not unnice.'} ] ],
      [ q{This "is not unnice."},      [ q{This}, q{"is not unnice."} ] ],
      [ q{a "bb cc" d},                [ q{a}, q{"bb cc"}, q{d} ] ],
      q{UNbalanced '- and "-quotes at absolute end of string},
      [ q{This is "so very simple},    [ q{This}, q{is}, q{"so very simple} ] ],
      [ q{This 'isn\'t nice.},         [ q{This}, q{'isn\'t nice.} ] ],
      [ q{This "isn\"t nice.},         [ q{This}, q{"isn\"t nice.} ] ],
      [ q{This 'isn\\\\'t nice.},      [ q{This}, q{'isn\\\\'t}, q{nice.} ] ],
      [ q{This "isn\\\\"t nice.},      [ q{This}, q{"isn\\\\"t}, q{nice.} ] ],
      [ q{This 'is not unnice.},       [ q{This}, q{'is not unnice.} ] ],
      [ q{This "is not unnice.},       [ q{This}, q{"is not unnice.} ] ],
      'what about these questionable cases?',
      [ q{is this"really so"simple now?},  [ q{is}, q{this"really so"simple}, q{now?} ] ],
      [ q{is this"really so" now?},        [ q{is}, q{this"really so"},       q{now?} ] ],
      [ q{is "really so"simple now?},      [ q{is}, q{"really so"simple},     q{now?} ] ],
      [ q{is this'really so'simple now?},  [ q{is}, q{this'really so'simple}, q{now?} ] ],
      [ q{is this'really so' now?},        [ q{is}, q{this'really so'},       q{now?} ] ],
      [ q{is 'really so'simple now?},      [ q{is}, q{'really so'simple},     q{now?} ] ],
      );

    my @additional = qw(Test::NoWarnings);  # each of these adds 1 test

    plan 'tests' => (scalar grep { ref eq 'ARRAY' } @tests) + @additional ;

    # an escape \ escapes ANY character.
    my $rx_dq = qr{ " [^\\"]* (?: \\. [^\\"]*)* (?: " | \z) }xms;
    my $rx_sq = qr{ ' [^\\']* (?: \\. [^\\']*)* (?: ' | \z) }xms;
    my $rx_q  = qr{ $rx_dq | $rx_sq }xms;

    # match quoted or non-space substrings.  alt order critical!
    # my $rx_extract = qr{ $rx_q \S* | \S+ }xms;  # for non-questionable cases
    my $rx_extract = qr{ [^'"\s]* $rx_q [^'"\s]* | \S+ }xms;

    VECTOR:
    for my $ar_vector (@tests) {
        if (not ref $ar_vector) {
            note $ar_vector;
            next VECTOR;
            }

        my ($string, $ar_expected) = @$ar_vector;

        my @got = $string =~ m{ $rx_extract }xmsg;
        is_deeply \@got, $ar_expected, "|$string| -> " . pp $ar_expected;
        }  # end for VECTOR
    ^Z
    1..25
    # all '- and "-quotes properly balanced
    ok 1 - |This is simple.| -> |This| |is| |simple.|
    ok 2 - | This is simple. | -> |This| |is| |simple.|
    ok 3 - |This is "so very simple".| -> |This| |is| |"so very simple".|
    ok 4 - |This "is so" very simple.| -> |This| |"is so"| |very| |simple.|
    ok 5 - |This 'isn\'t nice.'| -> |This| |'isn\'t nice.'|
    ok 6 - |This "isn\"t nice."| -> |This| |"isn\"t nice."|
    ok 7 - |This 'isn\\'t nice.'| -> |This| |'isn\\'t| |nice.'|
    ok 8 - |This "isn\\"t nice."| -> |This| |"isn\\"t| |nice."|
    ok 9 - |This 'is not unnice.'| -> |This| |'is not unnice.'|
    ok 10 - |This "is not unnice."| -> |This| |"is not unnice."|
    ok 11 - |a "bb cc" d| -> |a| |"bb cc"| |d|
    # UNbalanced '- and "-quotes at absolute end of string
    ok 12 - |This is "so very simple| -> |This| |is| |"so very simple|
    ok 13 - |This 'isn\'t nice.| -> |This| |'isn\'t nice.|
    ok 14 - |This "isn\"t nice.| -> |This| |"isn\"t nice.|
    ok 15 - |This 'isn\\'t nice.| -> |This| |'isn\\'t| |nice.|
    ok 16 - |This "isn\\"t nice.| -> |This| |"isn\\"t| |nice.|
    ok 17 - |This 'is not unnice.| -> |This| |'is not unnice.|
    ok 18 - |This "is not unnice.| -> |This| |"is not unnice.|
    # what about these questionable cases?
    ok 19 - |is this"really so"simple now?| -> |is| |this"really so"simple| |now?|
    ok 20 - |is this"really so" now?| -> |is| |this"really so"| |now?|
    ok 21 - |is "really so"simple now?| -> |is| |"really so"simple| |now?|
    ok 22 - |is this'really so'simple now?| -> |is| |this'really so'simple| |now?|
    ok 23 - |is this'really so' now?| -> |is| |this'really so'| |now?|
    ok 24 - |is 'really so'simple now?| -> |is| |'really so'simple| |now?|
    ok 25 - no warnings

Give a man a fish: `<%-{-{-{-<`
Re^2: solution wanted for break-on-spaces (w/specifics) by perl-diddler (Chaplain) on Oct 26, 2021 at 16:36 UTC
Re: "No attempt has been made to compare performance."

Absolutely! I thought about rolling that in, but the question was already long and complex. I totally agree that performance should be measured and factored in. However, there are a few sayings around that, such as "premature optimization is a bane". It is similar to the usual priorities in code development: 1) get something working, 2) then look at other issues (like perf, etc).

Of the solutions I've seen, both seem like they wouldn't be too different, as they use similar methodology. A multi-state parser might be faster than an RE, but maybe not if written in interpreted Perl code. One might have to go to 'XS' to gain speed that way.
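For anyone who wants to follow up on the performance question, here is a minimal sketch of how such a comparison could be set up with the core Benchmark module. `by_regex` uses the pattern tybalt89 posted elsewhere in this thread; `by_loop` is a simplified, hypothetical character-by-character tokenizer written only for this sketch (it is not kcott's code), and the sample lines are made up. No timings are claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Pattern as posted by tybalt89 in this thread.
my $rx = qr/(?: '(?: \\. | [^'\\] )*' | "(?: \\. | [^"\\] )*" | ['"].* | \\. | \S )+/x;

sub by_regex {
    my ($s) = @_;
    return [ $s =~ /$rx/g ];
}

# Hypothetical hand-written scanner, for comparison only.
sub by_loop {
    my ($s) = @_;
    my (@out, $quote);
    my ($tok, $i, $len) = ('', 0, length $s);
    while ($i < $len) {
        my $c = substr $s, $i++, 1;
        if ($c eq '\\') {                       # backslash escapes the next char
            $tok .= $c;
            $tok .= substr($s, $i++, 1) if $i < $len;
        }
        elsif (defined $quote) {                # inside quotes: only the same quote closes
            undef $quote if $c eq $quote;
            $tok .= $c;
        }
        elsif ($c eq '"' or $c eq "'") {        # opening quote
            $quote = $c;
            $tok  .= $c;
        }
        elsif ($c =~ /\s/) {                    # unquoted whitespace ends a token
            push @out, $tok if length $tok;
            $tok = '';
        }
        else {
            $tok .= $c;
        }
    }
    push @out, $tok if length $tok;
    return \@out;
}

my @lines = (
    q{This "is so" very simple.},
    q{This 'isn\'t nice.'},
    q{is this"really so"simple now?},
);

# Run each candidate for at least 1 CPU second and print a comparison table.
cmpthese(-1, {
    regex => sub { by_regex($_) for @lines },
    loop  => sub { by_loop($_)  for @lines },
});
```

The two subs should be checked for identical output on the real test data before any timing is taken seriously.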
Re: solution wanted for break-on-spaces (w/specifics) by hippo (Archbishop) on Oct 23, 2021 at 22:57 UTC
Here are a few suggestions to make the code clearer and perhaps then garner more helpful answers:

- use strict
- use warnings
- use Test::More instead of trying to roll your own testing framework
- Avoid prototypes
- Avoid localising $_
- Avoid capture groups which you never use
- Avoid P. It's fine in your own code but here it is unnecessary (or would be if you used Test::More) and is another barrier to help.
- Pick a formatting scheme and stick to it. Random whitespace doesn't help.

In summary, help us to help you.

🦛
Re^2: solution wanted for break-on-spaces (w/specifics) by perl-diddler (Chaplain) on Oct 26, 2021 at 17:43 UTC
re strict/warnings -- they were there and got deleted as I deleted chunks of template-prefix code...oops.

Test::More is what I use for testing, not random development -- Test::More is a heavy-weight solution for testing a few example RE's against lines in a file.

prototypes -- avoid? Only when I need to avoid them to make it work. Most of my prototypes are documentary, in that I put them on class methods where they aren't used, with the expectation that the "this" ptr doesn't count.

localising $_ -- I localise it if I change its value in a sub -- I don't want to create side effects. In code cleanup I'll often replace them with "my $var"s.

capture groups -- I don't think there were any that I didn't use. I use (?:...) if I don't use the result.

Avoid P? If I don't use it, who would? ;-)

As for being able to 'help' me -- I'm beyond help, but anyone who tried to write a regex seemed to have no problem giving me clues about things that worked or things to try.
Re^3: solution wanted for break-on-spaces (w/specifics) by hippo (Archbishop) on Oct 26, 2021 at 21:47 UTC
> Test::More is what I use for testing not random development -- Test::More is a heavy-weight solution for testing a few example RE's against lines in a file.

Test::More is in Core so everyone has it, and everyone who writes any significant amount of Perl has used it and is familiar with it. The same is not true of your hand-rolled testing framework, so when I look at your example code I have to first analyse your testing framework, not least because it might be responsible for the underlying problem your code exhibits. If Test::More is too "heavy-weight" for you then you can always use the ultra-light Test::Simple instead.

> prototypes -- avoid?

Yes, avoid!

> localising $_ -- I localise it if I change it's value in a sub -- I don't want to create side effects. In code cleanup I'll often replace them with "my $var"s.

> capture groups -- don't think there were any such that I didn't use. I use (?:...) if I don't use the result.

Here is your subroutine `txt`:

`sub txt($) { local $_=shift; my (undef, undef,$txt)=m{^\s*(\d+)\s+(\d+),(.*)}; $txt; }`

It unnecessarily localizes $_ and discards 2 capture groups. Instead it could be written thus:

`sub txt { shift =~ /^\s*\d+\s+\d+,(.*)/; return $1; }`

No need to mess with $_ or declare any lexical variables at all. No need for 3 capture groups when all you want is one. No need for prototypes either.

Of course you are entirely free to ignore these suggestions, but the harder you make it for others to read or run your code, the less likely they are to want to unpick it all.

🦛
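One further variation on the single-capture version, offered only as a sketch: a match in list context returns its captures directly, so the result can be assigned without touching `$1`, and a failed match yields `undef` rather than whatever `$1` held from an earlier successful match. The pattern below assumes the line format is "optional space, number, spaces, number, comma, text"; the sample input is made up.

```perl
use strict;
use warnings;

# Assumed line format: optional leading space, two numbers, a comma, then the text.
sub txt {
    my ($line) = @_;
    # In list context a successful match returns the captured groups;
    # on failure it returns an empty list, leaving $txt undefined.
    my ($txt) = $line =~ /^\s*\d+\s+\d+,(.*)/;
    return $txt;
}

my $value = txt(' 12 34,some text');
print defined $value ? "got: $value\n" : "no match\n";
```

This does use one lexical, so it trades a little brevity for not depending on the state of `$1`.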
Re: solution wanted for break-on-spaces (w/specifics) by kcott (Archbishop) on Oct 24, 2021 at 05:00 UTC
G'day perl-diddler,

Testing for the number of elements is a weak test; you really need qualitative tests as well. In addition, that would have told us what you expected (and allowed better answers).

Your title has "break-on-spaces" (plural) but all your tests only use single spaces. In my code below, I added an additional test to show that `q{This is simple.}` and `q{ This is simple. }` both produce the same output. I guessed that is what you would've wanted; if not, you'll need to advise us.

Writing code for purely academic reasons is absolutely fine; I do it myself. Having said that, the regex you presented is unwieldy, difficult to read, and maintenance would, I suspect, be an error-prone nightmare. I've provided an alternative solution below which mostly just uses Perl's string handling functions. When you have a working regex solution, I'd be interested to see a benchmark.

You indicated that you'd encountered problems with lines 4-7, and later amended that to just 6-7. I suspect you may have run into problems with escaping, particularly `\\` and `\\\\`. Take a look at my `ok N` lines 7-10: I've just made a guess at what I thought you wanted.

I've included most of your tests; you can, of course, add the remainder yourself. I didn't see the benefit of tests 8 and 9, and I thought that tests 10-15 potentially had issues with escaped backslashes, so it's perhaps best to wait for clarification from you on that score.

Here's the code:

    #!/usr/bin/env perl

    use strict;
    use warnings;

    use Test::More;

    my @tests = (
        [q{This is simple.}, [q{This}, q{is}, q{simple.}]],
        [q{ This is simple. }, [q{This}, q{is}, q{simple.}]],
        [q{This is "so very simple".}, [q{This}, q{is}, q{"so very simple".}]],
        [q{This "is so" very simple.}, [q{This}, q{"is so"}, q{very}, q{simple.}]],
        [q{This 'isn\'t nice.'}, [q{This}, q{'isn\'t nice.'}]],
        [q{This "isn\"t nice."}, [q{This}, q{"isn\"t nice."}]],
        [q{This 'isn\\'t nice.'}, [q{This}, q{'isn\\'t nice.'}]],
        [q{This "isn\\"t nice."}, [q{This}, q{"isn\\"t nice."}]],
        [q{This 'isn\\\\'t nice.'}, [q{This}, q{'isn\\\\'t}, q{nice.'}]],
        [q{This "isn\\\\"t nice."}, [q{This}, q{"isn\\\\"t}, q{nice."}]],
        [q{This 'is not unnice.'}, [q{This}, q{'is not unnice.'}]],
        [q{This "is not unnice."}, [q{This}, q{"is not unnice."}]],
        [q{a "bb cc" d}, [q{a}, q{"bb cc"}, q{d}]],
    );

    plan tests => 0+@tests;

    for my $test (@tests) {
        my ($raw_str, $exp) = @$test;
        my $str = ($raw_str =~ /^\s*(.*?)\s*$/)[0];
        my $got = [];
        my $str_len = length $str;
        my ($unbroken, $in_quote, $escape, $in_space) = ('', '', 0, 0);
        my $quote_re = qr{(['"])};

        for my $str_index (0 .. $str_len - 1) {
            my $char = substr $str, $str_index, 1;

            if ($escape) {
                $unbroken .= $char;
                $escape = 0;
                next;
            }

            if ($char eq qq{\\}) {
                $escape = 1;
                $unbroken .= $char;
                next;
            }

            if ($char =~ $quote_re) {
                my $quote = $char;
                if ($in_quote) {
                    $in_quote = '' if $in_quote eq $quote;
                }
                else {
                    $in_quote = $quote;
                }
                $unbroken .= $char;
                next;
            }

            if ($char eq ' ') {
                next if $in_space;
                if ($in_quote) {
                    $unbroken .= $char;
                }
                else {
                    $in_space = 1;
                }
            }
            else {
                $unbroken .= $char;
                $in_space = 0;
                next;
            }

            if ($in_space) {
                push @$got, $unbroken;
                $unbroken = '';
            }
        }

        push @$got, $unbroken;

        is_deeply($got, $exp, qq{<$raw_str>: } . join('|', @$exp));
    }

Here's the output:

    $ ./pm_11137926_str_parse.pl
    1..13
    ok 1 - <This is simple.>: This|is|simple.
    ok 2 - < This is simple. >: This|is|simple.
    ok 3 - <This is "so very simple".>: This|is|"so very simple".
    ok 4 - <This "is so" very simple.>: This|"is so"|very|simple.
    ok 5 - <This 'isn\'t nice.'>: This|'isn\'t nice.'
    ok 6 - <This "isn\"t nice.">: This|"isn\"t nice."
    ok 7 - <This 'isn\'t nice.'>: This|'isn\'t nice.'
    ok 8 - <This "isn\"t nice.">: This|"isn\"t nice."
    ok 9 - <This 'isn\\'t nice.'>: This|'isn\\'t|nice.'
    ok 10 - <This "isn\\"t nice.">: This|"isn\\"t|nice."
    ok 11 - <This 'is not unnice.'>: This|'is not unnice.'
    ok 12 - <This "is not unnice.">: This|"is not unnice."
    ok 13 - <a "bb cc" d>: a|"bb cc"|d

-- Ken
Re^2: solution wanted for break-on-spaces (w/specifics) by LanX (Saint) on Oct 24, 2021 at 09:44 UTC
> Testing for the number of elements is a weak test; you really need qualitative tests as well.

> I've included most of your tests;

I think the best way to test this is to create these strings by joining @expected arrays. By generating these arrays one can make sure to cover all edge cases. As a side product you'll define a formal grammar. Like:

- how are unpaired quotes to be handled?
- what about multiple whitespaces in a row?
- what about multi-line input?
- what about whitespace at start and end of string?

It would also help to test sub-regexes individually. Crafting the strings by hand is error prone, because there are far too many cases to handle.

Cheers Rolf
(addicted to the Perl Programming Language :) Wikisyntax for the Monastery
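The idea of generating inputs from the expected token lists could look roughly like the sketch below. `make_cases` is a hypothetical helper, and the sketch assumes that any run of whitespace separates tokens and that leading and trailing whitespace is ignored; if the real requirements differ, the generator would have to change accordingly.

```perl
use strict;
use warnings;

# Hypothetical helper: build [input, expected] pairs from one token list
# by joining the tokens with each separator in turn.
sub make_cases {
    my ($expected, @separators) = @_;
    my @cases;
    for my $sep (@separators) {
        my $joined = join $sep, @$expected;
        push @cases, [ $joined, $expected ];             # plain join
        push @cases, [ "$sep$joined$sep", $expected ];   # with leading/trailing whitespace
    }
    return @cases;
}

my @token_sets = (
    [ q{This}, q{is}, q{simple.} ],
    [ q{This}, q{"is so"}, q{very}, q{simple.} ],
    [ q{This}, q{'isn\'t nice.'} ],
);

# Single space, several spaces, and a tab as separators.
my @cases = map { make_cases($_, ' ', '   ', "\t") } @token_sets;

printf "%d generated cases\n", scalar @cases;
```

Each generated pair can then be fed to `is_deeply` exactly like the hand-written vectors in the other replies.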
Re^3: solution wanted for break-on-spaces (w/specifics) by perl-diddler (Chaplain) on Oct 26, 2021 at 16:56 UTC
Why do I need qualitative tests? I just wanted to know if the RE's broke the line into the expected number of sections. The original test strings were read from a data file, which held several pared-down representations of what one might find as attr-value fields after an initial xml or html element.

How are unpaired quotes handled? That's really a bit undefined, but I thought terminating them at the end of the "string" would be most forgiving.

For multi-whitespace -- I would assume shell semantics.

Multi-line input -- in some larger, more general case, lf+cr are both types of white space, but I didn't want to clutter my question and test cases.

As for whitespace prefixes and suffixes -- in both cases, there is no "non-whitespace" before or after (respectively) those, so they make no difference in the final answer.

As I tried to stress, the program wasn't really important; it was just something I threw together over a few hours that grew by "whim", to test the regexes against the input lines in the test-data.txt file. It wasn't meant as a formal test harness.
Re^4: solution wanted for break-on-spaces (w/specifics) by LanX (Saint) on Oct 26, 2021 at 21:27 UTC
Re: solution wanted for break-on-spaces (w/specifics) by LanX (Saint) on Oct 23, 2021 at 22:05 UTC
Your regex is messy.

Using the /x flag (see `perlre`), plus line breaks, spaces, and comments, would make things far more readable! (Not only for you ... :)

Consider also composing your regex from smaller parts through interpolation of variables.

Anyway, from what I can spot, you are treating " and ' very differently.

Since your requirements are fuzzy, I don't dare say what you really want.

Cheers Rolf
(addicted to the Perl Programming Language :) Wikisyntax for the Monastery
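To illustrate the two suggestions (the /x flag and building the pattern from named parts), here is a small sketch. The variable names are made up for illustration, and the sketch deliberately covers only balanced quotes; it is not meant as an answer to the original question.

```perl
use strict;
use warnings;

# Small named building blocks, each readable on its own.
my $escaped = qr/ \\ . /x;                           # backslash plus any one char
my $dq      = qr/ " (?: $escaped | [^"\\] )* " /x;   # "double quoted"
my $sq      = qr/ ' (?: $escaped | [^'\\] )* ' /x;   # 'single quoted'

# The full token: any mix of the parts above and ordinary characters.
my $token = qr/
    (?:
        $escaped          # an escape pair outside quotes
      | $dq               # a balanced double-quoted run
      | $sq               # a balanced single-quoted run
      | [^\s'"\\]         # any other non-space character
    )+
/x;

my @parts = q{This "is so" very simple.} =~ /$token/g;
print "[$_]\n" for @parts;
```

With the pieces named, changing how one kind of quote is treated means editing one short line instead of hunting through a long pattern.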
Re: solution wanted for break-on-spaces (w/specifics) by tybalt89 (Monsignor) on Oct 24, 2021 at 18:44 UTC
#!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11137926 use warnings; use Data::Dump 'dd'; my @tests = ( # q{all '- and "-quotes properly balanced}, [ q{This is simple.}, [ q{This}, q{is}, q{simpl +e.} ] ], [ q{ This is simple. }, [ q{This}, q{is}, q{simpl +e.} ] ], [ q{This is "so very simple".}, [ q{This}, q{is}, q{"so v +ery simple".} ] ], [ q{This "is so" very simple.}, [ q{This}, q{"is so"}, q{ +very}, q{simple.} ] ], [ q{This 'isn\'t nice.'}, [ q{This}, q{'isn\'t nice +.'} ] ], [ q{This "isn\"t nice."}, [ q{This}, q{"isn\"t nice +."} ] ], [ q{This 'isn\\\\'t nice.'}, [ q{This}, q{'isn\\\\'t}, + q{nice.'} ] ], [ q{This "isn\\\\"t nice."}, [ q{This}, q{"isn\\\\"t}, + q{nice."} ] ], [ q{This 'is not unnice.'}, [ q{This}, q{'is not unni +ce.'} ] ], [ q{This "is not unnice."}, [ q{This}, q{"is not unni +ce."} ] ], [ q{a "bb cc" d}, [ q{a}, q{"bb cc"}, q{d} + ] ], # q{UNbalanced '- and "-quotes at absolute end of string +}, [ q{This is "so very simple}, [ q{This}, q{is}, q{"so ver +y simple} ] ], [ q{This 'isn\'t nice.}, [ q{This}, q{'isn\'t nice.} + ] ], [ q{This "isn\"t nice.}, [ q{This}, q{"isn\"t nice.} + ] ], [ q{This 'isn\\\\'t nice.}, [ q{This}, q{'isn\\\\'t}, q +{nice.} ] ], [ q{This "isn\\\\"t nice.}, [ q{This}, q{"isn\\\\"t}, q +{nice.} ] ], [ q{This 'is not unnice.}, [ q{This}, q{'is not unnice +.} ] ], [ q{This "is not unnice.}, [ q{This}, q{"is not unnice +.} ] ], # 'what about these questionable cases?', [ q{is this"really so"simple now?}, [ q{is}, q{this"reall +y so"simple}, q{now?} ] ], [ q{is this"really so" now?}, [ q{is}, q{this"reall +y so"}, q{now?} ] ], [ q{is "really so"simple now?}, [ q{is}, q{"really so +"simple}, q{now?} ] ], [ q{is this'really so'simple now?}, [ q{is}, q{this'reall +y so'simple}, q{now?} ] ], [ q{is this'really so' now?}, [ q{is}, q{this'reall +y so'}, q{now?} ] ], [ q{is 'really so'simple now?}, [ q{is}, q{'really so +'simple}, q{now?} ] ], [ q{is really\\ so\\ simple now?}, [ q{is}, q{really\\ so +\\ 
simple}, q{now?} ] ], ); my $regex = qr/(?: '(?: \\. \| [^'\\] )' # single quoted string \| "(?: \\. \| [^"\\] )" # double quoted string \| ['"].* # unmatched quote \| \\. # escaped character \| \S # single non-space character )+/x; my $passcount = 0; for ( @tests ) { my ( $string, $want ) = @$_; my @out = $string =~ /$regex/g; local $" = "\0"x5; # just some array element boundary separator "@$want" eq "@out" ? $passcount++ : dd "$string => FAILED got", \@out, ' wanted ', $want; } print "$passcount of @{[scalar @tests]} passed\n"; [download] Outputs: `25 of 25 passed` [download]	[reply] [d/l] [select]
Re^2: solution wanted for break-on-spaces (w/specifics) (?>...) by LanX (Saint) on Oct 24, 2021 at 23:03 UTC
Hint: You don't need to worry about backtracking if you use `(?>...)` instead of `(?:...)`.

This will not only make your code simpler but also faster.

Cheers Rolf
(addicted to the Perl Programming Language :) Wikisyntax for the Monastery
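For readers who have not met `(?>...)` before, a minimal sketch of what it changes. The first pair of patterns is a textbook illustration, not taken from the thread; the last pattern is a hypothetical double-quote matcher whose alternatives cannot overlap, which is the situation where the atomic form is safe.

```perl
use strict;
use warnings;

# (?:...) may give characters back; (?>...) never does once it has matched.
my $plain  = qr/ (?: a+ ) ab /x;
my $atomic = qr/ (?> a+ ) ab /x;

print "plain group matches 'aaab'\n"         if 'aaab' =~ $plain;
print "atomic group does not match 'aaab'\n" if 'aaab' !~ $atomic;

# Inside quotes the alternatives are mutually exclusive (a backslash can only
# start an escape pair), so nothing is lost by forbidding backtracking.
my $dq_atomic = qr/ " (?> (?: \\. | [^"\\] )* ) " /xs;

if ( q{say "a\"b" now} =~ /($dq_atomic)/ ) {
    print "quoted part: $1\n";
}
```

Whether the atomic form is measurably faster on the inputs in this thread has not been measured here.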
Re^3: solution wanted for break-on-spaces (w/specifics) (?>...) by perl-diddler (Chaplain) on Oct 26, 2021 at 16:27 UTC
BTW, on the no-backtracking -- that was a later addition, one of about 10-15 alterations to the statement that I tried over time.
Re^2: solution wanted for break-on-spaces (w/specifics) by perl-diddler (Chaplain) on Oct 26, 2021 at 16:25 UTC
Your regex was perfect. FWIW, I put it in my original prog (with some bugs in the prog fixed) as the 2nd regex in the regex array. The reason I had them and the outputs in arrays was to compare several RE's, but I ended up with just the one as it passed the most cases. So the lines for cases 3 and 4 (with 4+5 being the two that didn't pass with the regex I originally posted) were:

`ResByLn:{ln=>3, wanted=>4, got=>[4, 4]},[" p ", " p "] ResByLn:{ln=>4, wanted=>2, got=>[3, 2]},["FAIL:<4>", " p "]`

The "got"s were the counts I got from the regexes, with your RE being in the 2nd position. The last brackets contained the pass/fail for each regex against that statement, so yours were 'p' straight down the 2nd column. Thanks.

I had spaces in the earlier revisions of the REs, but I wasn't sure I had the 'x' flag applied to the sub-REs that needed them. I guess each outer layer of the RE's flags get propagated to inner REs.

I'm not sure if you were asking a question about your third group above, where you wrote "'what about these questionable cases?',". I'm not sure what is questionable about them. In my use case, neither 'q{}' nor '?' has special meaning. Only the quotes and backslash were meta chars. So in the first line, I see 3 fields in both of the first 2 cases:

`[ q{is this"really so"simple now?}, [ q{is}, q{this"really so"simple}, q{now?} ] ], ^ ^ ^ ^`

Both expressions had 2 breaks, yielding 3 parts in each. Does that make sense?

One rule I forgot to list, though at least your example handled it as expected, was what to do with overlapping quotes, and not making a quote of a different type have 'meta' properties. I.e.:

`this "is a' test" of weird' stuff`

I may be wrong, but I don't think most here would split that into 3 parts, as most of us are used to meta-properties of characters being disabled or modified within quotes, so the single quote above wouldn't start a quoted sub-expression overlapping with the double-quoted part. That would effectively make "is a' test" of weird' all 1 "word", as all the spaces are between quotes of some type. While that would be "a" way of interpreting overlapping quoted sections, I don't know how expected or useful it would be.

I need to study your example and some others, but wanted to make some response. It's just that about 3-4 other things cropped up and needed attention just after I posted this...
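On the guess about flags propagating inward: as far as `qr//` objects are concerned, it works the other way round. Each `qr//` piece keeps the flags it was compiled with when it is interpolated into a larger pattern, and the outer flags do not reach into it. (A plain string interpolated into a pattern, by contrast, is compiled under the outer pattern's flags.) A small sketch with made-up patterns:

```perl
use strict;
use warnings;

# The inner piece carries its own /x, so its whitespace is ignored
# even though the outer pattern has no /x.
my $inner_x = qr/ a b /x;
my $outer   = qr/^${inner_x}c\z/;
print "inner piece kept its own /x\n" if 'abc' =~ $outer;

# The inner piece has no /x, so its space stays literal
# even though the outer pattern is compiled with /x.
my $inner_plain = qr/a b/;
my $outer_x     = qr/ ^ $inner_plain \z /x;
print "outer /x did not leak into the inner piece\n"
    if 'a b' =~ $outer_x and 'ab' !~ $outer_x;

# Printing a qr// object shows the flags embedded with each piece.
print "$outer_x\n";
```

So every sub-regex that uses whitespace or comments for layout needs its own /x.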
Re: solution wanted for break-on-spaces (w/specifics) by vr (Curate) on Oct 23, 2021 at 23:06 UTC
    use strict;
    use warnings;
    use feature 'say';

    # use Regexp::Common;
    # ^^^ Not used. I'm so lazy, I just peeked at $RE{quoted}
    # to construct the "$quoted" expression below, by slightly
    # modifying it (see "$") to satisfy the third clause.

    # And actually 2nd test case below is to test how it works,
    # it seems there's not a similar one among your 18.

    my $quoted = qr/
        (?:(?|
            (?:(?<!\\)\")(?:[^\\\"]*(?:\\.[^\\\"]*)*)(?:\"|$)|
            (?:(?<!\\)\')(?:[^\\\']*(?:\\.[^\\\']*)*)(?:\'|$)
        ))
    /x;

    my $re = qr/(?:$quoted|[^ ])+\K(?: |$)/;

    my @tests = (
        q(This 'isn\'t nice.'),
        q(This 'isn\'t nice.),
        q(This \"isnt unnice.\"),
    );

    for my $t ( @tests ) {
        say "[$_]" for split $re, $t;
    }

    __END__
    [This]
    ['isn\'t nice.']
    [This]
    ['isn\'t nice.]
    [This]
    [\"isnt]
    [unnice.\"]

10 minutes update: aargh, added negative look-behind to cover your 14th case (and added my third). Maybe there are more to add.

Further: it's more tricky, 6 (and 7) are split in 3, but wrong, groups. Will look into that later. False alarm? Will see yet later :)

Next morning update. As LanX pointed out, negative look-behind for just a single backslash isn't enough. Then to save this answer (I like how the "keep" `\K` meta-character helps in regexp for `split`, it's kind of interesting), maybe it's easier to revert `$quoted` to as it was borrowed from `$RE{quoted}`, and tweak the `$re`:

    my $quoted = qr/
        (?:(?|
            (?:\")(?:[^\\\"]*(?:\\.[^\\\"]*)*)(?:\"|$)|
            (?:\')(?:[^\\\']*(?:\\.[^\\\']*)*)(?:\'|$)
        ))
    /x;

    my $re = qr/
        (?:
            (?:\\\\)+
            |
            (?:\\[^ ])
            |
            $quoted
            |
            [^ ]
        )+
        \K
        (?: \ | $ )
    /x;

I hope it works now; my 1st attempt at this "update" was broken (see, but better not -- nothing interesting -- below. Sorry for the mess.). But further, it's unclear whether to split on escaped space, or several spaces in a row.

And later (final(?)) update: Sigh... damn lack of practice. So this:

    my $quoted = qr/
        (?:(?|
            (?:\")(?:[^\\\"]*(?:\\.[^\\\"]*)*)(?:\"|$)
            |
            (?:\')(?:[^\\\']*(?:\\.[^\\\']*)*)(?:\'|$)
        ))
    /x;

    my $re = qr/
        (?:
            (?:\\.)+
            |
            $quoted
            |
            [^ \\"']+
        )*
        \K
        (?: \ | $ )+
    /x;

    # and later:
    my $got = [ split $re, $str ];

passes all tests in LanX's later answer except #2 and is somewhat optimized. About test #2: consensus is "the brief is unclear"; must `split`-like behaviour generate an empty leading field for #2? The expression to split on is definitely neither missing nor a space literal. If, nevertheless, it must not (as my solution does, failing #2), then my bad, but still, yeah, this regexp is "working" and can be used to literally `split` on. :)
Re^2: solution wanted for break-on-spaces (w/specifics) by LanX (Saint) on Oct 24, 2021 at 00:30 UTC
I'm not sure about this `(?:(?<!\\)\")`

I read it as a double quote which is not preceded by a backslash.

But what about an escaped backslash `\\"` or two `\\\\"` ... ?

I'd rather try something like (untested pseudocode)

`s/^(?:$escaped|$quoted|\S)\K\s+/\n/g`

and

    $escaped = qr/\\./;
    $quoted  = qr/
        (['"])                      # start
        (?: $escaped | [^\1] )      # inside
        \1                          # end, probably \g-1 better
    /x;

NB: I didn't cover the case of unclosed quotes, which is unclear anyway.

Cheers Rolf
(addicted to the Perl Programming Language :) Wikisyntax for the Monastery

update: tested - fails - good night! :)

update: see Re: solution wanted for break-on-spaces (w/specifics) for "working" example
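One likely reason the pseudocode fails as written: inside a bracketed character class `\1` is not a backreference but an octal escape (the character with code 1), so `[^\1]` matches almost anything, including the closing quote. A common workaround is a negative lookahead on the backreference in front of a single-character match. The sketch below is one possible shape, with unclosed quotes still left uncovered; it is not the version from the later reply.

```perl
use strict;
use warnings;

my $escaped = qr/ \\ . /xs;            # backslash plus any one character

my $quoted = qr/
    (['"])                    # opening quote, captured
    (?:
        $escaped              # an escape pair
      | (?! \g{-1} ) [^\\]    # or one char that is neither the same quote nor a backslash
    )*
    \g{-1}                    # the same quote closes (relative backreference)
/xs;

for my $str ( q{a "b 'c' d" e}, q{This 'isn\'t nice.'} ) {
    if ( $str =~ /($quoted)/ ) {
        print "<$str> contains $1\n";
    }
}
```

The relative form `\g{-1}` keeps pointing at the quote capture even when `$quoted` is embedded in a larger pattern that has capture groups of its own.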
Re: solution wanted for break-on-spaces (w/specifics) by LanX (Saint) on Oct 24, 2021 at 12:00 UTC
In continuation to Re^2: solution wanted for break-on-spaces (w/specifics):

Please note how readable and maintainable the regexes become now!

This solves AnomalousMonk's test case here but is easily adaptable to various interpretations.

(I disagree in the case of unbalanced quotes, I'd rather ignore them. For this to happen drop the $-branch commented with "EOL".)

    use v5.12;
    use warnings;
    use Test::More;

    my $escaped = qr/\\./;

    my $quoted = qr/
        (['"])             # --- start-quote
        (?:                # --- inside
            $escaped       #     any escape-pair
         |
            .              #     anything else
        )*?                #     non-greedy
        (?:                # --- end
            \g{-1}         #     same quote
         |
            $              #     EOL ends missing pair
        )
    /x;

    my $re = qr/
        (?:
            $escaped       # any escape pair
         |
            $quoted        # any quoted string
         |
            \S             # any none whitespace
        )+                 # at least once
    /x;

    my $str = q{This "is so" very simple.};

    my @tests = (
        # q{all '- and "-quotes properly balanced},
        [ q{This is simple.},            [ q{This}, q{is}, q{simple.} ] ],
        [ q{ This is simple. },          [ q{This}, q{is}, q{simple.} ] ],
        [ q{This is "so very simple".},  [ q{This}, q{is}, q{"so very simple".} ] ],
        [ q{This "is so" very simple.},  [ q{This}, q{"is so"}, q{very}, q{simple.} ] ],
        [ q{This 'isn\'t nice.'},        [ q{This}, q{'isn\'t nice.'} ] ],
        [ q{This "isn\"t nice."},        [ q{This}, q{"isn\"t nice."} ] ],
        [ q{This 'isn\\\\'t nice.'},     [ q{This}, q{'isn\\\\'t}, q{nice.'} ] ],
        [ q{This "isn\\\\"t nice."},     [ q{This}, q{"isn\\\\"t}, q{nice."} ] ],
        [ q{This 'is not unnice.'},      [ q{This}, q{'is not unnice.'} ] ],
        [ q{This "is not unnice."},      [ q{This}, q{"is not unnice."} ] ],
        [ q{a "bb cc" d},                [ q{a}, q{"bb cc"}, q{d} ] ],
        # q{UNbalanced '- and "-quotes at absolute end of string},
        [ q{This is "so very simple},    [ q{This}, q{is}, q{"so very simple} ] ],
        [ q{This 'isn\'t nice.},         [ q{This}, q{'isn\'t nice.} ] ],
        [ q{This "isn\"t nice.},         [ q{This}, q{"isn\"t nice.} ] ],
        [ q{This 'isn\\\\'t nice.},      [ q{This}, q{'isn\\\\'t}, q{nice.} ] ],
        [ q{This "isn\\\\"t nice.},      [ q{This}, q{"isn\\\\"t}, q{nice.} ] ],
        [ q{This 'is not unnice.},       [ q{This}, q{'is not unnice.} ] ],
        [ q{This "is not unnice.},       [ q{This}, q{"is not unnice.} ] ],
        # 'what about these questionable cases?',
        [ q{is this"really so"simple now?},  [ q{is}, q{this"really so"simple}, q{now?} ] ],
        [ q{is this"really so" now?},        [ q{is}, q{this"really so"},       q{now?} ] ],
        [ q{is "really so"simple now?},      [ q{is}, q{"really so"simple},     q{now?} ] ],
        [ q{is this'really so'simple now?},  [ q{is}, q{this'really so'simple}, q{now?} ] ],
        [ q{is this'really so' now?},        [ q{is}, q{this'really so'},       q{now?} ] ],
        [ q{is 'really so'simple now?},      [ q{is}, q{'really so'simple},     q{now?} ] ],
    );

    plan tests => 0+@tests;

    for my $test (@tests) {
        my ($str, $exp) = @$test;
        my $got;

        push @$got, $& while ($str =~ /$re/g);

        is_deeply($got, $exp, qq{<$str>: } . join('|', @$exp));
    }

    -*- mode: compilation; default-directory: "d:/tmp/pm/" -*-
    Compilation started at Sun Oct 24 14:00:21

    C:/Strawberry/perl/bin\perl.exe -w d:/tmp/pm/break_not_quoted.pl
    1..24
    ok 1 - <This is simple.>: This|is|simple.
    ok 2 - < This is simple. >: This|is|simple.
    ok 3 - <This is "so very simple".>: This|is|"so very simple".
    ok 4 - <This "is so" very simple.>: This|"is so"|very|simple.
    ok 5 - <This 'isn\'t nice.'>: This|'isn\'t nice.'
    ok 6 - <This "isn\"t nice.">: This|"isn\"t nice."
    ok 7 - <This 'isn\\'t nice.'>: This|'isn\\'t|nice.'
    ok 8 - <This "isn\\"t nice.">: This|"isn\\"t|nice."
    ok 9 - <This 'is not unnice.'>: This|'is not unnice.'
    ok 10 - <This "is not unnice.">: This|"is not unnice."
    ok 11 - <a "bb cc" d>: a|"bb cc"|d
    ok 12 - <This is "so very simple>: This|is|"so very simple
    ok 13 - <This 'isn\'t nice.>: This|'isn\'t nice.
    ok 14 - <This "isn\"t nice.>: This|"isn\"t nice.
    ok 15 - <This 'isn\\'t nice.>: This|'isn\\'t|nice.
    ok 16 - <This "isn\\"t nice.>: This|"isn\\"t|nice.
    ok 17 - <This 'is not unnice.>: This|'is not unnice.
    ok 18 - <This "is not unnice.>: This|"is not unnice.
    ok 19 - <is this"really so"simple now?>: is|this"really so"simple|now?
    ok 20 - <is this"really so" now?>: is|this"really so"|now?
    ok 21 - <is "really so"simple now?>: is|"really so"simple|now?
    ok 22 - <is this'really so'simple now?>: is|this'really so'simple|now?
    ok 23 - <is this'really so' now?>: is|this'really so'|now?
    ok 24 - <is 'really so'simple now?>: is|'really so'simple|now?

    Compilation finished at Sun Oct 24 14:00:21

Cheers Rolf
(addicted to the Perl Programming Language :) Wikisyntax for the Monastery
