87C751 has asked for the wisdom of the Perl Monks concerning the following question:

(or "How about a round of golf?")

Oh wise and learned monks:

Given an arbitrary string of space-separated words, 'tis easy to use @list = split(' ',$string); to break the string into words. But assume the string has some quote-delimited substrings, as in

one "two three" four five "six seven eight" nine
and I want the end result list to be
@list = ('one', 'two three', 'four', 'five', 'six seven eight', 'nine');
Can this be done with a single regex?

Replies are listed 'Best First'.
Re: In need of a stupid regex trick
by jweed (Chaplain) on Jan 04, 2004 at 20:59 UTC
    use Text::ParseWords; @list = shellwords($string);


    Who is Kayser Söze?
    Code is (almost) always untested.
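    For anyone who wants to try the suggestion above as a complete script, here is a minimal sketch. It assumes only the Text::ParseWords module that ships with Perl; the variable names are illustrative.

```perl
use strict;
use warnings;
use Text::ParseWords qw(shellwords);

my $string = 'one "two three" four five "six seven eight" nine';

# shellwords splits on whitespace the way a shell would:
# quoted substrings stay together and the quotes are removed.
my @list = shellwords($string);

print "$_\n" for @list;
```

    This is not a single regex, but it also copes with backslash escapes and single quotes, which the hand-rolled patterns in this thread do not.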
Re: In need of a stupid regex trick
by CountZero (Bishop) on Jan 04, 2004 at 21:05 UTC
    I don't know about a regex, but Text::CSV_XS can do it:

    use strict;
    use Text::CSV_XS;
    use Data::Dumper;

    my $csv = Text::CSV_XS->new({sep_char=>' '});
    $csv->parse('one "two three" four five "six seven eight" nine');
    my @columns = $csv->fields();
    print Dumper(@columns);

    The result is:

    $VAR1 = 'one';
    $VAR2 = 'two three';
    $VAR3 = 'four';
    $VAR4 = 'five';
    $VAR5 = 'six seven eight';
    $VAR6 = 'nine';

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: In need of a stupid regex trick
by demerphq (Chancellor) on Jan 04, 2004 at 21:16 UTC

    Something like

    my @list=$str=~/("[^"]*"|\S+)/g

    will do, but it doesn't handle escaping, and alas I'm on a box without Perl installed so I haven't tested it.


    ---
    demerphq

      First they ignore you, then they laugh at you, then they fight you, then you win.
      -- Gandhi


      This seems to work but it still needs something extra besides a regex :(

      my $str='one "two three" four five "six seven eight" nine';
      my @list=grep defined, $str=~/"([^"]*)"|(\S+)/g;
      print join "\n", @list;

      CountZero

      "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      Almost, but it keeps the " around the strings.

      CountZero

      "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

        my @list = grep defined, $str=~/"([^"]*)"|(\S+)/g;

        Oops. (still untested...)

        my @list=$str=~/((?<=")[^"]*(?=")|\S+)/g

        ---
        demerphq

          First they ignore you, then they laugh at you, then they fight you, then you win.
          -- Gandhi
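          A note on the lookaround version above: lookarounds do not consume the quote characters, so the \S+ alternative still picks up pieces such as "two and three" with the quotes attached. One way to get a genuinely single regex without grep is the branch-reset group (?|...), sketched below. This assumes Perl 5.10 or later, which postdates this thread, and the variable names are illustrative.

```perl
use strict;
use warnings;

my $str = 'one "two three" four five "six seven eight" nine';

# Inside (?|...) each alternative numbers its captures from the same
# starting point, so both branches fill $1 and there is never an
# undefined second capture to filter out.
my @list = $str =~ /(?|"([^"]*)"|(\S+))/g;

print "$_\n" for @list;
```

          Like the other patterns here, this sketch makes no attempt to handle escaped quotes.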


Re: In need of a stupid regex trick
by Zaxo (Archbishop) on Jan 04, 2004 at 21:24 UTC

    Here's a regex that almost works, but it leaves empty shards. Hence, grep...

    local $_= q(one "two three" four five "six seven eight" nine);
    my @foo = grep {$_} /\G(?:(\w+)\s*)|(?:"([^"]*)"\s*)/g;
    local $,="\n";
    print @foo, $/;

    It works for your data, but I suspect it is very fragile.

    After Compline,
    Zaxo

Re: In need of a stupid regex trick
by ysth (Canon) on Jan 04, 2004 at 21:43 UTC
    It's easier to do this with m//g than split.

    @list = $string =~ /"[^"]+"|\S+/g;                    # before the update
    @list = grep defined, $string =~ /"([^"]*)"|(\S+)/g;  # after the update

    Update: don't leave quotes on; allow empty string "".

    Doesn't handle backslashes before " specially; if there is an unmatched " in the input, you'll get it returned as part of an element.

      perl -wle '@l = split(/(?:"([^"]*)"|\s+)/, $ARGV[0]);$,="\t";print @l;' 'one "two three" four "five"'
        Nice.

        Slight tweaks to require space around "quoted string" (which you may or may not want) and remove undef entries. Update: and remove empty entries.

        perl -wle'@list = grep defined && length, split /(?:(?<!\S)"([^"]*)"(?!\S))|\s+/, shift;print for @list' 'one "two three" four "five"'
        perl -wle'@list = grep defined, split /(?:(?<!\S)"([^"]*)"(?!\S))|\s+/, shift;print for @list' 'one "two three" four "five"'
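        On the point about backslashes before ": a hedged sketch of one way to extend the m//g pattern so that \" inside a quoted string does not end it. The unescaping rule (a backslash followed by any character means that character) is an assumption for illustration, not something specified in this thread.

```perl
use strict;
use warnings;

my $string = 'one "two \"quoted\" three" four';

# Inside quotes, accept either a character that is neither a quote nor
# a backslash, or a backslash followed by any single character.
my @list = grep defined, $string =~ /"((?:[^"\\]|\\.)*)"|(\S+)/g;

# Assumed convention: backslash-x stands for a literal x.
# Note this is applied to unquoted words as well.
s/\\(.)/$1/g for @list;

print "$_\n" for @list;
```

        An unmatched " is still not handled specially; it falls through to the \S+ branch.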
Re: In need of a stupid regex trick
by oha (Friar) on Jan 04, 2004 at 21:24 UTC
    IOW, you wish to split iff there are an even number of quotes since the start of string.

    /(?<=^[^"]*("[^"]*"[^"])*) / does not work: perl states that

    Variable length lookbehind not implemented before HERE mark in regex

    Lookahead works, though, and the following code:

    my @a = ( 'simple',
              'keep simple',
              'a "bit more" difficult',
              'an "increasing more" "and more" here');
    foreach $_ (@a) {
        s/ (?=("[^"]*"[^"])*[^"]+$)/ | /g;
        print "$_\n";
    }

    produces

    simple
    keep | simple
    a | "bit more" | difficult
    an | "increasing more" | "and more" | here

    Unfortunately, the regex "confuses" split and it's not usable; at least I was not able to make it work. But why? :)
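    A likely answer to the "but why?": when a split pattern contains capturing parentheses, split returns the captured text as extra fields between the pieces, so the (...) inside the lookahead pollutes the result. Switching to a non-capturing (?:...) group avoids that. The sketch below also loosens the lookahead to "an even number of quotes remains before the end of the string", because the pattern above only matches when quoted groups sit directly next to each other. The variable names are illustrative.

```perl
use strict;
use warnings;

my $str = 'one "two three" four five "six seven eight" nine';

# Split on a space only when an even number of quotes follows it,
# i.e. when the space lies outside any quoted substring.
# (?:...) groups without capturing, so split adds no extra fields.
my @list = split / (?=(?:[^"]*"[^"]*")*[^"]*$)/, $str;

# The quotes are still attached at this point; strip them separately.
tr/"//d for @list;

print "$_\n" for @list;
```

    This still needs a second step to drop the quotes, so it is not quite the single regex that was asked for.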

Re: In need of a stupid regex trick
by Roger (Parson) on Jan 04, 2004 at 22:36 UTC
    The following solution is based on an example of split with capture I posted earlier...
    use strict;
    use warnings;
    use Data::Dumper;

    my $str = 'one "two three" four five "six seven eight" nine';
    my @words = map { $_ || () } split /"(\\"|.*?)"|\s+/, $str;
    print Dumper(\@words);

    And the output is -

    $VAR1 = [
              'one',
              'two three',
              'four',
              'five',
              'six seven eight',
              'nine'
            ];
Re: In need of a stupid regex trick
by David Caughell (Monk) on Jan 04, 2004 at 22:38 UTC

    My instincts were that it can be done, since regexes are where Perl really shines.

    I'm still learning the language, so I've decided to give this a shot just for fun. jweed's elegant solution (and whoever put it out on CPAN) is the best if you're doing this sort of thing for any purpose other than learning the language, though.

    Now on to something not quite so elegant:

    #!/usr/bin/perl -w
    use strict;

    my $string = 'one "two three" four five "six seven eight" nine';
    my @list = split /
        [ ]"            #opening quotes
        |               #or
        "[ ]            #closing quotes
        |
        [ ]             #a space
        (?!.*?\w")      # that's not before any number of characters followed
                        # by a closing quote (allows EOL at quote)
    /x, $string;

    OOC, is it possible to put a character group (and quantify it with a * + ? or {} ) into a look-ahead match?

    Crap, this isn't quite there yet. The four and five are sticking together. If anyone has suggestions on how to fix that, I'd appreciate it.


    $scratchpad_public = 0 unless $scratchpad;

      is it possible to put a character group (and quantify it with a * + ? or {} ) into a look-ahead match?
      Sure it is. Anything you can do in a plain match, you can do in a lookahead match.

      For lookbehind, it's a different matter: you can only use fixed length lookbehind, so quantifiers (like * and +), and varied-length alternatives, are out. BTW if all alternatives have the same length, it is allowed, as in:

      $_ = q[There's food at the bar.];
      while(/(?<=foo|bar)(\S+)/g) {
          print "$1\n";
      }

      Also, you need to be nice to embedded strings such as "two three ". Currently, your regex calls the second quotation mark an "opening quote" and throws it away. If these spaces are crucial, then that's no good.


      Who is Kayser Söze?
      Code is (almost) always untested.
Re: In need of a stupid regex trick
by pg (Canon) on Jan 04, 2004 at 22:12 UTC
    use Data::Dumper;
    use strict;

    sub my_split {
        local $_ = shift;
        my $abc;
        s/(('|").*?\2)/ ($abc = $1) =~ s!\s+!\cA!g; $abc /ge; #!"
        grep{s/\cA/ /g, $_}split/\s+/;
    }

    my @pieces = my_split q/one "two three" four "five six seven" eight/;
    print Dumper(\@pieces);
Re: In need of a stupid regex trick
by Anonymous Monk on Jan 05, 2004 at 21:00 UTC

    My own 2c. You may like it if you like that the regex itself does all the work...

    my $s = ' one "two three" four five "six seven eight" nine';
    my @w = $s =~ /
        \s*                                  #strip whitespace outside of paren "blocks"
        (?:"(?{local $openq=1}))?            #note fact of open-quote without storing
        ((??{$openq ? '[^"]*' : '\w+'}))     #store block
        (?:(??{$openq ? '"' : '\b'}))        #gobble up closeq or word-boundary (really nop in this case)
    /gx;
    local $" = ':';
    print "@w\n";

    ,welchavw

      uhhh, that AM post there is mine own...so if you'd like to --, then you can hit this node instead!

      ,welchavw