foobar1977 has asked for the wisdom of the Perl Monks concerning the following question:

Hey Peeps

After a little guidance on the most effective way to split a string based on multiple delimiters.

For the most part, splitting on space would suffice; however, the input also allows for quoted stuff, which could/will contain spaces.

Essentially it is a space separated list of commands, however any of those commands taking arguments must be wrapped in double quotes.

'command1 command2 command3 "command4 --some-arg arg --some-other-arg 2" command5'

I'm struggling to see the most efficient way to split this into an array without jumping through hoops and I know there must be a way to do it.

Any pointers greatly appreciated :)

Replies are listed 'Best First'.
Re: splitting on multiple delimiters
by toolic (Bishop) on Jun 09, 2008 at 17:14 UTC
    I think the core module, Text::ParseWords, does what you want:
    use strict;
    use warnings;
    use Data::Dumper;
    use Text::ParseWords;

    my $str = 'command1 command2 command3 "command4 --some-arg arg --some-other-arg 2" command5';
    my @words = shellwords($str);
    print Dumper(\@words);

    prints:

    $VAR1 = [
              'command1',
              'command2',
              'command3',
              'command4 --some-arg arg --some-other-arg 2',
              'command5'
            ];
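
    A minimal sketch, not from the original reply, of how Text::ParseWords treats escaped double quotes and how quotewords() differs from shellwords(). The input string here is a made-up example; only the functions themselves come from the module.

```perl
use strict;
use warnings;
use Text::ParseWords qw(shellwords quotewords);

# Hypothetical input: a quoted command whose argument itself contains
# backslash-escaped double quotes.
my $str = 'command1 "command4 --msg \"hello world\"" command5';

# shellwords() splits on whitespace, strips the outer quotes and
# unescapes \" inside a double-quoted token.
my @words = shellwords($str);
print "<$_>\n" for @words;

# quotewords() takes a delimiter pattern and a "keep" flag; with a true
# flag the quotes and backslashes are left in the returned tokens.
my @raw = quotewords('\s+', 1, $str);
print "[$_]\n" for @raw;
```

    If the quotes in the input are unbalanced, these functions return an empty list, so it is worth checking for that before using the result.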
Re: splitting on multiple delimiters
by moritz (Cardinal) on Jun 09, 2008 at 17:00 UTC
    use strict;
    use warnings;

    my $str = 'command1 command2 command3 "command4 --some-arg arg --some-other-arg 2" command5';
    for (split m{(\s+|"[^"]*")}, $str) {
        print $_, $/;
    }

    That works, but because the delimiter is captured it also produces a few strings that contain only whitespace, plus some empty strings. You can filter them out in a separate pass, for example with grep.

    The other solution is to match your desired output, not the delimiter.
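
    A minimal sketch of that "match the output" idea (not from the original reply; the pattern is one assumed way to write it). Each iteration matches either a double-quoted chunk or a run of non-whitespace, and keeps whichever capture was set.

```perl
use strict;
use warnings;

my $str = 'command1 command2 command3 "command4 --some-arg arg --some-other-arg 2" command5';

my @words;
# Try a quoted chunk first ($1 holds the text between the quotes),
# otherwise fall back to a run of non-whitespace characters ($2).
while ( $str =~ m{ "([^"]*)" | (\S+) }xg ) {
    push @words, defined $1 ? $1 : $2;
}

print "<$_>\n" for @words;
```

    This strips the surrounding quotes as a side effect. It does not handle escaped quotes, and an unbalanced quote simply ends up inside an ordinary word rather than raising an error.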

    Update: version with filtering:

    use strict;
    use warnings;

    my $str = 'command1 command2 command3 "command4 --some-arg arg --some-other-arg 2" command5';
    for (split m{(\s+|"[^"]*")}, $str) {
        next if m/^\s/;
        next unless length $_;
        print "<$_>\n";
    }
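
    For comparison, a sketch of the same split with the filtering done by grep in a single expression, as suggested earlier in this reply. The extra map that strips the surrounding quotes is an addition for illustration and is not part of the original code.

```perl
use strict;
use warnings;

my $str = 'command1 command2 command3 "command4 --some-arg arg --some-other-arg 2" command5';

my @words =
    map  { ( my $w = $_ ) =~ s/^"(.*)"\z/$1/; $w }   # drop the outer quotes, if any
    grep { /\S/ }                                     # discard empty and whitespace-only fields
    split m{(\s+|"[^"]*")}, $str;

print "<$_>\n" for @words;
```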
Re: splitting on multiple delimiters
by johngg (Canon) on Jun 09, 2008 at 21:14 UTC
    This solution splits on whitespace and passes the result into a map which maintains a small state machine, aggregating a command and its arguments and passing an undef onwards while within quotes; a grep is then used to get rid of the undefs. It is a little like punch_card_don's solution in concept. It will not cope with escaped double quotes as it stands.

    #!/usr/bin/perl -l
    #
    use strict;
    use warnings;

    my $string = q{command1 command2 command3 "command4 --some-arg arg --some-other-arg 2" command5};

    my $inQuotes  = 0;
    my $agregator = q{};
    my @cmds =
        grep { defined }
        map {
            if ( $inQuotes ) {
                if ( m{"$} ) {
                    s{"}{};
                    $agregator .= qq{ $_};
                    $inQuotes = 0;
                    $agregator;
                }
                else {
                    $agregator .= qq{ $_};
                    undef;
                }
            }
            elsif ( m{^"} ) {
                s{"}{};
                $inQuotes  = 1;
                $agregator = $_;
                undef;
            }
            else {
                $_;
            }
        }
        split m{\s+}, $string;

    print for @cmds;

    The output.

    command1
    command2
    command3
    command4 --some-arg arg --some-other-arg 2
    command5

    I hope this is of interest.

    Cheers,

    JohnGG

Re: splitting on multiple delimiters
by BrowserUk (Patriarch) on Jun 09, 2008 at 18:15 UTC
Re: splitting on multiple delimiters
by punch_card_don (Curate) on Jun 09, 2008 at 19:48 UTC
    Sometimes you want to explicitly show the logic used to deconstruct and reconstruct things:
    $str = 'command1 command2 command3 "command4 --some-arg arg --some-other-arg 2" command5';

    @words = split(/\s/, $str);

    $arg_flag = 0;
    for $i (0 .. $#words) {
        if ($words[$i] =~ /\"/) {
            if ($arg_flag == 0) {
                #starting a new command with args
                $arg_flag = 1;
                $words[$i] =~ s/\"//;
                $this_command = $words[$i];
            } else {
                #ending a new command with args
                $words[$i] =~ s/\"//;
                $this_command .= " ".$words[$i];
                push (@commands, $this_command);
                $arg_flag = 0;
            }
        } else {
            if ($arg_flag == 0) {
                #any old command
                push (@commands, $words[$i]);
            } else {
                # an arg in a command being built
                $this_command .= " ".$words[$i];
            }
        }
    }
    It's clunky, but it works, and it preserves the logic for future generations to understand.



    Forget that fear of gravity,
    Get a little savagery in your life.
Re: splitting on multiple delimiters
by GrandFather (Saint) on Jun 09, 2008 at 23:59 UTC

    This looks like a job for Text::xSV. Consider:

    use strict;
    use warnings;
    use Text::xSV;

    my $line = 'command1 command2 command3 "command4 --some-arg arg --some-other-arg 2" command5';

    open my $fh, '<', \$line;
    my $parser = Text::xSV->new (fh => $fh, sep => ' ');
    my @params = $parser->get_row ();

    print join "\n", @params;

    Prints:

    command1
    command2
    command3
    command4 --some-arg arg --some-other-arg 2
    command5

    Perl is environmentally friendly - it saves trees
Re: splitting on multiple delimiters (2 ways)
by tye (Sage) on Jun 10, 2008 at 06:48 UTC

    Rolled two ways to do it, in case just one page of code might be enlightening.

    # Simple command argument extraction:

    sub grabArgs {
        my( $cmd )= @_;
        my @args;
        while(  $cmd =~ m{
            (?: ^ | \G )        # Don't skip any characters
            \s*                 # Skip leading whitespace
            (?: ([^\s"]+)       # Non-space non-quotes are simple arguments, capture
            |   (")             # Double-quoted string, capture it as $2
                (               # Capture the inside as $3
                    (?:         # Zero or more of the following
                        \\.     # Use \\ and \" to get \ and " inside quotes
                    |   [^\\"]+ # Not \ nor " means just include it
                    )*
                )"
            |   ($)             # End of string as $4
            )
            (?! \S )            # Arguments must be space-separated
        }xgc ) {
            return \@args   if  defined $4;
            push @args, $2 ? $3 : $1;
            $args[-1] =~ s/\\(.)/$1/g   if  $2;
        }
        substr( $cmd, pos($cmd), 0, "<ERROR>" );
        die "Invalid command specification near '<ERROR>': $cmd\n";
    }

    # Pretty much what Text::Shellwords does:

    sub parseArgs {
        my( $cmd )= @_;
        my @args;
        while(  $cmd =~ m{
            (?: ^\s* | \G )
            (?: ($)                             # $1: End of string
            |   (\s+)                           # $2: Whitespace
            |   ([^\s\\"']+)                    # $3: Nothing special
            |   '( [^']* )'                     # $4: 'string'
            |   "( (?: \\. | [^\\"]+ )* )"      # $5: "string"
            |   \\ (.)                          # $6: \x escape
            )
        }xgc ) {
            my( $end, $sp, $ns, $sq, $dq, $esc )= ( $1, $2, $3, $4, $5, $6 );
            return \@args   if  defined $end;
            push @args, ''   if  ! @args;
            if(  $sp  ) {
                push @args, '';
            } elsif(  defined $ns  ) {
                $args[-1] .= $ns;
            } elsif(  defined $sq  ) {
                $args[-1] .= $sq;
            } elsif(  defined $dq  ) {
                $dq =~ s/\\(.)/$1/g;
                $args[-1] .= $dq;
            } elsif(  defined $esc  ) {
                $args[-1] .= $esc;
            } else {
                die "Buggy code";
            }
        }
        substr( $cmd, pos($cmd), 0, "<ERROR>" );
        die "Invalid command specification near '<ERROR>': $cmd\n";
    }

    Interestingly, I had to go back to perl 5.005_03 in order to find a Perl that this works on. It appears to tickle bugs in Perl for several 5.8 and 5.6 versions. I might test on 5.10 tomorrow when I have a copy handy again.

    - tye        

Re: splitting on multiple delimiters
by foobar1977 (Novice) on Jun 10, 2008 at 08:00 UTC

    Wow... thank you all for your replies.

    I can't believe I missed Text::ParseWords though; that is probably the best solution for me. The other examples are food for thought though, especially the solution posted by tye :D

    Thanks once again for your help