Splitting up quoted/escaped command line arguments

Tommy has asked for the wisdom of the Perl Monks concerning the following question:

I've got a requirement to be able to pass a command and its arguments to exec() in indirect object notation (perldoc -f exec). As documented (if I'm understanding correctly), the indirect object syntax avoids the shell, and returns nothing, which is exactly what I need it to do. An example of the syntax would be (according to perldoc):

exec { $command } @args
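
A minimal sketch of that form, assuming a Unix-like system where /bin/sh exists; `run_no_shell` is a hypothetical helper name, not something from perldoc. One detail worth spelling out: the list after the block is the complete argument vector, so its first element is what the program sees as its own name.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: run a program without involving a shell and
# return its exit status.
sub run_no_shell {
    my ($command, @args) = @_;

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: the block names the file to execute; the list is argv,
        # so the program name is repeated as its first element.
        # exec only returns if it failed (e.g. no such file).
        exec { $command } $command, @args
            or POSIX::_exit(127);
    }

    # Parent: wait for the child and decode its exit status.
    waitpid($pid, 0);
    return $? >> 8;
}

# The string after -c reaches sh as a single argument; no outer shell
# ever looks at it.
my $status = run_no_shell('/bin/sh', '-c', 'exit 7');
print "exit status: $status\n";
```

The fork is only there so the calling script survives; a bare exec replaces the current process and never comes back on success.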

I have my input command and its arguments tucked away as a single string in a database row value. An example would look like this:

/usr/local/bin/ssh %s '/opt/something/bin/somebinary || echo "Could not execute somebinary. Confirm that the host has FOO attached and try running somebinary manually on the host to troubleshoot"'

It's not my database, so I don't get to choose or change how commands and arguments are stored. This is what I have to work with. So I have to figure out how to split up that line intelligently, whether in one pass or more, so that my command and its arguments are passed to exec exactly (or as close to it) as they would be if the command was typed into a console.

I seek the wisdom of the Perl Monks in this endeavor... The solution has to be as fool-proof as possible; literally, robust enough to cover the corner cases that joe user may put into the database. My idea so far looks like this (and it doesn't work because it doesn't preserve the order of the arguments). The unpolished nature of this solution is frankly embarrassing to post, and given enough time I could work this out on my own, but it looks like a fun thought exercise. Not fully tested:


my ( $command, $argstring ) = split / /, $string, 2;

# single-quoted args like 'red wagon'
my ( @args ) = $argstring =~ s/[[:space:]]+'(.*?)'//g; 

# double-quoted args, where the closing quote isn't preceded by "\"
push @args, $argstring =~ s/[[:space:]]+"(.*?[^\]*)"//g; 

# args that aren't quoted at all, like --verbose
push @args, split / /, $argstring;

Now as I said, this doesn't preserve the order of args. It also falls on its face if the arg is something like --fullname="Tommy Butler". It also doesn't work for quoted strings within quoted strings. I thought of trying to use Getopt::Long, but it doesn't help me when my command looks like the example I provided above.

Suggestions? TIA!

UPDATE:

This shows promise: https://metacpan.org/pod/Argv -- but it's so complex that it seems like total overkill for my needs. This is just a series of simple splits and/or regexen.

UPDATE 2

I've come up with this (code removed cuz it wuz broked) at the suggestion of kennethk. I haven't been able to break it so far. Can you?

UPDATE 3

ARGGGG I just got it to break on my own test scenario of --name="tommy butler". Back to the drawing board.

UPDATE 4

This one works: GIST -- previous broken code removed from post. It turns out Eily was right; it was a harder problem than I thought. Post facto: If I knew how to pull matches out of a regex within the regex itself and later on in the regex match based on the previous inner capture, I could eliminate the nested if/else on lines 40 thru 49, but I'll work on that later. Thanks everyone!

Tommy
A mistake can be valuable or costly, depending on how faithfully you pursue correction


Replies are listed 'Best First'.
Re: Splitting up quoted/escaped command line arguments by Tanktalus (Canon) on Feb 12, 2014 at 00:33 UTC
Suggestion: Skip it all, and call system.

Ok, I know. You said that the shell involves a lot of overhead, and it's significant. However, compared to the overhead of ssh and the remote shell and the actual code you're calling on the remote side? Maybe not so much. Would have to benchmark the full thing.

Also, all the work you're doing in pretending to be the shell? You're writing it in perl instead of C. Not sure that'll be a win. The only overhead you're saving is re-initialisation of the C runtime library, and that will get partially eaten up by the fact you're parsing in perl vs the shell in C.

Remember the shell parser has two advantages over code you might use: 1. it's written in C, and has likely been overoptimised over the years, and 2. it's correct by definition: bugs have been worked out over the last few decades, and your sysadmins are used to those bugs that remain (thinking of them as features, like the method to escape single quotes). If you aren't bug-for-bug compatible with the shell, it'll be you that is wrong, not the sysadmin. You can't win that game, only triage it until the number of bug reports coming in over your misparsing slows to a manageable crawl.

There are other ways to mitigate this. Some of them are crazier than others.

One is to reduce your fork overhead. If your perl process takes up a lot of memory, when you fork and exec, regardless of what it is, that's a lot of CoW memory to free up each time. I've seen the author of AnyEvent::Fork create a small template process that he shunts the work of forking off to. That process is kept as small as possible, and then is instructed by the parent as to what it should fork and exec. He claims a speed up on that.

Another one is to leave the shell open. Basically, open a shell, and run your command there, but leave it open. Something like this:

`open my $shell, '|-', '/bin/sh' or die "Can't run the shell: $!"; # pick whatever shell you like here.
for my $cmd (@cmds) { print $shell "( $cmd > /dev/null 2> /dev/null )\n"; }`

You can do a bit more here, for example if you use IPC::Open2 or IPC::Open3, you can extract stdout and stderr. You can then encode the return code in the shell output as well. Or, with a bit more work, you can do `; echo $rc >&3` inside there, requiring you to have filehandle 3 opened for it to print the output to so you can receive it.

Notice that I'm using parentheses here in an attempt to limit environment changes, including current working directory. You can run multiple commands through this shell, eliminating all the startup costs, but maintaining the shell's ability to parse the commands. There is some risk of bleed-through (a command with mismatched parentheses can ruin your whole day), but you can blame your sysadmins for those :) On the other hand, each command here is serial, though you could have multiple shells open for a job queue to run them in parallel if you so wanted.

Note that something like AnyEvent::Util::run_cmd can make this less difficult to handle, IMO. YMMV. There are likely other similar, or even better, options on CPAN. Finding them and figuring them out is left as an exercise for the reader :)

Note that I currently have a system that runs multiple ssh's in parallel to multiple co-located servers. I had planned to figure out a way to re-use ssh connections as a potential performance bottleneck because I'm calling ssh thousands of times. But I've not gone down that road in the four years I've been doing this because, quite simply, the performance hit has not been significant enough to warrant time spent on that. (That could be very different if I was ssh'ing over VPN to another continent. I don't know. But that's not possible for my current job, so I'm unconcerned with it.) At this point, I might save 5-10 seconds over the course of a 20-hour job. Probably not even that much.
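
The keep-a-shell-open idea, with the return code encoded in the output, might look roughly like the sketch below. It assumes a Unix-like system with /bin/sh; `run_in_shell` and the `__RC__` marker are made-up names for illustration, and the marker is assumed never to appear in real command output. Stdin is redirected from /dev/null because a command such as ssh would otherwise read (and swallow) the commands queued up for the shell.

```perl
use strict;
use warnings;
use IPC::Open2;
use IO::Handle;

# Arbitrary sentinel (an assumption): must not occur in command output.
my $marker = '__RC__';

# One long-lived shell; commands are written to it, output is read back.
my $pid = open2(my $from_shell, my $to_shell, '/bin/sh');
$to_shell->autoflush(1);

# Hypothetical helper: run one command line in the shared shell and
# return (exit status, output lines). stderr is merged into stdout.
sub run_in_shell {
    my ($cmd) = @_;

    # The subshell limits environment and cwd changes; the echo reports $?.
    print {$to_shell} "( $cmd ) < /dev/null 2>&1; echo \"$marker:\$?\"\n";

    my @output;
    while (defined(my $line = <$from_shell>)) {
        # The marker may follow output that lacked a trailing newline.
        if ($line =~ s/\Q$marker\E:(\d+)\n\z//) {
            my $rc = $1;
            push @output, $line if length $line;
            return ($rc, @output);
        }
        push @output, $line;
    }
    die "shell exited unexpectedly\n";
}

for my $cmd ('echo hello', 'exit 3') {
    my ($rc, @lines) = run_in_shell($cmd);
    print "[$cmd] rc=$rc output=@lines\n";
}
```

As the post warns, a command with unbalanced parentheses or quotes can still derail the shared shell; in this sketch that shows up as the "shell exited unexpectedly" error or as a read that never finishes.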
Re: Splitting up quoted/escaped command line arguments by Eily (Monsignor) on Feb 11, 2014 at 18:43 UTC
"This is just a series of simple splits and/or regexen."

This probably isn't as easy as you think it is. There is already more than one way to make your code "fall on its face". If the path to the program the user wants to call has spaces, you'll end up breaking it in two, whether or not the user added quotes or backslashes where needed. But maybe the users are supposed to run a list of carefully chosen programs. If an argument is "the 'simple' solution", it will be broken into `('simple', 'the solution')`. Even if you correctly checked that a " wasn't preceded by a backslash (you wrote `[^\]"`, which should be a syntax error because the \ has to be escaped, and which means that the " may or may not be preceded by something other than a backslash), it would fail with something like `"\\"`.

So if you find one, do use a module that does the job for you, but I'm afraid I don't know any, and as far as I looked, Argv did not seem to be what you are looking for. I may be wrong on that point.
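
One module that may fit: Text::ParseWords ships with core Perl, and its `shellwords` function splits a string on whitespace while honoring quotes and backslashes. The sketch below runs it on the sample row from the question. It only approximates shell quoting and knows nothing about operators such as `||`, so the results are worth checking against real data before relying on them.

```perl
use strict;
use warnings;
use Text::ParseWords qw(shellwords);

# The sample command from the question (message shortened).
my $string = q{/usr/local/bin/ssh %s '/opt/something/bin/somebinary || echo "Could not execute somebinary."'};

# shellwords removes the outer quotes and keeps each quoted run together.
my ($command, @args) = shellwords($string);

print "command: $command\n";
print "arg: [$_]\n" for @args;

# A quoted value glued to an option name stays one word, and order is kept.
my @words = shellwords(q{prog --name="tommy butler" 'red wagon' --verbose});
print "word: [$_]\n" for @words;
```

The resulting list could then be handed to `exec { $command } $command, @args`, with the `%s` placeholder substituted after splitting so that a host name can never change the word boundaries.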
Re^2: Splitting up quoted/escaped command line arguments by Tommy (Chaplain) on Feb 11, 2014 at 18:53 UTC
Sorry for forgetting the \\ escape. The command will never have spaces in it--only the arguments. I too took a look at the Argv module and tried it out. It doesn't satisfy the needs I have:

`$ perl -MArgv -MData::Dumper -E 'say Data::Dumper::Dumper [ Argv->new( "/usr/local/bin/ssh %s \"/opt/something/bin/somebinary || echo \"Could not execute somebinary.\"" )->argv ]'
$VAR1 = [ '/usr/local/bin/ssh %s "/opt/something/bin/somebinary || echo "Could not execute somebinary."' ];`
Re: Splitting up quoted/escaped command line arguments by kennethk (Abbot) on Feb 11, 2014 at 18:50 UTC
Okay, so given all the qualifiers about how this cannot be robust and that there are all sorts of potential security implications (which is probably why Argv is so complex), you could take one of two approaches:

1. State machine. Crawl the string character by character, keeping track of things like if you opened with a single quote, last saw an equals sign or backslash... Start out with a `for (split //) {...`, and stash the characters on a buffer. The buffer could be either an independent scalar or `$args[-1]`, depending on taste.

2. Regular expression with backreferences. This is more challenging, because regular expressions aren't really intended to split up an entire string, but rather grab substrings. Expressions like `"[^"]*(?<!\\)"` to grab everything between two unescaped double quotes could be helpful, but remember if the command were `echo "He said, \"How are you?\""`, the intended output from your process would be `($command, @args) = ('echo', 'He said, "How are you?"')`, which requires removing the surrounding quotes as well as unescaping.

Note as well there is already a bug with `my ( $command, $argstring ) = split / /, $string, 2;` in the case where the executable path contains a space. I personally would go with the state machine; the logic is more natural and quiet failures are less common in my experience. It will still require the kind of unescaping discussed with 2). Actually, I would probably just use a string exec, since someone already did a lot of work developing a shell, but that's not on spec.

#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
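
The state-machine approach could be sketched as follows. This is an illustration under stated assumptions, not the code from the original poster's gist: `split_command_line` is a hypothetical name, and the quoting rules are a simplified reading of POSIX shell behaviour (single quotes are literal, a backslash inside double quotes only escapes `"`, `\`, `$` and the backtick, and nothing is ever expanded).

```perl
use strict;
use warnings;

# Hypothetical helper: split a command line into words, removing quotes
# and backslash escapes along the way. Dies on unbalanced input.
sub split_command_line {
    my ($line) = @_;
    my @words;
    my $word;               # undef means no word is in progress
    my $state = 'plain';    # plain | single | double

    my @chars = split //, $line;
    while (@chars) {
        my $c = shift @chars;

        if ($state eq 'single') {
            # Everything is literal until the closing single quote.
            if ($c eq "'") { $state = 'plain' }
            else           { $word .= $c }
        }
        elsif ($state eq 'double') {
            if ($c eq '"') {
                $state = 'plain';
            }
            elsif ($c eq '\\' && @chars && index(q{"\\$`}, $chars[0]) >= 0) {
                $word .= shift @chars;    # escaped special character
            }
            else {
                $word .= $c;
            }
        }
        elsif ($c =~ /\s/) {
            # Unquoted whitespace ends the current word, if there is one.
            if (defined $word) { push @words, $word; undef $word }
        }
        elsif ($c eq "'") { $state = 'single'; $word //= '' }
        elsif ($c eq '"') { $state = 'double'; $word //= '' }
        elsif ($c eq '\\') {
            die "trailing backslash\n" unless @chars;
            $word .= shift @chars;        # next character taken literally
        }
        else {
            $word .= $c;
        }
    }

    die "unterminated $state quote\n" if $state ne 'plain';
    push @words, $word if defined $word;
    return @words;
}

my ($command, @args) = split_command_line(q{echo "He said, \"How are you?\""});
print "command: $command\n";
print "arg: [$_]\n" for @args;
```

Because a word only ends at unquoted whitespace, adjacent pieces such as `--name="tommy butler"` come out as a single argument, and argument order is preserved.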
Re^2: Splitting up quoted/escaped command line arguments by Tommy (Chaplain) on Feb 11, 2014 at 18:57 UTC
Without knowing the name for it, I have already started trying to put together a state machine. I called it a "peel off" approach where I look through the string and peel things off one at a time, making sure to note and handle quoted things when I encounter them. I haven't got very far with it yet--just a few minutes working on the idea.
Re^3: Splitting up quoted/escaped command line arguments by Tommy (Chaplain) on Feb 11, 2014 at 19:31 UTC
OK. This mixture of approaches seems to be working so far: I haven't been able to break it yet. Can anyone break this? (Please see UPDATE 2 to the OP)
Re^4: Splitting up quoted/escaped command line arguments by choroba (Cardinal) on Feb 11, 2014 at 20:46 UTC
Re^5: Splitting up quoted/escaped command line arguments by Tommy (Chaplain) on Feb 11, 2014 at 21:25 UTC
Some notes below your chosen depth have not been shown here
Re: Splitting up quoted/escaped command line arguments by choroba (Cardinal) on Feb 11, 2014 at 18:10 UTC
It seems the arguments are already properly quoted in the database to be part of one string. Why don't you just use `exec $string`? لսႽ� ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re^2: Splitting up quoted/escaped command line arguments by Tommy (Chaplain) on Feb 11, 2014 at 18:23 UTC
...Because that sends it through a shell. I need to avoid that.
Re^3: Splitting up quoted/escaped command line arguments by salva (Canon) on Feb 11, 2014 at 18:30 UTC
But those commands would not work unless you pass them through a shell! Or if you prefer to view it from a different angle, you will have to implement in your program all the shell functionality used by those commands! Update: And BTW, if those commands involve calling `ssh`, then a shell would be invoked at the remote side. This is an unavoidable feature of the SSH protocol.
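
A small demonstration of that point, assuming a system with /bin/echo and /bin/sh; `capture_list` is a hypothetical helper. The list form of a piped open, like the indirect-object exec, starts the program directly, so an operator such as `||` arrives as ordinary text.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command given as a list (no shell involved)
# and return everything it wrote to stdout.
sub capture_list {
    my @argv = @_;
    open(my $fh, '-|', @argv) or die "cannot run $argv[0]: $!";
    local $/;                 # slurp mode
    my $out = <$fh>;
    close $fh;
    return $out;
}

# Without a shell, '||' is just another argument handed to echo.
my $direct = capture_list('/bin/echo', 'a', '||', 'echo', 'b');

# With a shell, '||' is an operator; the right-hand side never runs
# because the first echo succeeds.
my $via_shell = capture_list('/bin/sh', '-c', '/bin/echo a || echo b');

print "direct:    $direct";
print "via shell: $via_shell";
```

In the sample row from the question the `||` sits inside the single-quoted argument to ssh, so the local side only has to get the quoting right; the operator itself is interpreted by the shell that sshd starts on the remote host.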
Re^4: Splitting up quoted/escaped command line arguments by Tommy (Chaplain) on Feb 11, 2014 at 18:37 UTC
Re^5: Splitting up quoted/escaped command line arguments by salva (Canon) on Feb 11, 2014 at 21:51 UTC
Some notes below your chosen depth have not been shown here
Re^3: Splitting up quoted/escaped command line arguments by runrig (Abbot) on Feb 11, 2014 at 19:02 UTC
"I need to avoid that." Why?
Re^4: Splitting up quoted/escaped command line arguments by Tommy (Chaplain) on Feb 11, 2014 at 19:08 UTC
Re^5: Splitting up quoted/escaped command line arguments by runrig (Abbot) on Feb 11, 2014 at 19:11 UTC
Some notes below your chosen depth have not been shown here
