in reply to Re^3: Having to manually escape quote character in args to "system"?
in thread Having to manually escape quote character in args to "system"?

The CommandLineToArgV convention mentioned in "Everyone quotes command line arguments the wrong way" is just that - a convention. All programs are free to use different quoting rules, and at least legacy programs do have different rules.

It seems I was way too optimistic. Not only legacy programs, but also modern programs fail to follow the CommandLineToArgvW convention. I stumbled upon a 2022 article by Chris Wellons, The wild west of Windows command line parsing. He went down the rabbit hole in an attempt to get rid of the standard libc, and it is even worse than I thought. Yes, there is an API function for splitting the command line string, CommandLineToArgvW(), which must be called with the command line in "wide" (UTF-16) format, as returned by GetCommandLineW(). But that API function is buried in shell32.dll, which you might want to avoid linking in. And so:

Many runtimes, including Microsoft’s own CRTs, don’t call CommandLineToArgvW and instead do their own parsing. It’s messier than I expected, and when I started digging into it I wasn’t expecting it to involve a few days of research.

The GetCommandLineW has a rough explanation: split arguments on whitespace (not defined), quoting is involved, and there’s something about counting backslashes, but only if they stop on a quote. It’s not quite enough to implement your own, and if you test against it, it’s quickly apparent that this documentation is at best incomplete. It links to a deprecated page about parsing C++ command line arguments with a few more details. Unfortunately the algorithm described on this page is not the algorithm used by GetCommandLineW, nor is it used by any runtime I could find. It even varies between Microsoft’s own CRTs. There is no canonical command line parsing result, not even a de facto standard.

I eventually came across David Deley’s How Command Line Parameters Are Parsed, which is the closest there is to an authoritative document on the matter (also). Unfortunately it focuses on runtimes rather than CommandLineToArgvW, and so some of those details aren’t captured. In particular, the first argument (i.e. argv[0]) follows entirely different rules, which really confused me for while. The Wine documentation was helpful particularly for CommandLineToArgvW. As far as I can tell, they’ve re-implemented it perfectly, matching it bug-for-bug as they do.

(Emphasis mine)
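
To make the "counting backslashes, but only if they stop on a quote" part less abstract, here is a minimal sketch of such a splitter in Python. It is a simplified model of the commonly described rules, not a re-implementation of CommandLineToArgvW() or of any particular CRT: it ignores the special rules for argv[0], treats only space and tab as whitespace, and does not handle the "" inside quotes special case that differs between runtimes. The name split_cmdline is made up for this example.

```python
def split_cmdline(cmdline: str) -> list[str]:
    """Split a Windows-style command line (simplified rules, see text)."""
    args: list[str] = []
    current: list[str] = []
    in_quotes = False
    have_arg = False  # lets an empty "" still count as an argument
    i, n = 0, len(cmdline)
    while i < n:
        c = cmdline[i]
        if c == "\\":
            # Count the whole run of backslashes first.
            start = i
            while i < n and cmdline[i] == "\\":
                i += 1
            count = i - start
            if i < n and cmdline[i] == '"':
                # The run stops on a quote: every pair is one backslash.
                current.append("\\" * (count // 2))
                if count % 2:
                    current.append('"')        # odd: literal quote
                else:
                    in_quotes = not in_quotes  # even: quote toggles mode
                i += 1
            else:
                # The run does not stop on a quote: all literal.
                current.append("\\" * count)
            have_arg = True
        elif c == '"':
            in_quotes = not in_quotes
            have_arg = True
            i += 1
        elif c in " \t" and not in_quotes:
            # Unquoted whitespace ends the current argument, if any.
            if have_arg:
                args.append("".join(current))
                current = []
                have_arg = False
            i += 1
        else:
            current.append(c)
            have_arg = True
            i += 1
    if have_arg:
        args.append("".join(current))
    return args


if __name__ == "__main__":
    print(split_cmdline('a "b c" d'))     # ['a', 'b c', 'd']
    print(split_cmdline(r'say \"hi\"'))   # ['say', '"hi"']
```

Even this toy version needs a lookahead past the backslash run before it can decide what the backslashes mean, which gives an idea why the real implementations drifted apart.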

Chris Wellons also compares to other implementations:

I also peeked at some language runtimes to see how others handle it. Just as expected, Mingw-w64 has the behavior of an old (pre-2008) Microsoft CRT. Also expected, CPython implicitly does whatever the underlying C runtime does, so its exact command line behavior depends on which version of Visual Studio was used to build the Python binary. OpenJDK pragmatically calls CommandLineToArgvW. Go (gc) does its own parsing, with behavior mixed between CommandLineToArgvW and some of Microsoft’s CRTs, but not quite matching either.

He also looked for the inverse function, one that builds a command line from an array of strings such that CommandLineToArgvW() returns that same array. Surprise: Windows offers none, so he implemented it himself.

I’ve always been boggled as to why there’s no complementary inverse to CommandLineToArgvW. When spawning processes with arbitrary arguments, everyone is left to implement the inverse of this under-specified and non-trivial command line format to serialize an argv. Hopefully the receiver parses it compatibly! There’s no falling back on a system routine to help out. This has lead to a lot of repeated effort: it’s not limited to high level runtimes, but almost any extensible application (itself a kind of runtime). Fortunately serializing is not quite as complex as parsing since many of the edge cases simply don’t come up if done in a straightforward way.
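
For illustration, a straightforward serializer along those lines might look like the following Python sketch. It assumes the receiving program parses with the usual backslash-and-quote rules; as the whole article shows, that is an assumption, not a guarantee. The names quote_arg and join_cmdline are made up for this example, and metacharacters of cmd.exe are not handled at all.

```python
def quote_arg(arg: str) -> str:
    """Quote one argument for a Windows-style command line (sketch)."""
    # Non-empty arguments without whitespace or quotes can pass unchanged.
    if arg and not any(ch in arg for ch in ' \t"'):
        return arg
    out = ['"']
    backslashes = 0  # length of the pending backslash run
    for ch in arg:
        if ch == "\\":
            backslashes += 1
        elif ch == '"':
            # Double the pending backslashes, then escape the quote itself.
            out.append("\\" * (backslashes * 2 + 1))
            out.append('"')
            backslashes = 0
        else:
            # Backslashes not followed by a quote stay as they are.
            out.append("\\" * backslashes)
            out.append(ch)
            backslashes = 0
    # Double trailing backslashes so they cannot escape the closing quote.
    out.append("\\" * (backslashes * 2))
    out.append('"')
    return "".join(out)


def join_cmdline(argv: list[str]) -> str:
    """Join already separated arguments into one command line string."""
    return " ".join(quote_arg(a) for a in argv)


if __name__ == "__main__":
    print(join_cmdline(["prog", "b c", 'say "hi"']))
```

Python ships something similar as subprocess.list2cmdline(), which targets the MS C runtime rules; the sketch above is not meant to replace it, only to show that the serializing direction is manageable once the backslash doubling is understood.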

He searched for other implementations:

How do others handle this?

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

Re^5: Having to manually escape quote character in args to "system"?
by afoken (Chancellor) on Feb 16, 2023 at 19:04 UTC

    David Deley’s How Command Line Parameters Are Parsed is not only well written, it also shows how messy the entire DOS/Windows command line handling is.

    Printed to a DIN A4 PDF document, this fills 31 pages. Command line parameter handling on Unix is explained on a single page, plus a heading on the previous page, plus an extra line on the following page, plus four footnotes (19, 1, 17, 18), and that includes three examples. Let's say one and a half pages. Another page is used for the table of contents, and the final page contains just a copyright notice and updates. The remaining 27.5 pages explain what a stinking mess Windows command line parsing is, and how to work around the different parsing rules.

    Alexander