The CommandLineToArgvW convention mentioned in "Everyone quotes command line arguments the wrong way" is just that - a convention. All programs are free to use different quoting rules, and at least legacy programs do have different rules.
It seems I was way too optimistic. Not only legacy programs but also modern ones ignore the CommandLineToArgvW convention. I stumbled upon a 2022 article by Chris Wellons, The wild west of Windows command line parsing. He went down the rabbit hole while trying to get rid of the standard libc, and it is even worse than I thought. Yes, there is an API function for splitting the command line string, called CommandLineToArgvW(), which needs to be called with the command line in "wide" (UTF-16) format from GetCommandLineW(). But that API function is buried in shell32.dll, which you might want to avoid linking in. And so:
Many runtimes, including Microsoft’s own CRTs, don’t call CommandLineToArgvW and instead do their own parsing. It’s messier than I expected, and when I started digging into it I wasn’t expecting it to involve a few days of research.
The GetCommandLineW has a rough explanation: split arguments on whitespace (not defined), quoting is involved, and there’s something about counting backslashes, but only if they stop on a quote. It’s not quite enough to implement your own, and if you test against it, it’s quickly apparent that this documentation is at best incomplete. It links to a deprecated page about parsing C++ command line arguments with a few more details. Unfortunately the algorithm described on this page is not the algorithm used by GetCommandLineW, nor is it used by any runtime I could find. It even varies between Microsoft’s own CRTs. There is no canonical command line parsing result, not even a de facto standard.
I eventually came across David Deley’s How Command Line Parameters Are Parsed, which is the closest there is to an authoritative document on the matter (also). Unfortunately it focuses on runtimes rather than CommandLineToArgvW, and so some of those details aren’t captured. In particular, the first argument (i.e. argv[0]) follows entirely different rules, which really confused me for a while. The Wine documentation was helpful particularly for CommandLineToArgvW. As far as I can tell, they’ve re-implemented it perfectly, matching it bug-for-bug as they do.
(Emphasis mine)
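To make the "counting backslashes, but only if they stop on a quote" rule concrete, here is a minimal sketch in Python. It is an illustration only, not the algorithm of CommandLineToArgvW or of any particular CRT: the function name split_command_line is made up, and the sketch deliberately ignores the special rules for argv[0], the handling of a doubled quote inside a quoted region, and wildcard expansion, which are exactly the places where real implementations disagree.

```python
from typing import List


def split_command_line(cmdline: str) -> List[str]:
    """Split a command line string with simplified backslash/quote rules.

    Hypothetical helper for illustration; real runtimes differ in details.
    """
    args: List[str] = []
    current: List[str] = []
    in_quotes = False
    have_arg = False  # lets an empty "" count as an argument
    i, n = 0, len(cmdline)
    while i < n:
        c = cmdline[i]
        if c == "\\":
            # Measure the run of backslashes starting here.
            j = i
            while j < n and cmdline[j] == "\\":
                j += 1
            count = j - i
            if j < n and cmdline[j] == '"':
                # Run stops on a quote: each pair yields one backslash.
                current.append("\\" * (count // 2))
                if count % 2:
                    current.append('"')  # odd count: literal quote
                else:
                    in_quotes = not in_quotes  # even count: quote toggles
                i = j + 1
            else:
                # Run does not stop on a quote: backslashes are literal.
                current.append("\\" * count)
                i = j
            have_arg = True
        elif c == '"':
            in_quotes = not in_quotes
            have_arg = True
            i += 1
        elif c in " \t" and not in_quotes:
            # Unquoted whitespace ends the current argument, if any.
            if have_arg:
                args.append("".join(current))
                current = []
                have_arg = False
            i += 1
        else:
            current.append(c)
            have_arg = True
            i += 1
    if have_arg:
        args.append("".join(current))
    return args
```

Feeding it the raw string "a b" c\"d "" gives three arguments: a b, c"d, and an empty string. A program using different rules may well split the same string differently, which is the whole problem.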
Chris Wellons also compares to other implementations:
I also peeked at some language runtimes to see how others handle it. Just as expected, Mingw-w64 has the behavior of an old (pre-2008) Microsoft CRT. Also expected, CPython implicitly does whatever the underlying C runtime does, so its exact command line behavior depends on which version of Visual Studio was used to build the Python binary. OpenJDK pragmatically calls CommandLineToArgvW. Go (gc) does its own parsing, with behavior mixed between CommandLineToArgvW and some of Microsoft’s CRTs, but not quite matching either.
He also researched and implemented the inverse function, which creates a command line from an array of strings such that CommandLineToArgvW() returns the same array of strings. Surprise: Windows does not provide one.
I’ve always been boggled as to why there’s no complementary inverse to CommandLineToArgvW. When spawning processes with arbitrary arguments, everyone is left to implement the inverse of this under-specified and non-trivial command line format to serialize an argv. Hopefully the receiver parses it compatibly! There’s no falling back on a system routine to help out. This has led to a lot of repeated effort: it’s not limited to high level runtimes, but almost any extensible application (itself a kind of runtime). Fortunately serializing is not quite as complex as parsing since many of the edge cases simply don’t come up if done in a straightforward way.
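As a rough idea of what such a serializer looks like, here is a sketch in Python. The names quote_arg and build_command_line are made up for this example, and it is not Chris Wellons's implementation. It assumes the receiving program parses with the backslash/quote convention discussed above, and it does nothing about cmd.exe metacharacters, so it only helps when the string goes straight to CreateProcess without a shell in between.

```python
from typing import Iterable


def quote_arg(arg: str) -> str:
    """Quote one argument for a backslash/quote style parser (sketch)."""
    # Non-empty arguments without whitespace or quotes can pass unchanged.
    if arg and not any(ch in arg for ch in ' \t\n\v"'):
        return arg
    out = ['"']
    backslashes = 0  # length of the pending run of backslashes
    for ch in arg:
        if ch == "\\":
            backslashes += 1
        elif ch == '"':
            # Double the pending backslashes, then escape the quote itself.
            out.append("\\" * (2 * backslashes + 1))
            out.append('"')
            backslashes = 0
        else:
            # Backslashes not followed by a quote stay as they are.
            out.append("\\" * backslashes)
            out.append(ch)
            backslashes = 0
    # Trailing backslashes would otherwise escape the closing quote.
    out.append("\\" * (2 * backslashes))
    out.append('"')
    return "".join(out)


def build_command_line(argv: Iterable[str]) -> str:
    """Join quoted arguments with single spaces."""
    return " ".join(quote_arg(a) for a in argv)
```

The only subtle parts are the two places where a run of backslashes must be doubled: directly before an embedded quote, and directly before the closing quote.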
He searched for other implementations:
How do others handle this?
- The aged Emacs implementation is written in C rather than Lisp, steeped in history with vestigial wrong turns. Emacs still only calls the “narrow” CreateProcessA despite having every affordance to do otherwise, and uses the wrong encoding at that. A personal source of headaches.
- CPython uses Python rather than C via subprocess.list2cmdline. While undocumented, it’s accessible on any platform and easy to test against various inputs. Try it out!
- Go (gc) is just as delightfully boring as I’d expect.
- OpenJDK optimistically optimizes for command line strings under 80 bytes, and like Emacs, displays the weathering of long use.
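Taking up the "Try it out!" for CPython: subprocess.list2cmdline is a plain Python function in the standard library, so it can be tried on any platform, not just Windows. The argument list below is just sample data chosen to hit the interesting cases (a space, embedded quotes, an empty string, a trailing backslash).

```python
import subprocess

# Sample arguments covering the usual trouble spots.
argv = ["prog.exe", "a b", 'say "hi"', "", "C:\\dir with space\\"]

# list2cmdline builds one command line string from the list.
cmdline = subprocess.list2cmdline(argv)
print(cmdline)
```

Note that the result is only correct for receivers that parse the way the Microsoft C runtime does. It says nothing about what cmd.exe or a program with its own parser makes of the string.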
Alexander
In reply to Re^4: Having to manually escape quote character in args to "system"?
by afoken
in thread Having to manually escape quote character in args to "system"?
by vr