The CommandLineToArgV convention mentioned in "Everyone quotes command line arguments the wrong way" is just that - a convention. All programs are free to use different quoting rules, and at least legacy programs do have different rules.

It seems I was way too optimistic. Not only legacy programs, but also modern programs don't follow the CommandLineToArgV convention. I stumbled upon an article by Chris Wellons, The wild west of Windows command line parsing, from 2022. He ran down the rabbit hole, in an attempt to get rid of the standard libc, and it is even worse than I thought. Yes, there is an API function for splitting the command line string, called CommandLineToArgvW(), which needs to be called with the command line in "wide" (UCS-2) format from GetCommandLineW(). But that API function is burried in shell32.dll, which you might want to avoid linking in. And so:

Many runtimes, including Microsoft’s own CRTs, don’t call CommandLineToArgvW and instead do their own parsing. It’s messier than I expected, and when I started digging into it I wasn’t expecting it to involve a few days of research.

The GetCommandLineW has a rough explanation: split arguments on whitespace (not defined), quoting is involved, and there’s something about counting backslashes, but only if they stop on a quote. It’s not quite enough to implement your own, and if you test against it, it’s quickly apparent that this documentation is at best incomplete. It links to a deprecated page about parsing C++ command line arguments with a few more details. Unfortunately the algorithm described on this page is not the algorithm used by GetCommandLineW, nor is it used by any runtime I could find. It even varies between Microsoft’s own CRTs. There is no canonical command line parsing result, not even a de facto standard.

I eventually came across David Deley’s How Command Line Parameters Are Parsed, which is the closest there is to an authoritative document on the matter (also). Unfortunately it focuses on runtimes rather than CommandLineToArgvW, and so some of those details aren’t captured. In particular, the first argument (i.e. argv[0]) follows entirely different rules, which really confused me for while. The Wine documentation was helpful particularly for CommandLineToArgvW. As far as I can tell, they’ve re-implemented it perfectly, matching it bug-for-bug as they do.

(Emphasis mine)

Chris Wellons also compares to other implementations:

I also peeked at some language runtimes to see how others handle it. Just as expected, Mingw-w64 has the behavior of an old (pre-2008) Microsoft CRT. Also expected, CPython implicitly does whatever the underlying C runtime does, so its exact command line behavior depends on which version of Visual Studio was used to build the Python binary. OpenJDK pragmatically calls CommandLineToArgvW. Go (gc) does its own parsing, with behavior mixed between CommandLineToArgvW and some of Microsoft’s CRTs, but not quite matching either.

And he also researched and implemented the inverse function, creating a command line from an array of strings for which CommandLineToArgvW() returns the same array of strings. Surprise: There is none.

I’ve always been boggled as to why there’s no complementary inverse to CommandLineToArgvW. When spawning processes with arbitrary arguments, everyone is left to implement the inverse of this under-specified and non-trivial command line format to serialize an argv. Hopefully the receiver parses it compatibly! There’s no falling back on a system routine to help out. This has lead to a lot of repeated effort: it’s not limited to high level runtimes, but almost any extensible application (itself a kind of runtime). Fortunately serializing is not quite as complex as parsing since many of the edge cases simply don’t come up if done in a straightforward way.

He searched for other implementations:

How do others handle this?

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

In reply to Re^4: Having to manually escape quote character in args to "system"? by afoken
in thread Having to manually escape quote character in args to "system"? by vr

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.