in reply to JSON character encoding

JSON, being a data-exchange format, does not have a concept of “character encodings.”   Its purpose is to allow structured data to be transmitted correctly from place to place, particularly in contexts where HTML and “web servers” are assumed.   (Secondarily, it was intended to be friendly to JavaScript, so that you could simply eval() the payload, as in fact people did in younger and more-innocent days.)

Any time you display or “print” a received data-stream, character encoding does play a part in the process that creates what you see.   Most of the time, a UTF encoding (usually UTF-8) is assumed, and the software in question watches out for the telltale byte sequences in order to “be helpful” and “do the right thing.”   This can get in the way, however, when you are debugging.   Sometimes the best thing to do is to dump the relevant parts in hexadecimal:

00000000 50 65 72 6c 20 70 72 6f 67 72 61 6d 0a |Perl program.|
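A dump like the one above can be produced with a few lines of code.   Here is a minimal sketch in Python (shown purely for illustration; the `hexdump` helper name is my own, and Perl's `unpack` can do the same job):

```python
# Minimal hex-dump helper for debugging raw byte streams.
def hexdump(data: bytes) -> str:
    lines = []
    for offset in range(0, len(data), 16):
        chunk = data[offset:offset + 16]
        # Two-digit hex for each byte, padded to a fixed-width column.
        hexpart = " ".join(f"{b:02x}" for b in chunk)
        # Printable ASCII as-is; everything else (e.g. 0x0a) as a dot.
        textpart = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hexpart:<47}  |{textpart}|")
    return "\n".join(lines)

print(hexdump("Perl program\n".encode("utf-8")))
```

The `|Perl program.|` column shows the trailing newline (0a) as a dot, exactly as in the dump above.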

When you subsequently display the data “for real,” if you know that the data is UTF-encoded, you must somehow tell this to the final rendering device.   In an HTML data-stream, for instance, this is done by means of a Content-Type header (or a meta charset tag).   The necessity to use things like &#1234; depends entirely on what the device has been told to expect.   (Don’t assume that it will “assume.”)
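The same trade-off exists inside JSON itself: a serializer can either emit non-ASCII characters literally (requiring the receiver to decode the bytes correctly) or escape them as \uXXXX so the text is plain ASCII either way.   A short sketch in Python, used here purely for illustration (the payload is hypothetical):

```python
import json

# Hypothetical payload containing a non-ASCII character.
payload = {"name": "Grüße"}

escaped = json.dumps(payload)                      # ASCII-safe \uXXXX escapes
raw = json.dumps(payload, ensure_ascii=False)      # literal characters; the wire encoding matters

print(escaped)  # {"name": "Gr\u00fc\u00dfe"}
print(raw)      # {"name": "Grüße"}
```

Both forms parse back to the identical data structure; the difference is only in what the transport and the receiver must agree on.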

Re^2: JSON character encoding
by Your Mother (Archbishop) on May 30, 2017 at 14:53 UTC
    JSON, being a data exchange format, does not have a concept of “character encodings.”

    This is inaccurate. All character data must have an encoding associated with it, or it is a guessing game of binary noise. The JSON spec dictates:

    JSON documents can be encoded in UTF-8, UTF-16 or UTF-32, the default encoding being UTF-8.
      Fine. JSON has a very restricted concept of character encodings.

      That statement is true, but it is implied rather than stated.   Therefore, let me clarify my previous statement.

      The character encoding of the transferred data must be agreed upon by both the sending and the receiving parties.   The specification says that the data can be UTF-encoded, which is to say that the JSON data-format has no encoding scheme of its own.   Furthermore, the character encoding (or lack thereof) of the transferred data has no bearing on how the data is packaged into syntactically valid, parseable JSON.   The structural format neither contains nor relies upon Unicode characters; rather, it is agnostic to them.   It is equally capable of “sending the necessary bytes to you, correctly,” no matter how you consider the bytes that you received to have been “encoded,” if at all.

      Thank you for the clarification, “Mom.”   That was, indeed, an important point.

        What are you talking about? If the JSON standard dictates a set of encodings and you violate that standard, the file you produce is no longer JSON ... by definition. If it's not UTF, it's not JSON and arguments to the contrary are a waste of time.