in reply to JSON::XS (and JSON::PP) appear to generate invalid UTF-8 for character in range 127 to 255

Ha! What would you say now, ikegami? When people like James Keenan and Ovid don't understand how this stuff works... can less experienced programmers even hope to ever get this right?
So the string containing guillemets («») is valid UTF-8, but the resulting JSON is not. What am I missing? The `utf8` pragma is correctly marking my source.
JSON::XS says:
(encode_json) Converts the given Perl data structure to a UTF-8 encoded, binary string (that is, the string contains octets only). Croaks on error.
Test::utf8 says:
(is_sane_utf8) This test fails if the string contains something that looks like it might be dodgy utf8, i.e. containing something that looks like the multi-byte sequence for a latin-1 character but perl hasn't been instructed to treat as such... This test fails when... The string contains utf8 byte sequences and the string hasn't been flagged as utf8 (this normally means that you got it from an external source like a C library;
Apparently it tests whether the string was properly decoded... (I'm not familiar with it). I guess you need to Encode::decode_utf8 it before feeding the string to the second is_sane_utf8 (Test::utf8 has an example with Encode::_utf8_on).
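A minimal sketch of what the JSON::XS documentation quoted above describes, using only core modules. It assumes JSON::PP stands in for JSON::XS (both export encode_json), and it uses a strict Encode::decode as the validity check instead of Test::utf8. The variable names are made up for illustration.

```perl
use strict;
use warnings;
use utf8;                                   # the literal below is UTF-8 in the source
use JSON::PP qw(encode_json);               # core module; assumed stand-in for JSON::XS
use Encode qw(decode FB_CROAK LEAVE_SRC);

my $text = "«guillemets»";                  # decoded text: a string of characters

# encode_json returns UTF-8 encoded octets, not characters
my $json = encode_json([$text]);

# A strict decode croaks on malformed UTF-8, so surviving this line
# means the octets are valid UTF-8. LEAVE_SRC keeps $json intact.
my $decoded = decode('UTF-8', $json, FB_CROAK | LEAVE_SRC);

printf "octets: %d, characters: %d\n", length $json, length $decoded;
```

The octet count is larger than the character count because each guillemet takes two octets in UTF-8. A check that expects decoded text has to be given $decoded, not $json.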

Replies are listed 'Best First'.
Re^2: JSON::XS (and JSON::PP) appear to generate invalid UTF-8 for character in range 127 to 255
by ikegami (Patriarch) on Dec 07, 2014 at 04:20 UTC

    Ha! What would you say now, ikegami?

    That Ovid used a function without reading what it does first. My exact words are here.

Re^2: JSON::XS (and JSON::PP) appear to generate invalid UTF-8 for character in range 127 to 255
by Anonymous Monk on Dec 07, 2014 at 03:11 UTC

    Ha! What would you say now, ikegami? When people like James Keenan and Ovid don't understand how this stuff works... can less experienced programmers even hope to ever get this right?

    So Ovid got confused about the basics when dealing with some Test:: extras, so what? It's OK to get confused.

      It's not the basics, this is the problem. I'm certainly NOT blaming people for becoming confused... It's Perl's problem (ikegami disagrees).

      Looking at the source of the test in question, is_sane_utf8 tests whether the string was improperly 'upgraded' (the so-called 'double encoding')... rejecting the JSON is more or less a side effect. Quickly, tell me, what that actually means?

        Damn right I disagree. It is not Perl's problem that someone used a function that's documented to check for accidental double-encoding to check whether something is valid UTF-8. That's akin to using uc to get the first character of a string. There's nothing Perl can do to stop you from using a function completely unrelated to the one you want to use.

        This is the second time in this thread you've implied that I maintain that Perl's handling of UTF-8 isn't confusing. That's a lie. The former bugs in Perl (some still present) and the plethora of buggy XS modules (because XS is hard!) have led people like you to disseminate misinformation, which has created a self-feeding vicious loop of confused people. I've repeatedly said that Perl should be able to differentiate encoded strings from decoded strings and prevent you from mixing them.

        Speaking of misinformation, improper upgrading doesn't cause double-encoding. Quite the opposite, it causes a string encoded using UTF-8 to become decoded. (Upgrading a string that isn't encoded using UTF-8 creates a corrupt scalar, as seen using perl -MDevel::Peek -MEncode=_utf8_on -we"$_ = qq{\x80}; _utf8_on($_); Dump($_)".)

        Quickly, tell me, what that actually means?

        Double encoding is doing encode_utf8(encode_utf8($x)) when you mean to do encode_utf8($x).
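A short sketch of that definition, assuming nothing beyond the core Encode module. The variable names are hypothetical, and the sample character is the left guillemet from the original question.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

my $x = "\x{AB}";                     # one character: U+00AB, the left guillemet

my $once  = encode_utf8($x);          # what you meant to do
my $twice = encode_utf8($once);       # double encoding: the octets get encoded again

printf "once:  %s\n", unpack 'H*', $once;
printf "twice: %s\n", unpack 'H*', $twice;

# A single decode removes only one layer, leaving the once-encoded octets.
my $back = decode_utf8($twice);
print "one decode of the double-encoded string yields the single-encoded octets\n"
    if $back eq $once;
```

The second encode_utf8 treats each octet of the already-encoded string as a character, which is why the result grows from two octets to four.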