in reply to Re^5: JSON::XS (and JSON::PP) appear to generate invalid UTF-8 for character in range 127 to 255
in thread JSON::XS (and JSON::PP) appear to generate invalid UTF-8 for character in range 127 to 255

Except it's kind of hard to understand what the heck the function is doing. 'flagged as utf8', 'store a string internally'... too many implementation details.

This is my very problem with you: You bring up internal details for no reason. And these implementation details just end up confusing people, not helping them.

Except it's kind of hard to understand what the heck the function is doing

That a module is badly documented is not Perl's fault.

Maybe you missed that, ikegami... but I actually never have any problems with mojibake in my Perl code...

Yeah, I know you know you know better.

I've called it 'upgrading' (in quotes) in honor of utf8::upgrade

That doesn't double encode either. That doesn't change the string at all. (Remove the upgrade from your code and you get the same output.)
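To make that concrete, here is a minimal sketch. The sample string and variable names are made up for illustration, not taken from the code under discussion. It upgrades one copy of a string and compares it with an untouched copy:

```perl
use strict;
use warnings;

# Hypothetical sample: one character, U+00FF, stored downgraded (one byte).
my $plain    = "\xff";
my $upgraded = "\xff";
utf8::upgrade($upgraded);    # switch the internal storage to UTF-8

# The internal flag differs between the two copies...
printf "plain flagged: %d, upgraded flagged: %d\n",
    utf8::is_utf8($plain) ? 1 : 0, utf8::is_utf8($upgraded) ? 1 : 0;

# ...but the string value (characters, length, equality) does not.
printf "equal: %d, lengths: %d and %d, ord: 0x%x\n",
    $plain eq $upgraded ? 1 : 0, length($plain), length($upgraded),
    ord($upgraded);
```

Only the storage format changes. Anything that works on the string's characters sees the same string either way.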

Not sure why you even mentioned _utf8_on

_utf8_on and utf8::upgrade both end up with an upgraded string, _utf8_on is the one used throughout the docs for Test::utf8, and your comment was wrong whichever function you were talking about.
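Both calls leave the UTF8 flag on, which is the sense of "upgraded" here. As a rough sketch of what each one does to the string itself (the two-byte sample and the variable names are assumptions chosen for illustration):

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical sample: the two bytes C3 BF, which happen to be the
# UTF-8 encoding of U+00FF.
my $via_upgrade = "\xc3\xbf";
my $via_flag    = "\xc3\xbf";

utf8::upgrade($via_upgrade);    # re-stores the same two characters as UTF-8
Encode::_utf8_on($via_flag);    # only sets the flag; the bytes are reinterpreted

printf "upgrade:  flagged=%d length=%d\n",
    utf8::is_utf8($via_upgrade) ? 1 : 0, length($via_upgrade);
printf "_utf8_on: flagged=%d length=%d ord=0x%x\n",
    utf8::is_utf8($via_flag) ? 1 : 0, length($via_flag), ord($via_flag);
```

With this input, utf8::upgrade keeps the two-character string, while _utf8_on leaves the flag on over the original bytes, so the string now reads as the single character U+00FF.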

Re^7: JSON::XS (and JSON::PP) appear to generate invalid UTF-8 for character in range 127 to 255
by Anonymous Monk on Dec 07, 2014 at 22:50 UTC
    That doesn't double encode either. That doesn't change the string at all. (Remove the upgrade from your code and you get the same output.)
    LOL you just won't budge! Now, why doesn't utf8::upgrade 'change' the string if it does the following: "Converts in-place the internal representation of the string from an octet sequence in the native encoding (Latin-1 or EBCDIC) to UTF-X"?

    Am I to assume that it won't 'change' the string only if the string is indeed in 'native encoding', which can be Latin-1 or EBCDIC? (but not, say, UTF-8?)

    That doesn't seem terribly confusing. Yes, it doesn't do double encoding; as far as I can tell, it looks more like single decoding (perl -MEncode=decode -MDevel::Peek -E 'my $s = "\xff"; Dump decode("Latin-1", $s); utf8::upgrade($s); Dump $s').
    _utf8_on and utf8::upgrade both end up with an upgraded string, _utf8_on is the one used throughout the docs for Test::utf8, and your comment was wrong whichever function you were talking about.
    And utf8::upgrade is the one that produces exactly the kind of string that is_sane_utf8 is intended to catch, so 'upgraded' strings is a good enough description ('decoded from Latin-1' is also good), unlike 'double encoding', where, for is_sane_utf8 purposes, the problem is not with encoding, nor does anything need to happen twice.

      why doesn't utf8::upgrade 'change' the string if it does the following:

      Because it's meant not to, and none of that changes the string.

      And utf8::upgrade is the one that produces exactly the kind of strings that is_sane_utf8 is intended to catch, so 'upgraded' strings is a good enough description

      Nope. It only flags some upgraded strings.

      unlike 'double encoding',

      Yeah, saying it checks for that would be wrong too. I just took your word for it earlier.