Please don't propagate the trap of using is_utf8 in Perl code. It does not tell you whether the string you have holds UTF-8 encoded bytes; it is only an internal flag for use by Perl itself and by XS code. It is possible, especially after hacks like this one or after incomplete XS code has touched the string, to end up with byte strings where is_utf8 is true and character strings where is_utf8 is false. I would link to some RT bugs for further reading on the issue, but the website doesn't allow me to post them.
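
To make that concrete, here is a minimal sketch (not taken from RT::Client or from the original poster's code; the variable names are made up for illustration). It assumes only core Perl plus the core Encode module, and shows that the flag changes when the internal storage changes, not when the meaning of the string changes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Case 1: a character string whose flag differs between two copies.
my $chars    = "caf\xE9";        # four characters, all below 256
my $upgraded = $chars;
utf8::upgrade($upgraded);        # same characters, different internal storage

my $equal         = ($chars eq $upgraded)    ? 1 : 0;
my $flag_chars    = utf8::is_utf8($chars)    ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;

# Case 2: a byte string (UTF-8 encoded octets) that nevertheless has the flag on.
my $octets = encode('UTF-8', "\x{123}");   # the two octets 0xC4 0xA3
utf8::upgrade($octets);                    # still two octets logically, flag now on

my $octet_len  = length $octets;
my $flag_octet = utf8::is_utf8($octets) ? 1 : 0;

printf "equal=%d flags=%d/%d octets=%d flag=%d\n",
    $equal, $flag_chars, $flag_upgraded, $octet_len, $flag_octet;
```

The two copies in case 1 compare equal even though only one of them is flagged, and the string in case 2 is flagged even though it still holds encoded octets, so the flag answers neither "is this text?" nor "is this UTF-8?".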

Re^3: RT::Client turns occasional binary characters in to wide characters
by wardmw (Acolyte) on Oct 08, 2018 at 15:53 UTC
    Thanks for the response. Given that this string that I am retrieving is actually the contents of a binary file then I should be OK to ignore anything to do with UTF8, given that my source code has no eight-bit or more characters.

    Working from that, I removed every reference to UTF8 subroutines from my code, but I still get this wide character complaint when I try to write the string contents out to a binary (or any) file. So I have removed one potential issue (UTF8), but there is still a problem.

    While I take you at your word that this is not a UTF8 problem (as I understand it), it's odd that running encode('UTF-8'... against the string and writing the results out does not generate this wide character warning.

      Given that this string that I am retrieving is actually the contents of a binary file then I should be OK to ignore anything to do with UTF8, given that my source code has no eight-bit or more characters.

      It depends on how the data is handed to you. Note how, in both cases below, the stored bytes are \304\243, but Perl interprets them differently depending on its internal UTF8 flag. If the module is handing you binary data that has been through an encoding/decoding mix-up, or that has the UTF8 flag incorrectly enabled, you'll see exactly these kinds of strange issues, and that may explain the presence of U+FFFD REPLACEMENT CHARACTER in your original hex dump. Could you show your data with Devel::Peek?

      $ perl -CSD -MDevel::Peek -le 'my $x="\x{123}"; print $x; Dump($x)'
      ģ
      SV = PV(0x1337d70) at 0x1357518
        REFCNT = 1
        FLAGS = (POK,IsCOW,pPOK,UTF8)
        PV = 0x1359790 "\304\243"\0 [UTF8 "\x{123}"]
        CUR = 2
        LEN = 10
        COW_REFCNT = 1
      $ perl -CSD -MDevel::Peek -le 'my $x="\304\243"; print $x; Dump($x)'
      Ä£
      SV = PV(0x1e28d70) at 0x1e48518
        REFCNT = 1
        FLAGS = (POK,IsCOW,pPOK)
        PV = 0x1e4a790 "\304\243"\0
        CUR = 2
        LEN = 10
        COW_REFCNT = 1
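
As for why encode('UTF-8', ...) makes the warning go away: a rough sketch follows, assuming the retrieved string contains at least one code point above 255 (the sample data here is invented, not your actual file contents). Perl warns when asked to print such a character to a handle that has no encoding layer, while the result of encode consists only of octets, so there is nothing left to warn about. That silences the warning but does not by itself give you back the original binary content.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Collect warnings instead of printing them, so they can be inspected.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# Hypothetical "binary" payload that has picked up one code point above 255.
my $data = "PK\x03\x04\x{FFFD}";

# Count the characters that cannot be written as a single byte.
my $wide_count = () = $data =~ /[^\x00-\xFF]/g;

# Write to an in-memory handle with no encoding layer.
open my $fh, '>:raw', \my $buffer or die "cannot open in-memory handle: $!";
print {$fh} $data;                    # warns: Wide character in print
print {$fh} encode('UTF-8', $data);   # octets only, no warning
close $fh;

printf "wide characters: %d, warnings: %d\n", $wide_count, scalar @warnings;
```

If the count of wide characters is non-zero for data that is supposed to be raw bytes, the string was decoded (or otherwise altered) somewhere before it reached you, and that is the place to look.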