Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a library that is returning a scalar to me that incorrectly has its utf8 flag set. The scalar contains "binary" data that should not be treated as utf8.

I want to use length() and print() on the scalar.

Assuming I can't fix the broken library code, what is the best way to handle this? Do I "use bytes"? Do I turn off the utf8 flag (via Encode or pack)? Are there other ways to deal with this?

Re: deal with incorrectly set utf8 flag
by ikegami (Patriarch) on Mar 27, 2009 at 16:15 UTC

    It's not clear what you want.

    • utf8::downgrade switches from bytes internally encoded as UTF-8 to just bytes.

      $ perl -MDevel::Peek -e'utf8::upgrade $x="\200\201"; Dump $x; utf8::downgrade $x; Dump $x'
      PV = 0x81623f0 [UTF8 "\x{80}\x{81}"]
      PV = 0x81623f0 "\200\201"\0

    • utf8::encode will re-encode the data that has been decoded from UTF-8.

      $ perl -MDevel::Peek -e'utf8::decode $x="\x{2660}"; Dump $x; utf8::encode $x; Dump $x'
      PV = 0x81651c0 [UTF8 "\x{2660}"]
      PV = 0x81651c0 "\342\231\240"\0

    • If it's truly an incorrectly set flag, there's also Encode::_utf8_off. It should only be used if the above two don't work.

      $ perl -MDevel::Peek -MEncode=_utf8_on,_utf8_off -e'_utf8_on( $x="\200\201" ); Dump $x; _utf8_off $x; Dump $x'
      PV = 0x81651e8 [UTF8 "\x{1}@"]
      PV = 0x81651e8 "\200\201"\0

    References:
    utf8 (You don't need to load the module to use its subs.)
    Encode

    Update: Added code.

      It appears that utf8::downgrade alters the bytes in the scalar. I'm exactly seeking to avoid this. I want to use the bytes in the scalar without Perl's utf8-handling kicking in.

      The library in question read binary (non-character) data from a database field. It had no business marking the data as utf8, but it did so anyway. Now I'm looking to use the binary data without any conversions or warnings.

      It looks like I can use Encode::_utf8_off, but this is documented as an internal function that shouldn't be relied on. It looks like "use bytes" works, but I don't know if this is the way it should be done. I am looking to find the way.
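
A minimal sketch of the difference described above, assuming the scalar holds two raw bytes that happen to be valid UTF-8. The variable names are illustrative, and the call to Encode::_utf8_on only simulates the library's bug. It shows that utf8::downgrade changes the stored bytes, while bytes::length reports the byte count without modifying the scalar.

```perl
use strict;
use warnings;
use bytes ();    # loads bytes::length without enabling the pragma
use Encode ();

# Two bytes of "binary" data; the flag is then set wrongly (simulated bug).
my $blob = "\xC2\x80";
Encode::_utf8_on($blob);

my $char_len = length $blob;            # Perl counts one character (U+0080)
my $byte_len = bytes::length($blob);    # two bytes are actually stored

# utf8::downgrade converts the character to a single byte,
# so the stored data is no longer what the database held.
my $down = $blob;
utf8::downgrade($down);
my $down_len = bytes::length($down);    # one byte (0x80)
```

Here only bytes::length leaves the data untouched, which matches the observation that "use bytes" appears to work for length().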

        "this is documented as an internal function that shouldn't be relied on."

        It means you should normally use utf8::encode or Encode::encode 'UTF-8'.

        $ perl -MDevel::Peek -MEncode=_utf8_on,_utf8_off,encode -e'
           _utf8_on( $x = "\342\231\240" );
           utf8::encode( my $utf8 = $x );
           my $enc = encode("UTF-8", $x);
           _utf8_off( my $off = $x );
           Dump $x; Dump $utf8; Dump $enc; Dump $off;
        '
        PV = 0x8165280 "\342\231\240"\0 [UTF8 "\x{2660}"]
        PV = 0x81623f0 "\342\231\240"\0
        PV = 0x81920e8 "\342\231\240"\0
        PV = 0x81ff040 "\342\231\240"\0

        But if _utf8_on or equivalent was wrongly used, _utf8_off is appropriate.
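
Putting the advice in this thread together, one possible way to handle the scalar before calling length() and print() is sketched below. The fetch_blob sub is a hypothetical stand-in for the broken library, not part of any real module; it assumes the bytes are intact and only the flag is wrong.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for the broken library: returns raw bytes
# with the UTF8 flag wrongly switched on.
sub fetch_blob {
    my $bytes = "\xE2\x99\xA0";    # three bytes of binary data
    Encode::_utf8_on($bytes);      # simulates the library's bug
    return $bytes;
}

my $blob = fetch_blob();
my $len_before = length $blob;     # 1: Perl sees a single character, U+2660

# Clear the flag only when it is actually set; the bytes are left alone.
Encode::_utf8_off($blob) if utf8::is_utf8($blob);
my $len_after = length $blob;      # 3: the original byte count

binmode STDOUT;                    # write raw bytes, no encoding layer
print $blob;
```

If the library had really decoded the data rather than just flipping the flag, utf8::encode would be the tool to reach for instead, as noted above.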