in reply to Re^30: Interleaving bytes in a string quickly
in thread Interleaving bytes in a string quickly

You only guaranteed the format in a node 20 deep

No. I guaranteed that a) in the title of the thread; b) when I wrote the code.

UTF-8 is a storage format. Typically, it's used to encode unicode characters, but Perl uses it internally to encode 32-bit integers (or 64-bit on a 64-bit build, I think).

Please demonstrate. Cos if that is true, it is something that has completely eluded me.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"I'd rather go naked than blow up my ass"

Re^32: Interleaving bytes in a string quickly
by ikegami (Patriarch) on Mar 01, 2010 at 17:39 UTC

    First, I misspoke a bit. Perl uses utf8 internally, a Perl-specific derivative of UTF-8. UTF-8 can only encode values up to 0x10FFFF and is really meant for Unicode characters, while utf8 can encode any UV.

    use Devel::Peek qw( Dump );

    my $array = '';
    for my $bit (0..63) {
        $array .= chr( 1 << $bit );
    }
    Dump($array);
    SV = PV(0x511ae0) at 0x5118b0
      REFCNT = 1
      FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
      PV = 0x531200 "\1\2\4\10\20 @\302\200\304\200\310\200\320\200\340\240\200\341\200\200\342\200\200\344\200\200\350\200\200\360\220\200\200\360\240\200\200\361\200\200\200\362\200\200\200\364\200\200\200\370\210\200\200\200\370\220\200\200\200\370\240\200\200\200\371\200\200\200\200\372\200\200\200\200\374\204\200\200\200\200\374\210\200\200\200\200\374\220\200\200\200\200\374\240\200\200\200\200\375\200\200\200\200\200\376\202\200\200\200\200\200\376\204\200\200\200\200\200\376\210\200\200\200\200\200\376\220\200\200\200\200\200\376\240\200\200\200\200\200\377\200\200\200\200\200\201\200\200\200\200\200\200\377\200\200\200\200\200\202\200\200\200\200\200\200\377\200\200\200\200\200\204\200\200\200\200\200\200\377\200\200\200\200\200\210\200\200\200\200\200\200\377\200\200\200\200\200\220\200\200\200\200\200\200\377\200\200\200\200\200\240\200\200\200\200\200\200\377\200\200\200\200\201\200\200\200\200\200\200\200\377\200\200\200\200\202\200\200\200\200\200\200\200\377\200\200\200\200\204\200\200\200\200\200\200\200\377\200\200\200\200\210\200\200\200\200\200\200\200\377\200\200\200\200\220\200\200\200\200\200\200\200\377\200\200\200\200\240\200\200\200\200\200\200\200\377\200\200\200\201\200\200\200\200\200\200\200\200\377\200\200\200\202\200\200\200\200\200\200\200\200\377\200\200\200\204\200\200\200\200\200\200\200\200\377\200\200\200\210\200\200\200\200\200\200\200\200\377\200\200\200\220\200\200\200\200\200\200\200\200\377\200\200\200\240\200\200\200\200\200\200\200\200\377\200\200\201\200\200\200\200\200\200\200\200\200\377\200\200\202\200\200\200\200\200\200\200\200\200\377\200\200\204\200\200\200\200\200\200\200\200\200\377\200\200\210\200\200\200\200\200\200\200\200\200\377\200\200\220\200\200\200\200\200\200\200\200\200\377\200\200\240\200\200\200\200\200\200\200\200\200\377\200\201\200\200\200\200\200\200\200\200\200\200\377\200\202\200\200\200\200\200\200\200\200\200\200\377\200\204\200\200\200\200\200\200\200\200\200\200\377\200\210\200\200\200\200\200\200\200\200\200\200"\0 [UTF8 "\x{1}\x{2}\x{4}\x{8}\x{10} @\x{80}\x{100}\x{200}\x{400}\x{800}\x{1000}\x{2000}\x{4000}\x{8000}\x{10000}\x{20000}\x{40000}\x{80000}\x{100000}\x{200000}\x{400000}\x{800000}\x{1000000}\x{2000000}\x{4000000}\x{8000000}\x{10000000}\x{20000000}\x{40000000}\x{80000000}\x{100000000}\x{200000000}\x{400000000}\x{800000000}\x{1000000000}\x{2000000000}\x{4000000000}\x{8000000000}\x{10000000000}\x{20000000000}\x{40000000000}\x{80000000000}\x{100000000000}\x{200000000000}\x{400000000000}\x{800000000000}\x{1000000000000}\x{2000000000000}..."]
      CUR = 504
      LEN = 512

    Update: First para added.
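    For anyone without Devel::Peek to hand, here is a minimal sketch of the same point: the byte length of each character in Perl's internal utf8 keeps growing past the 4 bytes that standard UTF-8 stops at. The helper name `internal_bytes` is made up for illustration, and the byte counts assume an ASCII (non-EBCDIC) platform. It stays at 2**31 or below so it also runs on a 32-bit build.

```perl
use strict;
use warnings;
no warnings 'utf8';    # values above 0x10FFFF are deliberate here

# Hypothetical helper: return the octets Perl uses internally
# for a one-character string holding $code_point.
sub internal_bytes {
    my ($code_point) = @_;
    my $bytes = chr($code_point);
    utf8::encode($bytes);    # in place: characters -> Perl's extended utf8 octets
    return $bytes;
}

for my $bit (7, 16, 21, 31) {
    my $bytes = internal_bytes(1 << $bit);
    printf "2**%-2d => %d byte(s): %s\n",
        $bit, length($bytes),
        join ' ', map { sprintf '%02X', ord } split //, $bytes;
}
```

    The hex bytes printed should line up with the octal escapes in the Dump output above (for example FE 82 80 ... for 2**31 against \376\202\200...).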

      Even with your misspeak, that's just a bug in Perl.

      use Devel::Peek;;
      $a = '';;
      $a = chr( 65 );;
      Dump $a;;
      SV = PV(0x11cfc0) at 0x11f248
        REFCNT = 1
        FLAGS = (POK,pPOK)
        PV = 0x3d6ccc8 "A"\0
        CUR = 1
        LEN = 8

      $a .= chr( 2**32 );;
      Dump $a;;
      SV = PV(0x11cfc0) at 0x11f248
        REFCNT = 1
        FLAGS = (POK,pPOK,UTF8)
        PV = 0x3d6cbd8 "A\376\204\200\200\200\200\200"\0Malformed UTF-8 character (byte 0xfe) in subroutine entry [UTF8 "A\x{0}"]
        CUR = 8
        LEN = 16

      It allows you to construct a malformed utf-8 (unicode) string. It shouldn't.
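      The two positions can be put side by side with the Encode module, which separates the strict "UTF-8" encoding from Perl's lax "utf8". This is only a sketch: it assumes a reasonably recent Encode, and 0x110000 is picked as the first value past the Unicode range.

```perl
use strict;
use warnings;
no warnings 'utf8';    # the out-of-range value is deliberate
use Encode qw( encode FB_CROAK );

my $above_unicode = chr(0x110000);    # one past the last Unicode code point

# Lax, Perl-specific encoding: takes any value that fits in a UV.
my $lax = encode('utf8', $above_unicode);

# Strict encoding: with FB_CROAK it dies rather than substituting.
# encode() may modify its input when a CHECK argument is given, so use a copy.
my $copy   = $above_unicode;
my $strict = eval { encode('UTF-8', $copy, FB_CROAK) };

printf "utf8  : %s\n", join ' ', map { sprintf '%02X', ord } split //, $lax;
print  "UTF-8 : ", ( defined $strict ? "encoded" : "refused" ), "\n";
```

      So the string is malformed as far as strict UTF-8 is concerned, and valid in Perl's own utf8. Which of those the internal format is supposed to honour is what is being argued here.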


        Concerning that warning: it's seen as a tool for when you're dealing with Unicode characters (although a buggy one atm), and one you can turn off when you're dealing with strings of numbers.

        that's just a bug in Perl.

        No, it's quite intentional.

        If anything, the warning is seen as the bug.
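        A rough sketch of "turning it off", assuming Perl 5.14 or later. Note this is not the "Malformed UTF-8 character" message from the Dump above; it uses the non-Unicode warning that print raises on a :utf8 handle, which lives in the same 'utf8' warnings category. The in-memory filehandle and the variable names are only there for illustration.

```perl
use strict;
use warnings;

# Collect warnings instead of letting them go to STDERR.
my @warnings;
$SIG{__WARN__} = sub { push @warnings, $_[0] };

my $value = chr(0x110000);    # a number stored in a string, not a Unicode character

open my $fh, '>:utf8', \my $buffer
    or die "Cannot open in-memory handle: $!";

print {$fh} $value;           # 'utf8' warnings still enabled here
my $with_warnings = @warnings;

{
    no warnings 'utf8';       # treat the string as plain numbers
    print {$fh} $value;
}
my $when_silenced = @warnings - $with_warnings;

close $fh;

print "warnings with 'utf8' enabled : $with_warnings\n";
print "warnings with 'utf8' disabled: $when_silenced\n";
print "bytes written either way     : ", length($buffer), "\n";
```

        Either way the value is written out; the pragma only decides whether Perl complains about it.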