in reply to Re^29: Interleaving bytes in a string quickly
in thread Interleaving bytes in a string quickly

It places no interpretation upon what it is that memory.

I know! You've said this a dozen times already. And what format is that pointed-to memory in? You only guaranteed the format in a node 20 deep or so.

That aside, isn't utf-8 a "form of unicode."?

Look again.

Ah yes, you only said "codepoint", not "unicode". That usually means "Unicode codepoints", but you didn't imply any character semantics.

That aside, isn't utf-8 a "form of unicode."?

Unicode is a character set. You're clearly not dealing with characters.

UTF-8 is a storage format. Typically, it's used to encode Unicode characters, but Perl uses it internally to encode 32-bit or 64-bit integers (depending on your build). Those integers may be codepoints, but that applies to UTF8=0 strings too.
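
A minimal sketch of that last point, using only core Perl (the variable names are illustrative, not from the thread): the same three characters stored with the UTF8 flag off and on compare equal and have the same length, so the flag describes the storage format rather than what the integers mean.

```perl
use strict;
use warnings;

# Three characters that all fit in one byte, so Perl can store the
# string with the UTF8 flag off.
my $downgraded = join '', map { chr } 0x41, 0xE9, 0xFF;

# Copy it and switch the copy to the utf8 storage format.
my $upgraded = $downgraded;
utf8::upgrade($upgraded);

printf "UTF8 flag: %d vs %d\n",
    ( utf8::is_utf8($downgraded) ? 1 : 0 ),
    ( utf8::is_utf8($upgraded)   ? 1 : 0 );
printf "equal: %s\n", ( $downgraded eq $upgraded ? 'yes' : 'no' );
printf "ords: %s\n", join ',', map { ord } split //, $upgraded;
```

Both strings hold the integers 65, 233 and 255; only the internal representation differs.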

Re^31: Interleaving bytes in a string quickly
by BrowserUk (Patriarch) on Mar 01, 2010 at 17:37 UTC
    You only guaranteed the format in a node 20 deep

    No. I guaranteed that a) in the title of the thread; b) when I wrote the code.

    UTF-8 is a storage format. Typically, it's used to encode unicode characters, but Perl uses it internally to encode 32-bit integers (or 64-bit on a 64-bit build, I think).

    Please demonstrate. Cos if that is true, it is something that has completely eluded me.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      First, I misspoke a bit. Perl uses utf8 internally, a Perl-specific derivative of UTF-8. UTF-8 can only encode values up to 0x10FFFF and is really meant for Unicode characters, while utf8 can encode any UV (Perl's unsigned integer type).

      use Devel::Peek qw( Dump );
      my $array = '';
      for my $bit (0..63) {
          $array .= chr( 1 << $bit );
      }
      Dump($array);
      SV = PV(0x511ae0) at 0x5118b0
        REFCNT = 1
        FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
        PV = 0x531200 "\1\2\4\10\20 @\302\200\304\200\310\200\320\200\340\240\200\341\200\200\342\200\200\344\200\200\350\200\200\360\220\200\200\360\240\200\200\361\200\200\200\362\200\200\200\364\200\200\200\370\210\200\200\200\370\220\200\200\200\370\240\200\200\200\371\200\200\200\200\372\200\200\200\200\374\204\200\200\200\200\374\210\200\200\200\200\374\220\200\200\200\200\374\240\200\200\200\200\375\200\200\200\200\200\376\202\200\200\200\200\200\376\204\200\200\200\200\200\376\210\200\200\200\200\200\376\220\200\200\200\200\200\376\240\200\200\200\200\200\377\200\200\200\200\200\201\200\200\200\200\200\200\377\200\200\200\200\200\202\200\200\200\200\200\200\377\200\200\200\200\200\204\200\200\200\200\200\200\377\200\200\200\200\200\210\200\200\200\200\200\200\377\200\200\200\200\200\220\200\200\200\200\200\200\377\200\200\200\200\200\240\200\200\200\200\200\200\377\200\200\200\200\201\200\200\200\200\200\200\200\377\200\200\200\200\202\200\200\200\200\200\200\200\377\200\200\200\200\204\200\200\200\200\200\200\200\377\200\200\200\200\210\200\200\200\200\200\200\200\377\200\200\200\200\220\200\200\200\200\200\200\200\377\200\200\200\200\240\200\200\200\200\200\200\200\377\200\200\200\201\200\200\200\200\200\200\200\200\377\200\200\200\202\200\200\200\200\200\200\200\200\377\200\200\200\204\200\200\200\200\200\200\200\200\377\200\200\200\210\200\200\200\200\200\200\200\200\377\200\200\200\220\200\200\200\200\200\200\200\200\377\200\200\200\240\200\200\200\200\200\200\200\200\377\200\200\201\200\200\200\200\200\200\200\200\200\377\200\200\202\200\200\200\200\200\200\200\200\200\377\200\200\204\200\200\200\200\200\200\200\200\200\377\200\200\210\200\200\200\200\200\200\200\200\200\377\200\200\220\200\200\200\200\200\200\200\200\200\377\200\200\240\200\200\200\200\200\200\200\200\200\377\200\201\200\200\200\200\200\200\200\200\200\200\377\200\202\200\200\200\200\200\200\200\200\200\200\377\200\204\200\200\200\200\200\200\200\200\200\200\377\200\210\200\200\200\200\200\200\200\200\200\200"\0 [UTF8 "\x{1}\x{2}\x{4}\x{8}\x{10} @\x{80}\x{100}\x{200}\x{400}\x{800}\x{1000}\x{2000}\x{4000}\x{8000}\x{10000}\x{20000}\x{40000}\x{80000}\x{100000}\x{200000}\x{400000}\x{800000}\x{1000000}\x{2000000}\x{4000000}\x{8000000}\x{10000000}\x{20000000}\x{40000000}\x{80000000}\x{100000000}\x{200000000}\x{400000000}\x{800000000}\x{1000000000}\x{2000000000}\x{4000000000}\x{8000000000}\x{10000000000}\x{20000000000}\x{40000000000}\x{80000000000}\x{100000000000}\x{200000000000}\x{400000000000}\x{800000000000}\x{1000000000000}\x{2000000000000}..."]
        CUR = 504
        LEN = 512
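
      The byte counts in the dump can be reproduced with a small helper. This is only a sketch: utf8_len is a hypothetical name, its thresholds are read off the dump above rather than taken from perl's source, and it assumes a perl with 64-bit integers so that 1 << 63 does not wrap.

```perl
use strict;
use warnings;

# Hypothetical helper: number of bytes the dump above shows for $value
# in Perl's internal utf8. Thresholds are inferred from that dump,
# not copied from perl's source.
sub utf8_len {
    my ($value) = @_;
    return  1 if $value < 0x80;
    return  2 if $value < 0x800;
    return  3 if $value < 0x1_0000;
    return  4 if $value < 0x20_0000;
    return  5 if $value < 0x400_0000;
    return  6 if $value < 0x8000_0000;
    return  7 if $value < ( 1 << 36 );   # assumes 64-bit integers
    return 13;                           # the long form starting with \377
}

# Add up the lengths for the same 64 values the loop above appends.
my $total = 0;
$total += utf8_len( 1 << $_ ) for 0 .. 63;
print "$total\n";   # prints 504 on a 64-bit perl
```

      The total matches the CUR = 504 reported by Dump, and the 13-byte form only appears for values of 2**36 and above, far beyond the 0x10FFFF limit of UTF-8 proper.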

      Update: First para added.
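
      The utf8 versus UTF-8 distinction from the first paragraph is also visible in the core Encode module, whose documentation describes "utf8" as the lax, Perl-internal form and "UTF-8" as the strict one. A hedged sketch, assuming Encode behaves as documented when given FB_CROAK (the variable names are illustrative):

```perl
use strict;
use warnings;
use Encode qw( encode );

# 0x110000 is one past the last Unicode code point (0x10FFFF).
my $beyond_unicode = chr(0x110000);

# LEAVE_SRC keeps encode() from modifying its input; FB_CROAK makes
# it die instead of substituting a replacement character.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

# Lax "utf8": hands back the bytes Perl stores internally.
my $lax = encode( 'utf8', $beyond_unicode, $check );

# Strict "UTF-8": expected to refuse a value that is not a Unicode code point.
my $strict_ok = eval { encode( 'UTF-8', $beyond_unicode, $check ); 1 };

printf "lax bytes: %s\n", join ' ', map { sprintf '%02X', ord } split //, $lax;
printf "strict accepted: %s\n", ( $strict_ok ? 'yes' : 'no' );
```

      Per the Encode documentation, the lax call should succeed and the strict call should croak.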

        Even with your misspeak, that's just a bug in Perl.

        use Devel::Peek;;
        $a = '';;
        $a = chr( 65 );;
        Dump $a;;
        SV = PV(0x11cfc0) at 0x11f248
          REFCNT = 1
          FLAGS = (POK,pPOK)
          PV = 0x3d6ccc8 "A"\0
          CUR = 1
          LEN = 8
        $a .= chr( 2**32 );;
        Dump $a;;
        SV = PV(0x11cfc0) at 0x11f248
          REFCNT = 1
          FLAGS = (POK,pPOK,UTF8)
          PV = 0x3d6cbd8 "A\376\204\200\200\200\200\200"\0Malformed UTF-8 character (byte 0xfe) in subroutine entry [UTF8 "A\x{0}"]
          CUR = 8
          LEN = 16

        It allows you to construct a malformed utf-8 (unicode) string. It shouldn't.

