WARNING! What follows WILL NOT SOLVE YOUR ORIGINAL PROBLEM! To understand why, skip to the second Warning! close to the bottom.

Also, all the code and output below are live output from one of my test sessions, with some (where some equals lots:) of my failed attempts to understand Unicode omitted. Every single line is real code and real output, but that doesn't mean all the conclusions I draw or the assumptions I make are correct. Only that they fit my observations using limited test data!

Tanaka, as tye already mentioned, pack( 'B*', 1110110010010000 ) is packing a large decimal number, not a string of bits.

What no one has mentioned yet (as I'm writing; by the time I post, it may be different:) is that, if you really want to pack an ASCII-encoded binary string into bytes, you need to put quotes around it.

    $bobpacked = pack( 'B*', '1110110010010000' )
    print $bobpacked
    8É
    $bobagain = unpack( 'B*', $bobpacked )
    print $bobagain
    1110110010010000
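The same round trip can be written as a small standalone script. This is only a sketch: the variable names are mine, and I use unpack 'H*' (not part of the session above) to show the packed bytes as hex, since raw bytes print differently depending on your terminal's code page.

```perl
use strict;
use warnings;

# The bit string is quoted, so pack sees the characters '1' and '0'
# rather than a large decimal number.
my $bits   = '1110110010010000';
my $packed = pack 'B*', $bits;

# Show the two packed bytes as hex instead of printing them raw.
my $hex = unpack 'H*', $packed;
print "$hex\n";      # ec90

# Round trip back to the original bit string.
my $again = unpack 'B*', $packed;
print "$again\n";    # 1110110010010000
```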

That said, what you are really trying to do is create bytes from their binary representation (without using the chr function as you described in Weird UTF stuff in FreeBSD) which is more easily done like this:

    $bobpacked = pack 'S', 0b111011001001000
    print $bobpacked
    Hv
    $bobagain = unpack 'S', $bobpacked
    printf '%b', $bobagain
    111011001001000

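Here Perl reads 0b1110110010010000 and stores it in its native integer format (32 bits on my system; I think this differs on a 64-bit system!). Looking back at the session above, the literal I actually typed has only 15 digits (0b111011001001000, i.e. 0x7648), which is why the output is Hv and the bits come back one digit short; the principle is unchanged. Since you want to pack two bytes and you don't want the high-order bit to be treated as a two's-complement sign bit, you should use the 'S' unsigned short pack format (rather than 's', and definitely not 'B', which does something quite different!). Note that 'S' writes the bytes in the machine's native order, so on my little-endian machine the low-order byte comes first.

You can supply the output from pack to unpack with the same 'S' format in order to retrieve the value, and use printf '%b' to see it in binary again.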
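If the order of the two bytes matters (it will if they are meant to be read as a byte sequence), a fixed-order format may be safer than 'S'. The sketch below uses the 'n' (big-endian) and 'v' (little-endian) 16-bit formats, which are my addition and not something used in the session above; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $value = 0xEC90;

# 'n' = unsigned 16-bit, big-endian ("network" order): high byte first.
my $be = unpack 'H*', pack 'n', $value;

# 'v' = unsigned 16-bit, little-endian: low byte first.
my $le = unpack 'H*', pack 'v', $value;

# 'S' = unsigned 16-bit in whatever order this machine uses,
# so it matches one of the two layouts above.
my $native = unpack 'H*', pack 'S', $value;

print "n: $be\n";        # ec90
print "v: $le\n";        # 90ec
print "S: $native\n";    # depends on the machine
```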

You may find it easier (as I do) to think in hex, in which case all of the following achieve exactly the same thing, but are maybe easier to read. You could also use octal.

    printf '%b', unpack 'S', pack 'S', 0xEC90
    1110110010010000
    printf '%x', unpack 'S', pack 'S', 0xEC90
    ec90
    printf '%#X', unpack 'S', pack 'S', 0xEC90
    0XEC90
    printf '%#x', unpack 'S', pack 'S', 0xEC90
    0xec90
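Binary, hex and octal are just different spellings of the same number. A minimal sketch, assuming the digits arrive as strings (say, from a file or user input), in which case the built-in oct and hex functions do the parsing; the variable names are mine.

```perl
use strict;
use warnings;

# Parse the same value from three textual notations.
my $from_bin = oct '0b1110110010010000';   # oct understands a 0b prefix
my $from_hex = hex 'EC90';
my $from_oct = oct '166220';

print "$from_bin $from_hex $from_oct\n";   # 60560 60560 60560

# And format it back out again.
my $as_bin = sprintf '%b', 0xEC90;
my $as_oct = sprintf '%o', 0xEC90;
print "$as_bin $as_oct\n";                 # 1110110010010000 166220
```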

However, if you try to extend this to 3 bytes (see further down for why you will want to!):

    printf '%#x', unpack 'S', pack 'S', 0xABCDEF
    0xcdef

You'll come unstuck! This is because, although the value passed to pack needs 3 bytes, I've specified 'S', which means pack keeps only the low-order 2 bytes (0xCDEF) and drops the rest! unpack duly unpacks the two bytes returned, hence the truncated output.
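The truncation is the same as masking the value to 16 bits. A small sketch of that equivalence, using the fixed-order 'n' format (my substitution for 'S', so the result does not depend on the machine); the names are illustrative.

```perl
use strict;
use warnings;

my $wide = 0xABCDEF;

# A 16-bit format keeps only the low-order 16 bits of the value.
my $kept = unpack 'n', pack 'n', $wide;

# Masking by hand gives the same result.
my $masked = $wide & 0xFFFF;

printf "%#x %#x\n", $kept, $masked;   # 0xcdef 0xcdef
```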

Now you might think of moving to using the 'I' 32-bit pack format, and this may seem as if it works:

    printf '%#x', unpack 'I', pack 'I', 0xABCDEF
    0xabcdef

However, printf is 'being nice' and decides that, as the top 2 nybbles of this 32-bit number are zeros, you don't need to see them! The packed value is really 4 bytes long, not 3. If we ask nicely, printf will show them:

    printf '%#8.8x', unpack 'I', pack 'I', 0xABCDEF
    0x00abcdef
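Another way to see that four bytes were packed is to dump the packed string itself rather than the unpacked number. A sketch, assuming the fixed-order 32-bit 'N' format and unpack 'H*' (neither appears in the session above):

```perl
use strict;
use warnings;

my $n = 0xABCDEF;

# Formatting the number: leading zeros appear only if you ask for them.
my $plain  = sprintf '%#x',    $n;   # 0xabcdef
my $padded = sprintf '%#8.8x', $n;   # 0x00abcdef

# Dumping the packed bytes: the zero byte is really there.
my $packed = pack 'N', $n;
my $dump   = unpack 'H*', $packed;   # 00abcdef

print "$plain $padded $dump (", length($packed), " bytes)\n";
```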

So, if you want to code 3 or 5 or 6 bytes (and you will), you will need to use the 'C' format, and pass each byte of your char to pack individually:

    printf '%b', unpack 'C*', pack 'C*', 0xEC, 0x90
    11101100

Looks good, until you try three bytes:

    printf '%#8.8x', unpack 'C*', pack 'C*', 0xAB, 0xCD, 0xEF
    0x000000ab

The problem here is that unpack now returns 3 numeric values, one for each byte, and printf only uses the first of them, as demonstrated here:

    print unpack 'C*', pack 'C*', 0xAB, 0xCD, 0xEF
    171 205 239
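Since unpack 'C*' hands back a list, each element needs its own format. One way to do that, sketched here with map and join (my choice, not something from the session above):

```perl
use strict;
use warnings;

my @bytes = unpack 'C*', pack 'C*', 0xAB, 0xCD, 0xEF;

# Format every byte separately, then join the pieces for display.
my $hex = join ' ', map { sprintf '%02x', $_ } @bytes;
my $bin = join ' ', map { sprintf '%08b', $_ } @bytes;

print "$hex\n";   # ab cd ef
print "$bin\n";   # 10101011 11001101 11101111
```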

The only way I have found to handle this (though I am sure that a more elegant solution is there somewhere) is:

    ($b1,$b2,$b3) = unpack 'C*', pack 'C*', 0xAB, 0xCD, 0xEF
    printf '%#x', ( (($b1 << 8) + $b2 ) << 8) + $b3
    0xabcdef

I could use an array for the intermediate storage of the bytes, but that just makes the code look worse. It is easy to see how this could be extended to handle any number of bytes. However.........
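For what it's worth, here is a sketch of the "any number of bytes" extension as a loop. The helper name bytes_to_int is hypothetical, and the result is limited by the width of Perl's native integers (4 bytes on a 32-bit build, 8 on a 64-bit one).

```perl
use strict;
use warnings;

# Hypothetical helper: fold a list of byte values into one integer,
# most significant byte first.
sub bytes_to_int {
    my @bytes = @_;
    my $n = 0;
    $n = ( $n << 8 ) | $_ for @bytes;
    return $n;
}

my $value = bytes_to_int( unpack 'C*', pack 'C*', 0xAB, 0xCD, 0xEF );
printf "%#x\n", $value;   # 0xabcdef
```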

Warning! None of this will help you with encoding UCS (UTF-8) chars, because the bits of a given codepoint number are NOT copied directly into the encoded bytes.

For example, the character with the codepoint value 0b11101101_00100000 would need to be encoded as 0b11101110_10110100_10100000! Yes, 3 bytes! (The underscores are there only for clarification.)

And, as you'll see in the first table on the page referenced below, it uses up to 6 bytes to represent the full range of 2**31 codepoints!
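To make the example above concrete, here is a sketch that spreads the bits of a codepoint over the UTF-8 lead and continuation bytes by hand. The helper utf8_bytes is hypothetical and only handles codepoints below 0x10000 (1 to 3 bytes); the result is cross-checked against Perl's built-in utf8::encode.

```perl
use strict;
use warnings;

# Hypothetical helper: UTF-8 byte values for one codepoint below 0x10000.
sub utf8_bytes {
    my ($cp) = @_;
    return ($cp) if $cp < 0x80;                       # 0xxxxxxx
    return ( 0xC0 | ( $cp >> 6 ),                     # 110xxxxx
             0x80 | ( $cp & 0x3F ) )                  # 10xxxxxx
        if $cp < 0x800;
    return ( 0xE0 | ( $cp >> 12 ),                    # 1110xxxx
             0x80 | ( ( $cp >> 6 ) & 0x3F ),          # 10xxxxxx
             0x80 | ( $cp & 0x3F ) )                  # 10xxxxxx
        if $cp < 0x10000;
    die sprintf "codepoint %#x is outside the range this sketch handles\n", $cp;
}

my @bytes = utf8_bytes(0xED20);
my $bin   = join '_', map { sprintf '%08b', $_ } @bytes;
print "$bin\n";       # 11101110_10110100_10100000

# Cross-check with the built-in encoder (it converts the string in place).
my $char = chr 0xED20;
utf8::encode($char);
my $builtin = unpack 'H*', $char;
print "$builtin\n";   # eeb4a0
```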

To understand why this is so, the best source of information I have found on UCS (utf-8) character encoding is here.

After our first conversation the other night in the CB, I realised how little I knew about Unicode, and decided I really should know more. I did a little research and that was the best source of information I found.

I hope to have a usable work-around for your original problem in a couple of days.


What's this about a "crooked mitre"? I'm good at woodwork!

In reply to Re: Problems using pack by BrowserUk
in thread Problems using pack by Tanaka
