Re: Problems using pack

WARNING! What follows WILL NOT SOLVE YOUR ORIGINAL PROBLEM! To understand why, skip to the second WARNING! close to the bottom.

Also, all the code and output below are live from one of my test sessions, with some (where some equals lots :) of my failed attempts to understand Unicode omitted. Every single line is real code and real output, but that doesn't mean all the conclusions I draw or the assumptions I make are correct, only that they fit my observations using limited test data!

Tanaka, as tye already mentioned, pack( 'B*', 1110110010010000 ) is packing a large decimal number, not a string of binary digits.

What no one has mentioned yet (as I'm writing; by the time I post, it may be different :) is that, if you really want to pack an ASCII-encoded binary string into bytes, you need to put quotes around it.

$bobpacked = pack( 'B*', '1110110010010000' )

print $bobpacked
8É
$bobagain = unpack( 'B*', $bobpacked )

print $bobagain
1110110010010000
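Because the packed bytes are not printable, it can be easier to inspect them as hex. A minimal sketch, assuming you only want to confirm which two bytes the quoted string produces (the variable names are mine):

```perl
use strict;
use warnings;

# Quoted: pack sees the 16 characters '1' and '0' and turns them into 2 bytes.
my $packed = pack 'B*', '1110110010010000';

# unpack 'H*' shows the raw bytes as hex digits, high nybble first.
my $hex  = unpack 'H*', $packed;
my $bits = unpack 'B*', $packed;

printf "%d bytes, hex %s, bits %s\n", length($packed), $hex, $bits;
```

This avoids depending on how your terminal renders bytes such as 0xEC and 0x90.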

That said, what you are really trying to do is create bytes from their binary representation (without using the chr function, as you described in Weird UTF stuff in FreeBSD), which is more easily done like this:

$bobpacked = pack 'S', 0b111011001001000

print $bobpacked
Hv

$bobagain = unpack 'S', $bobpacked

printf '%b', $bobagain
111011001001000

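Here Perl reads the 0b... literal and stores it in its native integer format (32 bits on my system, but I think this varies if you have a 64-bit system!). Note that the literal above has only 15 digits, one trailing zero short of the 16-digit string in the first example, so it is 0x7648 rather than 0xEC90; that is why it prints as Hv and comes back as 15 binary digits. Since you know you want to pack two bytes and you don't want the high-order bit to be treated as a two's complement sign indicator, you should use the 'S' unsigned short pack format (rather than 's', and definitely not 'B', which does something quite different!). Be aware that 'S' uses the machine's native byte order: on my little-endian box the low byte comes out first (H is 0x48, v is 0x76). If the byte order matters, 'n' always packs 16 bits big-endian.

You can supply the output from pack to unpack with the same 'S' format in order to retrieve the value, and use printf '%b' to see it in binary again.

You may find it easier (as I do) to think in hex, in which case all of the following achieve exactly the same thing but may be easier to read. You could also use octal.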
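The round trip through 'S' hides the byte order, because unpack reverses whatever pack did. A small sketch that makes the order visible, assuming the value 0xEC90 from the first example; 'n' and 'v' are the fixed-order 16-bit formats, and the variable names are mine:

```perl
use strict;
use warnings;

my $value = 0xEC90;

my $big    = unpack 'H*', pack( 'n', $value );   # 'n': 16-bit, big-endian
my $little = unpack 'H*', pack( 'v', $value );   # 'v': 16-bit, little-endian
my $native = unpack 'H*', pack( 'S', $value );   # 'S': this machine's own order

print "n: $big\n";
print "v: $little\n";
print "S: $native\n";
```

On a little-endian machine the 'S' line matches the 'v' line, so the bytes written out are 0x90 then 0xEC, not the other way round.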

printf '%b', unpack 'S', pack 'S', 0xEC90
1110110010010000

printf '%x', unpack 'S', pack 'S', 0xEC90
ec90

printf '%#X', unpack 'S', pack 'S', 0xEC90
0XEC90

printf '%#x', unpack 'S', pack 'S', 0xEC90
0xec90

However, watch what happens if you try to extend this to 3 bytes (see further down for why you will want to!):

printf '%#x', unpack 'S', pack 'S', 0xABCDEF
0xcdef

You'll come unstuck! Although the value needs 3 bytes, I've specified 'S', which means pack will only pack the low-order 2 bytes (16 bits) of it! unpack duly unpacks the two bytes returned, hence the truncated output.

Now you might think of moving to the 'I' pack format (an unsigned int, 32 bits on my system), and this may seem as if it works:
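The truncation is the same as masking off everything above the low 16 bits. A minimal sketch to confirm that, with variable names of my own choosing:

```perl
use strict;
use warnings;

my $value = 0xABCDEF;

# What survives a trip through a 16-bit format...
my $round_trip = unpack 'S', pack( 'S', $value );

# ...is the same as keeping only the low 16 bits by hand.
my $masked = $value & 0xFFFF;

printf "round trip %#x, masked %#x\n", $round_trip, $masked;
```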

printf '%#x', unpack 'I', pack 'I', 0xABCDEF
0xabcdef

However, printf is 'being nice' and decides that, as the top 2 nybbles of this 32-bit number are zeros, you don't need to see them! pack has still produced 4 bytes, not 3. If we ask nicely, printf will show them:

printf '%#8.8x', unpack 'I', pack 'I', 0xABCDEF
0x00abcdef
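To see that the packed string really is 4 bytes long, you can look at its length and raw bytes. This sketch uses 'N' (32-bit big-endian) rather than 'I' so that the byte order is the same on every machine; trimming the leading byte with substr is my own assumption about one way to get 3 bytes, not something from the thread:

```perl
use strict;
use warnings;

my $four  = pack 'N', 0xABCDEF;    # 32-bit big-endian: 00 ab cd ef
my $three = substr $four, 1;       # drop the leading zero byte

for my $bytes ( $four, $three ) {
    printf "%d bytes: %s\n", length($bytes), unpack( 'H*', $bytes );
}
```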

So, if you want to encode 3 or 5 or 6 bytes (and you will), you will need to use the 'C' format and pass each byte of your char to pack individually:

printf '%b', unpack 'C*', pack 'C*',  0xEC, 0x90
11101100

Looks good until:

printf '%#8.8x', unpack 'C*', pack 'C*',  0xAB, 0xCD, 0xEF
0x000000ab

The problem here is that unpack now returns 3 numeric values, one for each byte, and printf only formats the first of them. This shows all three:

print unpack 'C*', pack 'C*',  0xAB, 0xCD, 0xEF
171 205 239
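If you want each of those values in hex, one option is to format them one at a time and join the results. A small sketch, with names of my own:

```perl
use strict;
use warnings;

# unpack 'C*' returns a list, so keep it in an array...
my @bytes = unpack 'C*', pack( 'C*', 0xAB, 0xCD, 0xEF );

# ...and format every element, not just the first.
my $shown = join ' ', map { sprintf( '%02x', $_ ) } @bytes;

print "$shown\n";
```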

The only way I have found to handle this (though I am sure that a more elegant solution is there somewhere) is:

($b1,$b2,$b3) = unpack 'C*', pack 'C*', 0xAB, 0xCD, 0xEF
printf  '%#x', ( (($b1 << 8) + $b2 ) << 8) + $b3
0xabcdef
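The same shift-and-add idea can be folded into a loop for any number of bytes. This is only a sketch: bytes_to_int is a hypothetical helper name, and the result is limited to what a native Perl integer can hold (4 or 8 bytes, depending on the build):

```perl
use strict;
use warnings;

# Hypothetical helper: combine a list of byte values, most significant first.
sub bytes_to_int {
    my $n = 0;
    $n = ( $n << 8 ) | $_ for @_;
    return $n;
}

my $n = bytes_to_int( unpack 'C*', pack( 'C*', 0xAB, 0xCD, 0xEF ) );
printf "%#x\n", $n;
```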

I could use an array for the intermediate storage of the bytes, but that just makes the code look worse. It is easy to see how this could be extended to handle any number of bytes. However...

WARNING! None of this will help you with encoding UCS (UTF-8) chars, because the binary value of a given codepoint number is NOT stored directly as the bytes of the encoded character.

For example, the character with the codepoint value 0b11101101_00100000 would need to be encoded as 0b11101110_10110100_10100000! Yes, 3 bytes! (The underscores are there only for clarification.)

And, as you'll see in the first table on the page referenced below, it uses up to 6 bytes to represent the full range of 2**31 codepoints! (The later UTF-8 standard, RFC 3629, stops at U+10FFFF and so never needs more than 4 bytes.)
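To make the example concrete, here is a sketch that builds the 3-byte form by hand and compares it with what the core Encode module produces. The hand-rolled part only covers codepoints from 0x800 to 0xFFFF, and the variable names are mine:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $cp = 0xED20;    # 0b11101101_00100000

# 3-byte UTF-8 layout: 1110xxxx 10xxxxxx 10xxxxxx
my @manual = (
    0xE0 | ( $cp >> 12 ),             # top 4 bits of the codepoint
    0x80 | ( ( $cp >> 6 ) & 0x3F ),   # middle 6 bits
    0x80 | ( $cp & 0x3F ),            # low 6 bits
);

my $manual_hex = join '', map { sprintf( '%02x', $_ ) } @manual;
my $encode_hex = unpack 'H*', encode( 'UTF-8', chr($cp) );

print "by hand: $manual_hex\n";
print "Encode:  $encode_hex\n";
```

Both lines show the bytes EE B4 A0, which is the 0b11101110_10110100_10100000 given above.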

To understand why this is so, the best source of information I have found on UCS (utf-8) character encoding is here.

After our first conversation the other night in the CB, I realised how little I knew about Unicode, and decided I really should know more. I did a little research and that was the best source of information I found.

I hope to have a usable work-around for your original problem in a couple of days.

What's this about a "crooked mitre"? I'm good at woodwork!
