in reply to Problems using pack
WARNING: What follows WILL NOT SOLVE YOUR ORIGINAL PROBLEM! To understand why, scan down to the second Warning! close to the bottom.
Also, all the code and output below are live from one of my test sessions, with some (where some equals lots :) of my failed attempts to understand Unicode omitted. Every single line is real code and real output, but that doesn't mean all the conclusions I draw or the assumptions I make are correct. Only that they fit my observations using limited test data!
Tanaka, as tye already mentioned, pack( 'B*', 1110110010010000 ) is packing a large decimal number.
What no one has mentioned yet (as I'm writing; by the time I post, it may be different :) is that if you really want to pack an ASCII-encoded binary string into bytes, you need to put quotes around it.
    $bobpacked = pack( 'B*', '1110110010010000' )
    print $bobpacked
    8É
    $bobagain = unpack( 'B*', $bobpacked )
    print $bobagain
    1110110010010000
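As a standalone script, the quoted version can be checked like this. It is only a sketch of the session above; I print the packed bytes in hex (via the 'H*' unpack format) rather than raw, so the result doesn't depend on your terminal's code page.

```perl
use strict;
use warnings;

# Quoted: pack sees the 16 characters '1', '1', '1', '0', ... and
# turns each character into one bit, most significant bit first.
my $packed = pack 'B*', '1110110010010000';

# Show the two resulting bytes as hex digits instead of raw characters.
my $hex = unpack 'H*', $packed;
print "$hex\n";     # ec90

# unpack with the same template recovers the original bit string.
my $bits = unpack 'B*', $packed;
print "$bits\n";    # 1110110010010000
```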
That said, what you are really trying to do is create bytes from their binary representation (without using the chr function as you described in Weird UTF stuff in FreeBSD) which is more easily done like this:
    $bobpacked = pack 'S', 0b111011001001000
    print $bobpacked
    Hv
    $bobagain = unpack 'S', $bobpacked
    printf '%b', $bobagain
    111011001001000
Here Perl reads the binary literal 0b1110110010010000 and stores it in its native integer format (32 bits on my system; I think this differs on a 64-bit system!). Note that the literal in the session above has only 15 digits (it is missing a trailing zero), which is why only 15 digits come back; the technique is the same for the full 16-bit value. Since you know you want to pack two bytes and you don't want the high-order bit to be treated as a two's complement sign bit, you should use the 'S' unsigned short pack format (rather than 's', and definitely not 'B', which does something quite different!).
You can supply the output from pack to unpack with the same 'S' format in order to retrieve the value, and use printf '%b' to see it in binary again.
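One thing worth knowing here: 'S' uses your machine's native byte order, so the two bytes that come out of pack can appear in either order even though the round trip always works. A small sketch, assuming you want to see or fix that order, using the 'n' (high byte first) and 'v' (low byte first) pack formats:

```perl
use strict;
use warnings;

my $value = 0b1110110010010000;    # the full 16-bit value, same as 0xEC90

# 'S' round-trips on any machine, whatever its byte order.
my $round_trip = unpack 'S', pack 'S', $value;
printf "%b\n", $round_trip;        # 1110110010010000

# 'n' always stores the high byte first, 'v' always the low byte first.
my $big    = unpack 'H*', pack 'n', $value;
my $little = unpack 'H*', pack 'v', $value;
print "$big $little\n";            # ec90 90ec
```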
You may find it easier (as I do) to think in hex, in which case all of the following achieve exactly the same thing, but are maybe easier to read. You could also use octal.
    printf '%b', unpack 'S', pack 'S', 0xEC90
    1110110010010000
    printf '%x', unpack 'S', pack 'S', 0xEC90
    ec90
    printf '%#X', unpack 'S', pack 'S', 0xEC90
    0XEC90
    printf '%#x', unpack 'S', pack 'S', 0xEC90
    0xec90
However, try to extend this to 3 bytes (see further down for why you will want to!):
    printf '%#x', unpack 'S', pack 'S', 0xABCDEF
    0xcdef
You'll come unstuck! This is because although the value passed to pack needs 3 bytes, I've specified 'S', which means pack will only pack the low-order 2 bytes (16 bits)! unpack duly unpacks the two bytes returned, hence the truncated output.
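To convince yourself that it is the low-order 16 bits that survive, here is a quick sketch comparing the round trip through 'S' with a plain bit mask:

```perl
use strict;
use warnings;

my $value = 0xABCDEF;

# 'S' keeps only the low 16 bits of the value ...
my $via_pack = unpack 'S', pack 'S', $value;

# ... which is the same as masking them out by hand.
my $via_mask = $value & 0xFFFF;

printf "%#x %#x\n", $via_pack, $via_mask;    # 0xcdef 0xcdef
```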
Now you might think of moving to using the 'I' 32-bit pack format, and this may seem as if it works:
    printf '%#x', unpack 'I', pack 'I', 0xABCDEF
    0xabcdef
However, printf is 'being nice' and decides that, as the top 2 nybbles of this 32-bit number are zeros, you don't need to see them! If we ask nicely, it will show them though:
    printf '%#8.8x', unpack 'I', pack 'I', 0xABCDEF
    0x00abcdef
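Another way to see that the leading zero byte really is in the packed data is to dump the packed string itself in hex. This sketch uses 'N' (32 bits, high byte first) instead of 'I', purely so that the dump reads the same on every machine; that swap is my assumption, not something from the session above.

```perl
use strict;
use warnings;

# Pack as 4 bytes, most significant byte first.
my $packed = pack 'N', 0xABCDEF;

# Four bytes are stored, and the first one is zero.
my $length = length $packed;
my $dump   = unpack 'H*', $packed;
print "$length bytes: $dump\n";    # 4 bytes: 00abcdef
```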
So, if you want to code 3 or 5 or 6 bytes (and you will), you will need to use the 'C' format, and pass each byte of your char to pack individually:
    printf '%b', unpack 'C*', pack 'C*', 0xEC, 0x90
    11101100
Looks good, until:
    printf '%#8.8x', unpack 'C*', pack 'C*', 0xAB, 0xCD, 0xEF
    0x000000ab
The problem here is that unpack now returns 3 numeric values, one for each byte, and printf only uses the first of them for its single format field. You can see all three here:
    print unpack 'C*', pack 'C*', 0xAB, 0xCD, 0xEF
    171 205 239
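If all you want is to look at every byte, one option is to format each element of the list separately. A small sketch:

```perl
use strict;
use warnings;

# unpack 'C*' returns one number per byte.
my @bytes = unpack 'C*', pack 'C*', 0xAB, 0xCD, 0xEF;
my $count = scalar @bytes;

# Format each byte on its own instead of handing printf the whole list.
my $shown = join ' ', map { sprintf '%#x', $_ } @bytes;
print "$count bytes: $shown\n";    # 3 bytes: 0xab 0xcd 0xef
```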
The only way I have found to handle this (though I am sure that a more elegant solution is there somewhere) is:
    ($b1,$b2,$b3) = unpack 'C*', pack 'C*', 0xAB, 0xCD, 0xEF
    printf '%#x', ( (($b1 << 8) + $b2 ) << 8) + $b3
    0xabcdef
I could use an array for the intermediate storage of the bytes, but that just makes the code look worse. It is easy to see how this could be extended to handle any number of bytes. However...
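For what it's worth, here is one way the "any number of bytes" extension might look. It is a sketch only: bytes_to_int is a name I made up, and the result is limited to whatever fits in your Perl's native integer (4 bytes on a 32-bit build, 8 on a 64-bit one).

```perl
use strict;
use warnings;

# Hypothetical helper: fold a list of byte values (most significant
# byte first) into a single integer, 8 bits at a time.
sub bytes_to_int {
    my @bytes = @_;
    my $n = 0;
    $n = ( $n << 8 ) | $_ for @bytes;
    return $n;
}

my @bytes  = unpack 'C*', pack 'C*', 0xAB, 0xCD, 0xEF;
my $joined = bytes_to_int(@bytes);
printf "%#x\n", $joined;    # 0xabcdef
```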
Warning! None of this will help you with encoding UCS (UTF-8) characters, because the number of a given code point is NOT stored directly as its binary representation.
For example, the character with the code point value 0b11101101_00100000 would need to be encoded as 0b11101110_10110100_10100000! Yes, 3 bytes! (The underscores are there only for clarification.)
And, as you'll see in the first table on the page referenced below, UTF-8 uses up to 6 bytes to represent the full range of 2**31 code points!
To understand why this is so, the best source of information I have found on UCS (utf-8) character encoding is here.
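To make the example above concrete, here is a hand-rolled sketch of the encoding for the 1, 2 and 3 byte cases only. utf8_bytes is a made-up name, this is not how Perl does it internally, and it makes no attempt at the longer sequences or at rejecting invalid code points.

```perl
use strict;
use warnings;

# Hypothetical helper: return the UTF-8 byte values for one code point.
# Only code points up to 0xFFFF (1 to 3 bytes) are handled in this sketch.
sub utf8_bytes {
    my ($cp) = @_;
    return ($cp) if $cp < 0x80;                        # 0xxxxxxx
    return ( 0xC0 | ( $cp >> 6 ),                      # 110xxxxx
             0x80 | ( $cp & 0x3F ) ) if $cp < 0x800;   # 10xxxxxx
    die "only 1 to 3 byte sequences handled here\n" if $cp > 0xFFFF;
    return (
        0xE0 | ( $cp >> 12 ),                          # 1110xxxx
        0x80 | ( ( $cp >> 6 ) & 0x3F ),                # 10xxxxxx
        0x80 | ( $cp & 0x3F ),                         # 10xxxxxx
    );
}

my $encoded = join '_', map { sprintf '%08b', $_ } utf8_bytes(0b11101101_00100000);
print "$encoded\n";    # 11101110_10110100_10100000
```

The 16 bits of the code point get spread over the x positions of the three bytes (4 + 6 + 6), which is why the packed value looks nothing like the original number.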
After our first conversation the other night in the CB, I realised how little I knew about Unicode, and decided I really should know more. I did a little research and that was the best source of information I found.
I hope to have a usable work-around for your original problem in a couple of days.