use bytes vs packed data

RDOlson has asked for the wisdom of the Perl Monks concerning the following question:

Greetings perl monks,

I'm trying to send packed data through RabbitMQ using Net::RabbitMQ, and am running into problems with the data being munged on the way into the library.

I'm packing a couple strings using pack("n/an/a", $str1, $str2). If I hexdump the buffer it looks as I expect:

 03 B0 2D 2D 2D 0A 43 4F - 4E 54 45 4E 54 5F 4C 45  ..---.CONTENT_LE

All is good, 03 B0 == 944 == the length of $str1. However, when I invoke the RabbitMQ method Net::RabbitMQ::publish, an XS method which I have hacked to print the first few bytes of the input data, I find it has been changed to 03 c2 b0 2d. This feels like a charset manipulation is going on.

If I add use bytes to the module invoking the call, all is well. However, the documentation for use bytes strongly discourages its use for anything other than documentation. It is unclear to me exactly how to properly proceed.

This is with perl 5.12.2 on 64-bit intel.

Thank you for your wisdom,

--bob


Replies are listed 'Best First'.
Re: use bytes vs packed data by choroba (Cardinal) on May 02, 2011 at 16:25 UTC
s/than documentation/than debugging purposes/
Re^2: use bytes vs packed data by RDOlson (Initiate) on May 02, 2011 at 18:17 UTC
Heh, sigh, yes, that's what I get for not actually copying and pasting.
Re: use bytes vs packed data by John M. Dlugosz (Monsignor) on May 02, 2011 at 23:26 UTC
I agree, the character U+00B0, which would fit in 8 bits, is expressed in UTF-8 as the byte sequence C2 B0. I think it has to do with your string being marked as an 8-bit string, and then some manipulation produces a UTF-8 string with the same sequence of code points. The details of when string operations on mixed inputs produce wide or narrow strings depend on the utf8 pragma, as does whether string literals are wide or narrow. The more correct way to handle it is to use "encoding" functions. But I think you are getting this behavior from functions that are already written, right?
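As a small illustration of the C2 B0 expansion described above, the following sketch packs the length 944 (0x03B0) as a 16-bit prefix and then encodes the result as UTF-8. The `hex_dump` helper is a hypothetical name introduced here for display only; it is not part of the original code.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render every character of a string as two hex digits.
sub hex_dump { join " ", map { sprintf "%02x", ord } split //, $_[0] }

# 944 is 0x03B0, so a 16-bit big-endian length prefix is the bytes 03 B0.
my $prefix = pack("n", 944);
my $before = hex_dump($prefix);

# Encoding those two code points as UTF-8 expands U+00B0 to C2 B0.
my $after = hex_dump(encode("UTF-8", $prefix));

print "raw bytes:   $before\n";   # 03 b0
print "UTF-8 bytes: $after\n";    # 03 c2 b0
```

The second line matches the `03 c2 b0` that the hacked XS method printed, which is consistent with the string having been converted to UTF-8 internally somewhere before the call.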
Re: use bytes vs packed data by John M. Dlugosz (Monsignor) on May 03, 2011 at 04:57 UTC
BTW, I just noticed encoding::warnings. Might be useful here?
Re^2: use bytes vs packed data by RDOlson (Initiate) on May 03, 2011 at 15:49 UTC
Interesting, it seems it would be, but when I turn off `use bytes` and add `use encoding::warnings` I get the incorrect behavior with no warnings. The code fragment where this is happening is as follows, in case it helps. `$conn` is a Net::RabbitMQ connection object. `$params` is an FCGI parameters hash. `$packed_data` is the string that is getting munged.

    my $s = YAML::Dump($params);
    print "pack length " . length($s) . "\n";
    my $packed_data = pack("N/aN/a", $s, $in);
    $conn->publish($channel, "rpc.$function", $packed_data,
                   { exchange => $exchange_name },
                   { content_type => $type,
                     correlation_id => $uuid_str,
                     reply_to => $queue_name,
                   });
Re^3: use bytes vs packed data by John M. Dlugosz (Monsignor) on May 04, 2011 at 02:05 UTC
There is a function to find out whether the string is stored as utf8 or 8-bit, but I can't remember what it's called. You might try exploring the different values and see.
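The function being recalled here is most likely `utf8::is_utf8`, which reports whether a scalar's internal UTF8 flag is set. A minimal sketch, assuming nothing beyond core Perl, of how two strings can be equal yet stored differently:

```perl
use strict;
use warnings;

my $narrow = "\xB0";      # one character, stored internally as one byte
my $wide   = $narrow;
utf8::upgrade($wide);     # same character, now stored internally as UTF-8

printf "narrow: flag=%d length=%d\n", (utf8::is_utf8($narrow) ? 1 : 0), length($narrow);
printf "wide:   flag=%d length=%d\n", (utf8::is_utf8($wide)   ? 1 : 0), length($wide);
print  "equal as strings\n" if $narrow eq $wide;
```

Both strings have length 1 and compare equal from Perl's point of view. The difference only becomes visible to code, such as an XS routine, that reads the internal buffer without consulting the flag.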
Re^4: use bytes vs packed data by ikegami (Patriarch) on May 04, 2011 at 16:45 UTC
Re^5: use bytes vs packed data by John M. Dlugosz (Monsignor) on May 04, 2011 at 21:43 UTC
Re: use bytes vs packed data by ikegami (Patriarch) on May 04, 2011 at 16:44 UTC
On a hunch, check if the following helps:

    utf8::downgrade( $string_to_pass_to_rabbit );

If so, that would indicate a bug in Rabbit. If not, let me know and I'll look into it.

Update: It could also indicate a bad input. I would also appreciate the output of

    use Devel::Peek;
    Dump( $string_to_pass_to_rabbit );
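One way to apply the downgrade suggestion defensively is sketched below. The `as_octets` helper and the sample payload are hypothetical, introduced only for illustration. The sketch relies on the documented second argument of `utf8::downgrade` (FAIL_OK), which makes the function return false instead of dying when the string holds characters above 0xFF.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of $s that is stored one byte per
# character, or die if the string cannot be represented that way.
sub as_octets {
    my ($s) = @_;
    utf8::downgrade($s, 1)
        or die "string contains characters above 0xFF; encode it explicitly\n";
    return $s;
}

my $payload = "\x03\xB0---";
utf8::upgrade($payload);             # simulate a string that got UTF8-flagged

my $safe = as_octets($payload);
print utf8::is_utf8($safe) ? "still flagged\n" : "plain bytes\n";
```

Working on a copy leaves the caller's string untouched, and the explicit failure makes it obvious when the data is real text rather than bytes that merely picked up the flag.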
Re^2: use bytes vs packed data by RDOlson (Initiate) on May 04, 2011 at 19:56 UTC
With use bytes turned off:

    SV = PV(0x1c71c9e8) at 0x1c202710
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK,UTF8)
      PV = 0x1cab9940 "\0\0\3\302\261---\nCONTENT_LENGTH: [etc]

With use bytes turned on:

    SV = PV(0xbfdd1e8) at 0xb734730
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK)
      PV = 0xbfeb710 "\0\0\3\261---\nCONTENT_LENGTH: [etc]

With use bytes turned off and with calling utf8::downgrade on the string before dumping:

    SV = PV(0xd5ed9e8) at 0xd0d3710
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK)
      PV = 0xd98b100 "\0\0\3\261---\nCONTENT_LENGTH: [etc]

Calls to `Encode::is_utf8` are also tracking those values as noted by the dump.

aHA. Resorted to reading the code in pp_pack.c, discovered that the UTFness of a packed string appears to be based on the UTFness of its components (which makes sense). The second string I'm packing here came from the Net::Async::FastCGI::Request stdin data, and was UTF8-flagged.

So the question turns into how to convince that module to forget about encodings in its I/O - the code I'm building here is just relaying bits from one place to another and shouldn't be touching them.
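The behavior found in pp_pack.c can be reproduced in a few lines. This is a sketch under assumptions: `$str1` and `$str2` are stand-ins for the real data, and `hex_dump` and `raw_view` are hypothetical helpers. `raw_view` approximates what C code sees when it reads the string buffer without checking the UTF8 flag.

```perl
use strict;
use warnings;

# Hypothetical helper: render every character of a string as two hex digits.
sub hex_dump { join " ", map { sprintf "%02x", ord } split //, $_[0] }

# Hypothetical helper: the octets of the internal buffer, i.e. what a
# flag-unaware C routine would read.
sub raw_view {
    my ($s) = @_;
    utf8::encode($s) if utf8::is_utf8($s);
    return $s;
}

my $str1 = "x" x 944;                # 944 == 0x03B0
my $str2 = "stdin data";
utf8::upgrade($str2);                # stand-in for the flagged request body

# One flagged component is enough to make the whole result flagged.
my $packed = pack("n/a n/a", $str1, $str2);
my $flag   = utf8::is_utf8($packed) ? 1 : 0;
my $seen   = hex_dump(substr(raw_view($packed), 0, 3));

utf8::downgrade($packed);            # every character is below 0x100 here
my $fixed  = hex_dump(substr(raw_view($packed), 0, 2));

print "flagged after pack: $flag\n";
print "buffer before downgrade: $seen\n";
print "buffer after downgrade:  $fixed\n";
```

Although `$str2` is pure ASCII, its flag is enough to upgrade the packed result, and the upgrade rewrites the 0xB0 byte of the first length prefix as C2 B0 in the buffer.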
Re^3: use bytes vs packed data by ikegami (Patriarch) on May 05, 2011 at 16:08 UTC
"The second string I'm packing here came from the Net::Async::FastCGI::Request stdin data, and was UTF8-flagged."

From what you say, it sounds like rabbit wants bytes. Did you encode the data? Why are you decoding an HTTP request in the first place?
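If the data really is decoded text, one option in the spirit of "did you encode the data?" is to encode each component to octets before packing, so that the length prefix counts bytes and the result carries no UTF8 flag. The sample text below is made up for illustration, and whether encoding or avoiding the decode in the first place is the right fix depends on the module supplying the data, which the thread does not settle.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text   = "caf\x{e9} \x{263A}";       # decoded text, 6 characters
my $octets = encode("UTF-8", $text);     # 9 bytes, no UTF8 flag
my $packed = pack("N/a", $octets);

my ($declared) = unpack("N", $packed);
printf "characters=%d octets=%d declared=%d flagged=%d\n",
    length($text), length($octets), $declared,
    (utf8::is_utf8($packed) ? 1 : 0);
```

The declared length equals the octet count (9) rather than the character count (6), which is what a receiver reading raw bytes off the wire would need.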
Re^4: use bytes vs packed data by RDOlson (Initiate) on May 05, 2011 at 21:55 UTC