RDOlson has asked for the wisdom of the Perl Monks concerning the following question:

Greetings perl monks,

I'm trying to send packed data through RabbitMQ using Net::RabbitMQ, and am running into problems with the data being munged on the way into the library.

I'm packing a couple of strings using pack("n/an/a", $str1, $str2). If I hexdump the buffer, it looks as I expect:

03 B0 2D 2D 2D 0A 43 4F - 4E 54 45 4E 54 5F 4C 45 ..---.CONTENT_LE

All is good: 03 B0 == 944 == the length of $str1. However, when I invoke Net::RabbitMQ::publish, an XS method which I have hacked to print the first few bytes of its input data, I find the data has been changed to 03 c2 b0 2d. This feels like a charset manipulation is going on.

If I add use bytes to the module invoking the call, all is well. However, the documentation for use bytes strongly discourages its use for anything other than documentation. It is unclear to me exactly how to properly proceed.

This is with perl 5.12.2 on 64-bit intel.

Thank you for your wisdom,

--bob

Replies are listed 'Best First'.
Re: use bytes vs packed data
by choroba (Cardinal) on May 02, 2011 at 16:25 UTC
    s/than documentation/than debugging purposes/
      Heh, sigh, yes, that's what I get for not actually copying and pasting.
Re: use bytes vs packed data
by John M. Dlugosz (Monsignor) on May 02, 2011 at 23:26 UTC
    I agree. The character U+00B0, which would fit in 8 bits, is encoded in UTF-8 as the byte sequence C2 B0.
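    As a quick check of that encoding, here is a minimal sketch using only the core utf8:: functions. The helper name hex_bytes is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: render each character of a string as two hex digits.
sub hex_bytes {
    my ($str) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $str;
}

my $char = chr(0xB0);    # one character, code point U+00B0
my $utf8 = $char;
utf8::encode($utf8);     # in place: characters become their UTF-8 octets

print hex_bytes($char), "\n";    # B0
print hex_bytes($utf8), "\n";    # C2 B0
```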

    I think it has to do with your string being marked as an 8-bit string and then some manipulation produces a UTF-8 string with the same sequence of code points.

    The details of when string operations on mixed inputs produce wide or narrow strings depend on the utf8 pragma, as does whether string literals are wide or narrow.

    The more correct way to handle it is to use the "encoding" functions. But I think you are getting this behavior from functions that are already written, right?

Re: use bytes vs packed data
by John M. Dlugosz (Monsignor) on May 03, 2011 at 04:57 UTC

      Interesting. It seems it would be, but when I turn off use bytes and add use encoding::warnings, I still get the incorrect behavior, with no warnings.

      The code fragment where this is happening is as follows, in case it helps. $conn is a Net::RabbitMQ connection object, $params is an FCGI parameters hash, and $packed_data is the string that is getting munged.

      my $s = YAML::Dump($params);
      print "pack length " . length($s) . "\n";
      my $packed_data = pack("N/aN/a", $s, $in);
      $conn->publish(
          $channel, "rpc.$function", $packed_data,
          { exchange => $exchange_name },
          {
              content_type   => $type,
              correlation_id => $uuid_str,
              reply_to       => $queue_name,
          },
      );
        There is a function to find out whether the string is stored as utf8 or 8-bit, but I can't remember what it's called. You might try exploring the different values and see.
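        The function in question is utf8::is_utf8 (Encode::is_utf8 reports the same flag). A minimal sketch of how it might be used to explore the values. The helper storage_of and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: describe how Perl stores a string internally.
sub storage_of {
    my ($str) = @_;
    return utf8::is_utf8($str) ? 'utf8' : '8-bit';
}

my $narrow = "\x03\xB0";    # plain literal: stored as 8-bit bytes
my $wide   = $narrow;
utf8::upgrade($wide);       # same characters, now stored internally as UTF-8

print storage_of($narrow), "\n";    # 8-bit
print storage_of($wide),   "\n";    # utf8
print "same characters\n" if $narrow eq $wide;
```

        Note that the two strings compare equal at the Perl level. Only code that reads the internal buffer directly, such as XS code that ignores the flag, sees a difference.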
Re: use bytes vs packed data
by ikegami (Patriarch) on May 04, 2011 at 16:44 UTC

    On a hunch, check if the following helps:

    utf8::downgrade( $string_to_pass_to_rabbit );

    If so, that would indicate a bug in Rabbit. If not, let me know and I'll look into it.

    Update: It could also indicate a bad input. I would also appreciate the output of

    use Devel::Peek; Dump( $string_to_pass_to_rabbit );

      With use bytes turned off:

      SV = PV(0x1c71c9e8) at 0x1c202710
        REFCNT = 1
        FLAGS = (PADMY,POK,pPOK,UTF8)
        PV = 0x1cab9940 "\0\0\3\302\261---\nCONTENT_LENGTH: [etc]

      With use bytes turned on:

      SV = PV(0xbfdd1e8) at 0xb734730
        REFCNT = 1
        FLAGS = (PADMY,POK,pPOK)
        PV = 0xbfeb710 "\0\0\3\261---\nCONTENT_LENGTH: [etc]

      With use bytes turned off and with calling utf8::downgrade on the string before dumping:

      SV = PV(0xd5ed9e8) at 0xd0d3710
        REFCNT = 1
        FLAGS = (PADMY,POK,pPOK)
        PV = 0xd98b100 "\0\0\3\261---\nCONTENT_LENGTH: [etc]

      Calls to Encode::is_utf8 agree with the UTF8 flag shown in each dump.

      aHA.

      I resorted to reading the code in pp_pack.c and discovered that the UTF8-ness of a packed string appears to depend on the UTF8-ness of its components (which makes sense). The second string I'm packing here came from the Net::Async::FastCGI::Request stdin data, and was UTF8-flagged. So the question turns into how to convince that module to forget about encodings in its I/O. The code I'm building here is just relaying bits from one place to another and shouldn't be touching them.
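      The effect can be reproduced without RabbitMQ or FastCGI. This is a minimal sketch under assumptions: the strings are stand-ins for the real data, utf8::upgrade simulates the UTF8-flagged stdin data, and raw_buffer_hex is a hypothetical helper that shows the bytes an XS function would see if it read the buffer without checking the UTF8 flag.

```perl
use strict;
use warnings;

# Hypothetical helper: hex of the first $count bytes of the internal buffer.
sub raw_buffer_hex {
    my ($str, $count) = @_;
    my $copy = $str;
    # For a flagged string, the UTF-8 encoding equals the internal buffer.
    utf8::encode($copy) if utf8::is_utf8($copy);
    return join ' ', map { sprintf '%02X', ord($_) }
        split //, substr($copy, 0, $count);
}

my $str1 = "---\n" . ('x' x 940);    # 944 characters, so the length is 0x03B0
my $str2 = "payload";
utf8::upgrade($str2);                # simulate UTF8-flagged input

my $packed = pack("n/an/a", $str1, $str2);

my $flagged = utf8::is_utf8($packed) ? 1 : 0;
my $before  = raw_buffer_hex($packed, 4);
print "flagged=$flagged first bytes: $before\n";    # flagged=1, 03 C2 B0 2D

utf8::downgrade($packed);            # dies if any character is above 0xFF
my $after = raw_buffer_hex($packed, 4);
print "after downgrade: $after\n";                  # 03 B0 2D 2D
```

      A single flagged component is enough to upgrade the whole packed result, including the length prefix that was already written.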

        The second string I'm packing here came from the Net::Async::FastCGI::Request stdin data, and was UTF8-flagged.

        From what you say, it sounds like rabbit wants bytes. Did you encode the data? Why are you decoding an HTTP request in the first place?
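        One way to act on that is to make every component a byte string before it reaches pack. A minimal sketch, assuming the flagged data is either decoded text (encode it) or binary data that merely got upgraded along the way (downgrade it). The sample values are invented.

```perl
use strict;
use warnings;
use Encode ();

# Case 1: decoded text. Encode it to UTF-8 octets explicitly.
my $text   = "caf\x{e9} \x{263a}";              # contains a wide character
my $octets = Encode::encode('UTF-8', $text);    # byte string, UTF8 flag off

# Case 2: binary data that was upgraded somewhere along the way.
my $binary = "\x00\xB0\xFF";
utf8::upgrade($binary);      # simulate the unwanted flag
utf8::downgrade($binary);    # dies if any character is above 0xFF

# With only byte strings as input, the packed result stays a byte string.
my $packed = pack('N/a* N/a*', $octets, $binary);
my $state  = utf8::is_utf8($packed) ? 'flagged' : 'bytes';
print "$state, ", length($packed), " bytes\n";
```

        Which case applies depends on whether the upstream module actually decoded the request body or only upgraded it, which this sketch does not try to determine.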