Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have a subroutine that takes a scalar (which I call payload) and prints it to an instance of IO::Socket::INET. The socket has had binmode($socket, ':raw') called for it.

My problem is that the payload variable may or may not have its utf8 flag set. If it does, I get the "Wide character in print" warning. I can silence the warning by always calling Encode::_utf8_off on the incoming data (or probably with some "no warnings" incantation), but I wasn't expecting this behaviour. I mistakenly thought that adding the :raw layer to the socket would set it up to accept any arbitrary bytes without worrying about what they were - in addition to telling it not to fiddle with the bytes on their way through. Clearly this isn't the case, so I thought I would seek others' wisdom as to the most elegant solution.

Is there a layer that does mean that? Do I just have to accept I need to call _utf8_off on each variable that might have utf8 data in it? Am I approaching the problem in the wrong way?

I did try pack('a*', $payload) but the utf8 flag is preserved in pack's return value.

Some slightly related questions if I may -

Are there any circumstances where pack('a*', $payload) does not give the same value as $payload?
Why is the OO binmode method available for IO::File but not IO::Handle? Is it not applicable to all handles?

Re: The utf8 flag and print()ing binary data
by ikegami (Patriarch) on Aug 31, 2010 at 18:27 UTC

    You have text you need to serialize into bytes. The process is called encoding.

    print $socket encode('UTF-8', $text);
    Or if the only thing you transmit on the socket is text, you could use binmode to add an :encoding layer.
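
    A minimal sketch of both options, using in-memory filehandles in place of the socket so it runs without a network. The sample text and the variable names are assumptions made for illustration only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical text payload; the last character is above 255.
my $text = "caf\x{E9} \x{263A}";

# Option 1: encode explicitly, then print the resulting bytes to a :raw handle.
my $buf1 = '';
open(my $raw_fh, '>:raw', \$buf1) or die "open: $!";
print {$raw_fh} encode('UTF-8', $text);
close($raw_fh);

# Option 2: add an :encoding layer so every print is encoded by the handle.
my $buf2 = '';
open(my $enc_fh, '>:raw', \$buf2) or die "open: $!";
binmode($enc_fh, ':encoding(UTF-8)');
print {$enc_fh} $text;
close($enc_fh);

# Both buffers now hold the same UTF-8 byte sequence.
print join(' ', map { sprintf '%02x', ord } split //, $buf1), "\n";
# prints: 63 61 66 c3 a9 20 e2 98 ba
```

    The layer is convenient when everything written to the handle is text; the explicit encode call is the safer choice when text and binary data share the same handle.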

    Are there any circumstances where pack('a*', $payload) does not give the same value as $payload?

    Since you're talking about internal storage formats, yes.

    $_=5;             # IV -> PV
    $_="abc"; s/.//;  # OOK=1 -> OOK=0

    Why is the OO binmode method available for IO::File but not IO::Handle? Is it not applicable to all handles?

    Weird. It makes no sense to me either. binmode doesn't apply to Dir handles, but IO::Handle is not a base class of IO::Dir, and the other methods of IO::Handle don't apply to directory handles either.

Re: The utf8 flag and print()ing binary data
by grantm (Parson) on Aug 31, 2010 at 21:49 UTC
    I wasn't expecting this behaviour. I mistakenly thought that adding the :raw layer to the socket would set it up to accept any arbitrary bytes without worrying about what they were

    The problem is that if the scalar has its utf8 flag set then it is not a string of bytes but a string of characters. The bytes required to represent those characters will depend on which encoding you want to use - hence the need to specify an encoding either with binmode or by explicitly calling encode.

    Turning off the utf8 flag as you suggest is not a good solution, since it will cause the characters to be sent as the bytes used by Perl's internal string representation. This internal representation could conceivably change (though that is unlikely), but more importantly it's not a standard format (it's almost but not exactly UTF-8).
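
    A small sketch of why this is fragile: two scalars that compare equal as strings can yield different bytes once the flag is cleared, depending on how Perl happens to store them. The variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode ();

# The same one-character string (U+00E9) in both internal storage formats.
my $downgraded = "\xE9"; utf8::downgrade($downgraded);  # stored as 1 byte
my $upgraded   = "\xE9"; utf8::upgrade($upgraded);      # stored as 2 bytes, flag on

my $equal_before = ($downgraded eq $upgraded) ? 1 : 0;  # 1: same string to Perl

# Clearing the flag exposes whatever the internal buffer contains.
Encode::_utf8_off($downgraded);
Encode::_utf8_off($upgraded);

print length($downgraded), "\n";  # 1 (byte e9)
print length($upgraded),   "\n";  # 2 (bytes c3 a9)
```

    The caller cannot see which storage format a scalar uses, so the bytes that reach the socket would depend on the string's history rather than on its value.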

      Thanks - that makes sense. Effectively my problem then is that the callers of my subroutine are passing in things other than bytes when they shouldn't be. As the code that calls my subroutine was written by me as well I shall have to give myself a good talking to!
Re: The utf8 flag and print()ing binary data
by ikegami (Patriarch) on Sep 01, 2010 at 06:03 UTC

    I mistakenly thought that adding the :raw layer to the socket would set it up to accept any arbitrary bytes without worrying about what they were

    You weren't mistaken. If you got that warning, it's because what you sent wasn't bytes: the string contained a character above 255.

    $ cat a.pl
    use strict;
    use warnings;
    binmode(STDOUT, ':raw');
    { utf8::downgrade( my $x = chr(0xE9)  ); print $x; }
    { utf8::upgrade(   my $x = chr(0xE9)  ); print $x; }
    {                  my $x = chr(0x100);   print $x; }

    $ perl a.pl | od -t x1
    Wide character in print at a.pl line 6.
    0000000 e9 e9 c4 80
    0000004

    If you send text, you need to tell the socket how to convert the text into bytes. This type of serialisation is called character encoding. It can be done by calling encode or by using an :encoding layer.
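
    For a subroutine that is only ever supposed to receive bytes, one possible approach is to check the payload before printing instead of clearing the flag. The sketch below is an assumption about how that could look: send_bytes is a hypothetical helper, and an in-memory filehandle stands in for the socket.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical helper: print $payload to $fh only if it is a string of bytes.
sub send_bytes {
    my ($fh, $payload) = @_;

    # Work on a copy. With a true second argument, utf8::downgrade returns
    # false (instead of dying) when a character above 255 is present.
    my $bytes = $payload;
    utf8::downgrade($bytes, 1)
        or croak 'payload contains characters above 255; encode it first';

    print {$fh} $bytes or croak "print failed: $!";
    return length $bytes;
}

my $out = '';
open(my $fh, '>:raw', \$out) or die "open: $!";

# Flag set, but every character fits in a byte: accepted, sent as e9.
my $latin = "\xE9";
utf8::upgrade($latin);
my $sent = send_bytes($fh, $latin);

# A character above 255 is text, not bytes: rejected before anything is sent.
my $ok = eval { send_bytes($fh, "\x{100}"); 1 };
close($fh);

print $ok ? "accepted\n" : "rejected: $@";
```

    This turns the caller's mistake into an error at the point where it happens, rather than into a warning and unintended bytes on the wire.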