William G. Davis has asked for the wisdom of the Perl Monks concerning the following question:

Update 12/7/03:

Apparently, the POE people had to deal with this very same problem. See how they did it.


Hi Monks.

At the moment, I'm working on some network libraries and attempting to add Unicode support to them. My problem is that Perl's system IO functions (the ones I'm using are sysread() and syswrite()) all take the length to read or write in bytes, and yet I can't seem to find any portable way to get that information.

As of 5.6.1, strings are stored internally as UTF-8 and all built-in functions that purport to operate on characters do operate on characters; notably length(), which now returns the length in characters rather than the length in bytes.

To force length() to return the length in bytes, perlunicode says you can use the bytes pragma, as this example illustrates:

#!/usr/bin/perl -w
use 5.6.1;
use strict;

# three smiley faces:
my $string = "\x{263a}\x{263a}\x{263a}";

printf("%s: %d characters\n", $string, length $string);

{
    use bytes;
    printf("%s: %d bytes\n", $string, length $string);
}

That's all well and good, but unfortunately, the bytes pragma was introduced as of 5.6.1, and my libraries are supposed to support perl back to 5.005. I can't wrap "use bytes;" in an eval block, so what can I do?

Re: Perl + Unicode == Networking Woes
by liz (Monsignor) on Nov 24, 2003 at 20:48 UTC
    You could create your own "bytes" pragma for older versions, so that you don't get any errors while executing with these older versions of Perl.

    BEGIN {
        unless (eval {require bytes}) {
            $INC{'bytes.pm'} = 1;
            eval "sub bytes::unimport{ undef }"; # 5.00503 doesn't inherit UNIVERSAL::unimport
        }
    }

    use bytes; # always works
    no bytes;  # also always works ;-)

    Liz
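For what it's worth, here is a minimal sketch of how the shim above might be combined with a length helper. The name byte_length() is made up for illustration and is not part of any module. Note that under "use bytes", length() reports the size of Perl's internal representation of the string, so treat this as a sketch rather than a definitive answer for every kind of string.

```perl
#!/usr/bin/perl -w
use strict;

# The shim from the reply above: make "use bytes" / "no bytes" harmless
# on perls that ship without bytes.pm.
BEGIN {
    unless (eval { require bytes }) {
        $INC{'bytes.pm'} = 1;
        eval "sub bytes::unimport { undef }";
    }
}

# byte_length() is a hypothetical helper name, used here for illustration.
sub byte_length {
    use bytes;    # the real pragma where it exists, the stub otherwise
    return length $_[0];
}

# chr(0x263a) avoids the \x{...} escape, which pre-5.6 perls do not parse.
my $smileys = chr(0x263a) x 3;
printf("%d bytes\n", byte_length($smileys));
```

Because the pragma is lexically scoped to the sub body, the rest of the library keeps character semantics for length().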

      Very nice trick, liz. I am learning something new every day! :-D

      This works perfectly! Thank you!

Re: Perl + Unicode == Networking Woes
by Roy Johnson (Monsignor) on Nov 24, 2003 at 20:49 UTC
    Do you need to know the length, or can you just read chunks until it's all read in? And as for writing, you don't have to specify a length. If you don't, it will write the whole scalar. (perldoc -f syswrite)

    I guess reading could be a problem, if you end up reading a half-character. If you use read instead, I think you'll be golden.
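A rough sketch of the "read chunks until it's all read in" idea, using sysread() with its OFFSET argument. The helper name read_all() and the 4096-byte default are assumptions for illustration. It also assumes a blocking handle that delivers raw bytes, and it does not retry interrupted system calls.

```perl
use strict;

# read_all() is a hypothetical helper: it drains a handle in fixed-size
# chunks, so the caller never needs to know the total length up front.
sub read_all {
    my ($fh, $chunk_size) = @_;
    $chunk_size ||= 4096;    # assumed default, tune as needed

    my $buffer = '';
    while (1) {
        # The fourth argument (OFFSET) appends at the current end of $buffer.
        my $got = sysread($fh, $buffer, $chunk_size, length $buffer);
        die "sysread failed: $!" unless defined $got;
        last if $got == 0;    # 0 means end of file
    }
    return $buffer;
}
```

A chunk boundary can fall in the middle of a multi-byte character, so any decoding is best left until the whole message has arrived.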


    The PerlMonk tr/// Advocate
      UPDATE:

      No. The size argument to syswrite is only optional as of 5.6.1. And besides, length() is used in many other places (for example, to generate a Content-Length-style header).

Somewhat off topic, but...
by William G. Davis (Friar) on Nov 24, 2003 at 21:21 UTC

    Can someone please explain to me why they never added a sizeof() function to perl in 5.6.1? I mean, doesn't Perl know the size of an SV? Wouldn't that have been simpler? A function call can easily be wrapped in an eval block, as opposed to use bytes, which can't.
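One possible way to approximate such a function, offered only as a sketch: a block eval cannot hide "use bytes" from the compiler, but a string eval is compiled at run time, so the pragma can be loaded inside it. The name sizeof() is hypothetical (Perl has no such builtin), and the fallback assumes that on a perl without the pragma length() already counts bytes.

```perl
use strict;

# The string eval defers compiling "use bytes" until run time, so a perl
# without bytes.pm fails inside the eval instead of at compile time.
my $sizeof = eval q{
    use bytes;
    sub { length $_[0] };
};

# Fallback for perls without the pragma, where length() counts bytes anyway.
$sizeof ||= sub { length $_[0] };

# sizeof() is a hypothetical name chosen for this example.
sub sizeof { return $sizeof->($_[0]) }
```

The anonymous sub is compiled while the pragma is in effect, so it keeps byte semantics no matter where it is called from.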

Re: Perl + Unicode == Networking Woes
by Roger (Parson) on Nov 24, 2003 at 20:36 UTC
    Why not replace your sysread and syswrite calls with read and write instead?

    Updated: Oops, I have realized my silly mistake on the syswrite/write bit. Thanks to the monks who pointed it out. I guess it's still too early in the morning, and I need more coffee. :-p

      Because read() is almost just a buffered sysread() (perldoc -f read):

      read FILEHANDLE,SCALAR,LENGTH,OFFSET
      read FILEHANDLE,SCALAR,LENGTH
              Attempts to read LENGTH bytes of data into variable SCALAR from
              the specified FILEHANDLE. Returns the number of bytes actually
              read, "0" at end of file, or undef if there was an error. SCALAR
              will be grown or shrunk to the length actually read. If SCALAR
              needs growing, the new bytes will be zero bytes. An OFFSET may
              be specified to place the read data into some other place in
              SCALAR than the beginning. The call is actually implemented in
              terms of stdio's fread(3) call. To get a true read(2) system
              call, see "sysread".
      

      and write() is absolutely nothing like syswrite(). It's used for printing formats (perldoc -f write):

      write FILEHANDLE
      write EXPR
      write   Writes a formatted record (possibly multi-line) to the specified
              FILEHANDLE, using the format associated with that file. By
              default the format for a file is the one having the same name as
              the filehandle, but the format for the current output channel
              (see the "select" function) may be set explicitly by assigning
              the name of the format to the "$~" variable.
      
              Top of form processing is handled automatically: if there is
              insufficient room on the current page for the formatted record,
              the page is advanced by writing a form feed, a special
              top-of-page format is used to format the new page header, and
              then the record is written. By default the top-of-page format is
              the name of the filehandle with "_TOP" appended, but it may be
              dynamically set to the format of your choice by assigning the
              name to the "$^" variable while the filehandle is selected. The
              number of lines remaining on the current page is in variable
              "$-", which can be set to "0" to force a new page.
      
              If FILEHANDLE is unspecified, output goes to the current default
              output channel, which starts out as STDOUT but may be changed by
              the "select" operator. If the FILEHANDLE is an EXPR, then the
              expression is evaluated and the resulting string is used to look
              up the name of the FILEHANDLE at run time. For more on formats,
              see the perlform manpage.
      
              Note that write is *not* the opposite of "read". Unfortunately.
      
      

      And regardless, write() too uses Perl's stdio buffering, which means even if it did what its name implies, it couldn't be used for robust network IO.
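For comparison, a minimal sketch of an unbuffered write loop built on syswrite(), which may write fewer bytes than requested on a socket. The helper name write_all() is made up for illustration. It assumes a blocking handle and that $data already holds raw bytes, so length() and syswrite() agree on units.

```perl
use strict;

# write_all() is a hypothetical helper that keeps calling syswrite()
# until every byte of $data has been handed to the system.
sub write_all {
    my ($fh, $data) = @_;
    my $total  = length $data;    # assumes $data contains only bytes
    my $offset = 0;

    while ($offset < $total) {
        # LENGTH is passed explicitly because older perls require it.
        my $wrote = syswrite($fh, $data, $total - $offset, $offset);
        die "syswrite failed: $!" unless defined $wrote;
        $offset += $wrote;
    }
    return $offset;
}
```

Something like write_all($socket, $message) would then stand in for a bare syswrite() call, where $socket and $message are whatever handle and byte string the library already has.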