saintmike has asked for the wisdom of the Perl Monks concerning the following question:

You're familiar with the use bytes pragma, right? Without it, perl operates on Unicode characters, as in
# prints '1'
print length("\x{03c5}"), "\n";
while with use bytes, it falls back to byte semantics:
# prints '2'
use bytes;
print length("\x{03c5}"), "\n";
Now what if I have a module Foo.pm that does a simple calculation:
package Foo;
sub len { return length("\x{03c5}"); }
1;
and I want to impose use bytes semantics on it without modifying its code? Things like
BEGIN { package Foo; use bytes; }
use Foo;
package main;
print Foo::len(), "\n";
won't work because use bytes only changes behaviour within its own lexical scope, and the code in Foo.pm is compiled outside that scope. Ideas, anyone?
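To make the lexical scoping concrete, here is a minimal sketch (the variable names are only for illustration). It shows that use bytes stops applying at the closing brace of the block it appears in, and that bytes::length can be called explicitly from code where the pragma is not in effect.

```perl
use strict;
use warnings;

my $s = "\x{03c5}";

my $chars = length($s);            # character semantics: 1

my $bytes;
{
    use bytes;                     # in effect only until the closing brace
    $bytes = length($s);           # byte semantics: 2
}

my $after = length($s);            # character semantics again: 1

# The bytes module was loaded by the "use bytes" above, so its functions
# can be called by name even where the pragma itself is not in scope.
my $explicit = bytes::length($s);  # 2

print "$chars $bytes $after $explicit\n";   # prints "1 2 1 2"
```

This is also why the BEGIN block in the question has no effect on Foo.pm: the pragma is tied to the block of source text it is written in, not to the package named there.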

Replies are listed 'Best First'.
Re: impose 'use bytes' on another package
by ikegami (Patriarch) on Apr 06, 2006 at 06:16 UTC

    In a manner of thinking, Perl has two kinds of strings: strings of characters and strings of bytes. It seems your len function expects to be working on strings of bytes, yet you have a string of characters (since 0x03C5 is outside the range of a byte). Why don't you convert your string of characters into a string of bytes?

    Converting from characters to bytes is known as "encoding", and Encode is the module to do it. The question you have to answer is: Which encoding do you wish to use? You could, for example, encode using utf8:

    $octets = encode("utf8", $string);
    In context, we get:
    use Encode qw( encode );

    # Render a string as a double-quoted Perl literal, escaping
    # non-printable and non-ASCII characters as \x{...}.
    sub string_to_literal {
        local $_ = @_ ? $_[0] : $_;
        s/(.)/
            my $o = ord($1);
            if    ($1 eq '"' )              { '\\"' }
            elsif ($1 eq '\\' )             { '\\\\' }
            elsif ($o < 0x20 || $o >= 0x7F) { sprintf('\\x{%X}', $o) }
            else                            { $1 }
        /eg;
        return qq{"$_"};
    }

    # Render a byte string as space-separated hex octets.
    sub octet_dump {
        return join ' ',
            map { sprintf('%02X', ord($_)) }
            map /(.)/g, @_ ? $_[0] : $_;
    }

    $string = "\x{03c5}";
    print("\$string is ", length($string), " chars long: ");
    print(string_to_literal($string), "\n");

    $octets = encode("utf8", $string);
    print("\$octets is ", length($octets), " bytes long: ");
    print(octet_dump($octets), "\n");

    outputs

    $string is 1 chars long: "\x{3C5}"
    $octets is 2 bytes long: CF 85

    Both $string and $octets contain "υ", except the character is in Perl's internal character format in $string and in utf8 in $octets.
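The choice of encoding matters because the byte length depends on it. A small sketch, assuming the Encode module shipped with perl and using three encodings picked purely for illustration:

```perl
use strict;
use warnings;
use Encode qw( encode decode );

my $string = "\x{03c5}";                     # one character

# The same character takes a different number of bytes per encoding.
my $utf8  = encode("UTF-8",      $string);
my $utf32 = encode("UTF-32BE",   $string);
my $greek = encode("iso-8859-7", $string);   # single-byte Greek charset

# Decoding turns the bytes back into a string of characters.
my $round = decode("UTF-8", $utf8);

printf "%d %d %d %d\n",
    length($utf8), length($utf32), length($greek), length($round);
# prints "2 4 1 1"
```

So "the length in bytes" is only well defined once an encoding has been chosen; use bytes answers a different question, namely how many bytes perl happens to use internally.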

      It seems your len function expects to be working on strings of bytes ...
      Actually, it's the other way around, but I wasn't interested in dynamically converting bytes to characters or vice versa.

      I was thinking it should be possible (without resorting to dirty tricks like eval-ing the code) to switch a separate module between Unicode string and byte string interpretation at run time (or at least at compile time), without modifying the module code.

Re: impose 'use bytes' on another package
by codeacrobat (Chaplain) on Apr 05, 2006 at 23:01 UTC
    Let's try a simpler problem first: evaluating the main code of Foo.pm as a string.
    $ perl -e 'use bytes; eval q(package Foo; sub len{ length "\x{03c5}"});print Foo::len()'
    2
    I always thought do "Foo.pm" is the same as eval `cat Foo.pm`. But
    $ perl -e 'use bytes; do "Foo.pm";print Foo::len()'
    1
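The difference between the two one-liners comes down to scope: a string eval is compiled inside the lexical scope of the code that calls it, so it inherits use bytes, while do FILE compiles the file as a separate unit that does not see the caller's lexical pragmas. A self-contained sketch that reproduces both results, writing a throwaway copy of the module to a temporary file (File::Temp is used only so the example runs anywhere):

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# A throwaway copy of the module from the question.
my $source = 'package Foo; sub len { return length("\x{03c5}"); } 1;';
my ($fh, $path) = tempfile(SUFFIX => '.pm', UNLINK => 1);
print {$fh} $source;
close $fh or die "close: $!";

my ($via_do, $via_eval);
{
    use bytes;

    # do FILE: compiled in its own scope, "use bytes" is not inherited.
    do $path or die "do failed: ", $@ || $!;
    $via_do = Foo::len();

    # String eval: compiled in this scope, so "use bytes" applies.
    no warnings 'redefine';        # Foo::len is deliberately redefined
    eval $source or die $@;
    $via_eval = Foo::len();
}

print "do: $via_do, eval: $via_eval\n";   # prints "do: 1, eval: 2"
```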
      Oops, forgot to post the workaround. All you have to do is get rid of the 1; in Foo.pm
      and eval the remaining content of the (no longer) module.
      $code = `cat Foo.pm`;$code =~ s/\n1;//s; eval $code;
      Use it if a quick 'n' dirty solution is right for you. Otherwise I hope other monks come up with a cleaner solution.

        You don't have to remove the trailing true value, and it would have been nicer if you had avoided calling out to the shell and cat when there are perfectly good Perl functions for such a thing. This is also a quick and dirty solution, but it isn't as craptacular as yours.

        local @ARGV = "Foo.pm"; # TODO: make this search @INC
        local $/;
        # The #line directive makes errors report Foo.pm and its real line numbers.
        eval "#line 1 \"Foo.pm\"\nuse bytes; " . <>;
        die $@ if $@;
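One way the TODO might be filled in is sketched here. The helper name load_with_bytes is hypothetical, and the temporary Foo.pm at the end exists only to make the example runnable. Note the assumptions: the module is pure Perl, its path contains no double quotes, and it tolerates being compiled under this file's strict and warnings, which a string eval passes on along with use bytes.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw( tempdir );

# Hypothetical helper: locate a module in @INC, prepend "use bytes;",
# and compile the result with a string eval.
sub load_with_bytes {
    my ($module) = @_;                        # e.g. "Foo" or "Foo::Bar"
    (my $rel = "$module.pm") =~ s{::}{/}g;

    for my $dir (grep { !ref } @INC) {        # skip code-ref hooks in @INC
        my $path = File::Spec->catfile($dir, $rel);
        next unless -f $path;

        open my $fh, '<', $path or die "Cannot open $path: $!";
        my $code = do { local $/; <$fh> };    # slurp the whole file
        close $fh;

        # "use bytes;" shares line 1 with the module's first line, so
        # line numbers in error messages still match the file.
        my $ok = eval qq{#line 1 "$path"\nuse bytes; $code};
        die $@ if $@;

        $INC{$rel} = $path;                   # a later "use Foo" is a no-op
        return $ok;
    }
    die "Can't locate $rel in \@INC";
}

# Usage with a throwaway Foo.pm in a temporary directory.
my $dir = tempdir(CLEANUP => 1);
open my $out, '>', File::Spec->catfile($dir, 'Foo.pm') or die "open: $!";
print {$out} 'package Foo; sub len { return length("\x{03c5}"); } 1;';
close $out or die "close: $!";

unshift @INC, $dir;
load_with_bytes('Foo');
print Foo::len(), "\n";                       # prints "2"
```

The pragma reaches only the file that is eval-ed this way. Anything that module loads with its own use or require is compiled normally, without use bytes.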

        ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

        Hardly practical if the module is a few thousand lines long, contains XS code and pulls in tons of other modules.

        CountZero

        "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law