in reply to use bytes and length problem

I suspect the problem lies in the way your $txt is created. I wrote this short demo to show different ways of forming the string, and "use bytes" works in every case.

Hope this helps:

use strict;

sub display {
    my $string = shift;

    use utf8;   # As the results show, whether utf8 or bytes is in effect is
                # irrelevant in this demo, as "U*" forces Unicode anyway.
    print "\nchar semantics: ";
    print "$string ";
    printf "Length = %d, ", length($string);
    printf "Content = %vd\n", $string;

    use bytes;
    print "byte semantics: ";
    print "$string ";
    printf "Length = %d, ", length($string);
    printf "Content = %vd\n", $string;
}

my $encoded_string;
my @decoded_list;

{
    use bytes;
    print "=========================\n";
    print "Case 1: create string from pack, with use bytes\n";
    $encoded_string = pack("U*", 400, 306);
    display $encoded_string;
    @decoded_list = unpack("U*", $encoded_string);
    print join(".", @decoded_list), "\n";
}

{
    use utf8;   # not necessary in this case
    print "=========================\n";
    print "Case 2: create string from pack, with use utf8\n";
    $encoded_string = pack("U*", 400, 306);
    display $encoded_string;
    @decoded_list = unpack("U*", $encoded_string);
    print join(".", @decoded_list), "\n";
}

{
    print "=========================\n";
    print "Case 3: create string from \\x{}\n";
    $encoded_string = "\x{190}\x{132}";   # hex values of 400 and 306
    display $encoded_string;
    @decoded_list = unpack("U*", $encoded_string);
    print join(".", @decoded_list), "\n";
}
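
As a cross-check on the byte values the demo prints under "use bytes", here is a minimal sketch that derives the UTF-8 bytes of 400 and 306 two ways: through the Encode module and by hand with the two-byte UTF-8 bit pattern. It assumes Perl 5.8 or later (where Encode is a core module). The helper names utf8_bytes and two_byte_utf8 are made up for this illustration and are not part of the demo above.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: UTF-8 bytes of one code point, joined with dots
# in the same style that printf "%vd" uses in the demo.
sub utf8_bytes {
    my ($code_point) = @_;
    my $octets = encode('UTF-8', chr($code_point));
    return join '.', unpack('C*', $octets);
}

# Hypothetical helper: manual two-byte encoding, 110xxxxx 10xxxxxx.
# Only valid for code points in the range U+0080..U+07FF.
sub two_byte_utf8 {
    my ($cp) = @_;
    return join '.', 0xC0 | ($cp >> 6), 0x80 | ($cp & 0x3F);
}

for my $cp (400, 306) {
    printf "U+%04X: Encode gives %s, by hand %s\n",
        $cp, utf8_bytes($cp), two_byte_utf8($cp);
}
```

Both routes should give 198.144 for 400 and 196.178 for 306, which is why a two-character string shows a length of 4 under byte semantics.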

Re: Re: use bytes and length problem
by Hofmator (Curate) on Mar 02, 2003 at 22:07 UTC
    For those of you who are too lazy to run pg's code, here's the output ;-)
    =========================
    Case 1: create string from pack, with use bytes

    char semantics: IJ Length = 4, Content = 198.144.196.178
    byte semantics: IJ Length = 4, Content = 198.144.196.178
    400.306
    =========================
    Case 2: create string from pack, with use utf8

    char semantics: IJ Length = 4, Content = 198.144.196.178
    byte semantics: IJ Length = 4, Content = 198.144.196.178
    400.306
    =========================
    Case 3: create string from \x{}

    char semantics: IJ Length = 2, Content = 400.306
    byte semantics: IJ Length = 4, Content = 198.144.196.178
    400.306

    Update: I'm on Perl 5.6.0 on Solaris, so it's probably my own problem ;-). Full spec:

    -- Hofmator

      Now this is getting interesting :-). When I ran my code, I got the output below. (I am using ActiveState Perl 5.8.0; the test code for Case 4 is at the end of this post.)

      =========================
      Case 1: create string from pack, with use bytes
      
      char semantics: ƐIJ Length = 2, Content = 400.306
      byte semantics: ƐIJ Length = 4, Content = 198.144.196.178
      198.144.196.178
      =========================
      Case 2: create string from pack, with use utf8
      
      char semantics: ƐIJ Length = 2, Content = 400.306
      byte semantics: ƐIJ Length = 4, Content = 198.144.196.178
      400.306
      =========================
      Case 3: create string from \x{}
      
      char semantics: ƐIJ Length = 2, Content = 400.306
      byte semantics: ƐIJ Length = 4, Content = 198.144.196.178
      400.306
      =========================
      Case 4: read string from unicode file
      
      char semantics: 裴佳谷
       Length = 4, Content = 35060.20339.35895.10
      byte semantics: 裴佳谷
       Length = 10, Content = 232.163.180.228.189.179.232.176.183.10
      
      I also want to add a case covering the situation where you read your string from a file:

      {
          print "=========================\n";
          print "Case 4: read string from utf8 file\n";
          # Fail loudly if the file is missing instead of reading from a closed handle
          open(FILE, "<:utf8", "test.txt") or die "Cannot open test.txt: $!";
          $encoded_string = <FILE>;
          close(FILE);
          display $encoded_string;
      }
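
For anyone who wants to reproduce Case 4 without preparing test.txt by hand, here is a self-contained sketch. It assumes Perl 5.8 or later (for PerlIO layers) and uses the core File::Temp module. The temporary file and the variable names are made up for this illustration. It writes the three characters from the output above plus a newline as raw UTF-8 bytes, reads them back through a decoding layer, and compares the character length with the byte length.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write three CJK characters plus a newline as raw UTF-8 bytes
# (232.163.180, 228.189.179, 232.176.183, 10) to a temporary file.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\xE8\xA3\xB4\xE4\xBD\xB3\xE8\xB0\xB7\n";
close $out or die "close failed: $!";

# Read the line back through a decoding layer, so the string holds characters.
open my $in, '<:encoding(UTF-8)', $path or die "Cannot open $path: $!";
my $line = <$in>;
close $in;

my $chars = length $line;                      # character semantics
my $bytes = do { use bytes; length $line };    # byte semantics, scoped to this block

print "chars = $chars, bytes = $bytes\n";      # prints "chars = 4, bytes = 10"
```

The counts match the Case 4 output: four characters (three ideographs and the newline) stored in ten bytes. Note that the bytes pragma documentation discourages its use outside of debugging, so this is only meant to mirror the demo.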

        These are the results from my tests using 5.6.1. length seems to work fine for me whether use utf8 or use bytes was in force, but someone mentioned earlier in the thread that there was a known problem with use bytes in 5.6, so I tried unpack with 'C*', which the docs cite as explicitly bypassing the Unicode handling.

        #! perl -sw
        use strict;
        use LWP::Simple;

        my $content = get( 'http://www.columbia.edu/kermit/utf8.html' );

        {
            use utf8;
            my $c_len   = length $content;
            my @c_bytes = unpack 'C*', $content;
            my @c_chars = unpack 'U*', $content;
            print "Charwise - length:$c_len; 'C*':", scalar @c_bytes,
                "; 'U*':", scalar @c_chars, $/;
        }

        {
            use bytes;
            my $b_len   = length $content;
            my @b_bytes = unpack 'C*', $content;
            my @b_chars = unpack 'U*', $content;
            print "Bytewise - length:$b_len; 'C*':", scalar @b_bytes,
                "; 'U*':", scalar @b_chars, $/;
        }

        {
            open JUNK, '>', 'junk' or die $!;
            binmode(JUNK);
            print JUNK $content;
            close JUNK;
            print 'Actual (from os): ', -s 'junk', $/;
        }

        __END__
        C:\test>239788
        Charwise - length:31946; 'C*':31946; 'U*':28621
        Bytewise - length:31946; 'C*':31946; 'U*':28621
        Actual (from os): 31946

        Examine what is said, not who speaks.
        1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
        2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
        3) Any sufficiently advanced technology is indistinguishable from magic.
        Arthur C. Clarke.
        I got this too, under RH8