mrd has asked for the wisdom of the Perl Monks concerning the following question:

Honoured Monks,

I'm having trouble with "length" of UTF-8 strings. As far as I can see, "length" correctly returns the length, in bytes, of a hard-coded string, but not of a string read from a file (or STDIN).

The following script correctly prints 2:

use utf8; $x="\x{0103}"; use bytes; print length($x);
but not if $x is read from STDIN:
use utf8;
binmode(STDIN, ":utf8");
$x = <>;
chomp $x;
use bytes;
print length($x);
In this case the output is "1".

My environment: WinXP Home Edition, ActivePerl 5.8.0.

Thanks for any ideas.

mrd

Replies are listed 'Best First'.
Re: length in bytes of utf8 string
by Thelonius (Priest) on Jun 27, 2003 at 10:43 UTC
    Well here's a little gotcha (from perldoc perlunicode):
    Unicode characters can also be added to a string by using the \x{...} notation. The Unicode code for the desired character, in hexadecimal, should be placed in the braces. For instance, a smiley face is \x{263A}. This encoding scheme only works for characters with a code of 0x100 or above.
    You could say $x = pack("U", 0xfc);
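    For instance, a small sketch of the difference (using the core Devel::Peek module; the variable names are just for illustration) should show what happens under 5.8:

    use Devel::Peek;

    $native = "\x{fc}";          # code point below 0x100: stays a one-byte native string
    $packed = pack("U", 0xfc);   # same character, but stored internally as UTF-8

    Dump($native);               # FLAGS = (POK,pPOK),      PV = "\374"
    Dump($packed);               # FLAGS = (POK,pPOK,UTF8), PV = "\303\274" [UTF8 "\x{fc}"]

    use bytes;
    print length($native), " ", length($packed), "\n";   # prints "1 2"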
      Thanks. That should be bolded in perlunicode too.
Re: length in bytes of utf8 string
by graff (Chancellor) on Jun 27, 2003 at 08:48 UTC
    Using Perl 5.8.0 on a SuSE Linux system, I get "2" printed by both versions of the test (hard-coded wide character, and piping a 2-byte utf8 code into stdin).

    How have you made sure that stdin is actually receiving two bytes of character data (not counting the line termination)? I did it as follows:

    perl -e 'binmode( STDOUT, ":utf8" ); print "\x{00A1}\n";' | perl -e 'use utf8; binmode(STDIN, ":utf8"); $x = <>; $x =~ s/[\r\n]+$//; use bytes; print length($x), $/;'
    and this gave me the correct answer (2).
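    To see the actual octets as well, something along these lines (a sketch, assuming the core Encode module) could be piped the same way:

    perl -e 'binmode( STDOUT, ":utf8" ); print "\x{00A1}\n";' |
      perl -MEncode -e 'binmode(STDIN, ":utf8"); $x = <>; $x =~ s/[\r\n]+$//;
        my $octets = Encode::encode_utf8($x);              # re-encode to raw UTF-8 bytes
        printf "%d byte(s): %s\n", length($octets), unpack("H*", $octets);'

    which should print "2 byte(s): c2a1" if the full UTF-8 sequence really arrives on stdin.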
      5.8.0 on Cygwin - the same.
Re: length in bytes of utf8 string
by diotalevi (Canon) on Jun 27, 2003 at 08:40 UTC

    It would be immensely interesting to see what Devel::Peek's Dump($x) shows after $x has been populated.

      Test 1 - char is hard-coded in script:

      use utf8; use Devel::Peek; $x="ü"; #<-- unicode char here print Dump($x); use bytes; print length($x); __END__

      Output (ok):

      SV = PV(0x15d5584) at 0x1a45848
        REFCNT = 1
        FLAGS = (POK,pPOK,UTF8)
        PV = 0x15d91dc "\303\274"\0 [UTF8 "\x{fc}"]
        CUR = 2
        LEN = 3
      2

      Test 2 - char code is hard-coded in script:

      use utf8; use Devel::Peek; $x="\x{00fc}"; print Dump($x); use bytes; print length($x); __END__

      Output (not ok):

      SV = PV(0x15d5584) at 0x1a45848
        REFCNT = 1
        FLAGS = (POK,pPOK)
        PV = 0x15d91dc "\374"\0
        CUR = 1
        LEN = 2
      1

      Test 3 - char is read from a file which contains only one char (0x00fc):

      use utf8;
      use Devel::Peek;
      open(IN, "uni.txt");
      binmode(IN, ":utf8");
      $x = <IN>;
      chomp($x);
      print Dump($x);
      use bytes;
      print length($x);
      __END__

      Output (ok):

      SV = PV(0x15d5584) at 0x1a4583c
        REFCNT = 1
        FLAGS = (POK,pPOK,UTF8)
        PV = 0x1a78eec "\303\274"\0 [UTF8 "\x{fc}"]
        CUR = 2
        LEN = 80
      2

      I could've sworn yesterday that "Test 4" didn't work. I'll have to investigate a little more. What's up with "Test 2"?

      Test 4 - Reading "directly" from STDIN (command prompt) was apparently wrong.

      Thanks,

      mrd

      update: This is weird:

      use utf8;
      use Devel::Peek;
      #$x = "\x{00fc}";   # <-- not ok!!
      #$x = "ü";          # <-- char above. ok
      #$x = "\x{0103}";   # <-- ok
      #$x = "ă";          # char above. ok.
      print Dump($x);
      use bytes;
      print length($x);
      __END__

      I edit my files (text & code) with vim 6.1 and have "encoding=utf-8" set.

        The character 0x00fc is stored as the single byte 0xfc. Writing the code point with leading zeros doesn't make Perl store it any differently; for characters below 0x100 a plain native byte is the correct internal representation, so this is exactly what it's supposed to do.
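        If you want a consistent byte count regardless of how the string was built, one sketch (using utf8::upgrade, which ships with 5.8) is to force the internal UTF-8 representation first:

        $x = "\x{00fc}";      # stored as one native byte, UTF8 flag off
        utf8::upgrade($x);    # switch internal storage to UTF-8 (the characters don't change)
        use bytes;
        print length($x);     # now prints 2, like the hard-coded "\x{0103}" case

        (Encode::encode_utf8($x) would give the same byte count without modifying $x itself.)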