Re: Re: length in bytes of utf8 string

Test 1 - char is hard-coded in script:

use utf8;
use Devel::Peek;
$x="ü";         #<-- unicode char here
print Dump($x);
use bytes;
print length($x);
__END__

Output (ok):

SV = PV(0x15d5584) at 0x1a45848
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x15d91dc "\303\274"\0 [UTF8 "\x{fc}"]
  CUR = 2
  LEN = 3
2

Test 2 - char code is hard-coded in script:

use utf8;
use Devel::Peek;
$x="\x{00fc}";
print Dump($x);
use bytes;
print length($x);
__END__

Output (not ok):

SV = PV(0x15d5584) at 0x1a45848
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x15d91dc "\374"\0
  CUR = 1
  LEN = 2
1

Test 3 - char is read from a file which contains only one char (0x00fc):

use utf8;
use Devel::Peek;
open(IN, "uni.txt");
binmode(IN,":utf8");
$x=<IN>;
chomp($x);
print Dump($x);
use bytes;
print length($x);
__END__

Output (ok):

SV = PV(0x15d5584) at 0x1a4583c
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x1a78eec "\303\274"\0 [UTF8 "\x{fc}"]
  CUR = 2
  LEN = 80
2

I could've sworn yesterday that "Test 4" doesn't work. Have to investigate a little more. What's up with "Test 2"?

Test 4 - Reading "directly" from STDIN (command prompt) was apparently wrong.

Thanks,

mrd

update: This is weird:

use utf8;
use Devel::Peek;

#$x="\x{00fc}";    #<-- not ok!!

#$x = "ü";        #<-- char above. ok

#$x="\x{0103}";    #<-- ok

#$x = "ă";        # char above. ok.

print Dump($x);

use bytes;
print length($x);
__END__

I edit my files (text & code) with vim 6.1. Have "encoding=utf-8".

Re: Re: Re: length in bytes of utf8 string by diotalevi (Canon) on Jun 27, 2003 at 10:06 UTC
The character 0x00fc is stored as the single byte 0xfc. Just because you like to write it with leading zeros doesn't mean it should be stored that way; that would be the wrong way to store it. It's supposed to do that.
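A minimal sketch of the storage difference, for anyone who wants to see it without Devel::Peek. It assumes a reasonably recent Perl where utf8::is_utf8 is available; the variable names $latin1 and $wide are only illustrative. A code point below 256 written as an escape keeps the one-byte storage, while a code point above 255 forces the UTF8 flag on.

```perl
use strict;
use warnings;

# Illustrative names; not from the original post.
my $latin1 = "\x{00fc}";   # code point 252, fits in a single byte
my $wide   = "\x{0103}";   # code point 259, cannot fit in a single byte

# utf8::is_utf8 reports whether the internal UTF8 flag is set on the scalar.
print "U+00FC flagged: ", (utf8::is_utf8($latin1) ? "yes" : "no"), "\n";
print "U+0103 flagged: ", (utf8::is_utf8($wide)   ? "yes" : "no"), "\n";
```

This matches the FLAGS lines in the Dump output of Test 2: no UTF8 flag, one byte of storage.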
Re: Re: Re: Re: length in bytes of utf8 string by mrd (Beadle) on Jun 27, 2003 at 10:21 UTC
I don't understand what you mean. Leading zeros or not, the length is wrong!
Re: Re: Re: Re: Re: length in bytes of utf8 string by diotalevi (Canon) on Jun 27, 2003 at 10:49 UTC
That's because "\374" eq "\xfc" eq "\x{00fc}" eq chr(252) eq chr(0xfc) eq chr(0374). That character is one byte long.
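If what is wanted is the length of the UTF-8 encoding rather than the length of whatever Perl happens to store internally, one possible approach is to encode explicitly and count the result. This is only a sketch: it assumes the core Encode module, and $native and $upgraded are made-up names. utf8::upgrade changes the internal storage without changing the character, which is why bytes::length differs between the two scalars while the encoded length does not.

```perl
use strict;
use warnings;
use bytes ();               # load bytes:: functions without enabling the pragma
use Encode qw(encode);

# Both scalars hold the single character U+00FC; only the storage differs.
my $native   = "\x{00fc}";  # stored as one byte, no UTF8 flag
my $upgraded = "\x{00fc}";
utf8::upgrade($upgraded);   # same character, now stored internally as UTF-8

# Character length is the same for both.
print length($native), " ", length($upgraded), "\n";

# bytes::length looks at the internal storage, so it depends on the flag.
print bytes::length($native), " ", bytes::length($upgraded), "\n";

# Encoding explicitly gives the UTF-8 byte count regardless of the flag.
print length(encode('UTF-8', $native)), " ",
      length(encode('UTF-8', $upgraded)), "\n";
```

The explicit encode step is the design choice here: it asks for a specific encoding instead of relying on how the string was created.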
Re: Re: Re: Re: Re: Re: length in bytes of utf8 string by mrd (Beadle) on Jun 27, 2003 at 11:08 UTC
Re: Re: Re: Re: Re: Re: Re: length in bytes of utf8 string by diotalevi (Canon) on Jun 27, 2003 at 11:10 UTC