Test 1 - char is hard-coded in script:

use utf8;
use Devel::Peek;
$x="ü";         #<-- unicode char here
print Dump($x);
use bytes;
print length($x);
__END__

Output (ok):

SV = PV(0x15d5584) at 0x1a45848
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x15d91dc "\303\274"\0 [UTF8 "\x{fc}"]
  CUR = 2
  LEN = 3
2

Test 2 - char code is hard-coded in script:

use utf8;
use Devel::Peek;
$x="\x{00fc}";
print Dump($x);
use bytes;
print length($x);
__END__

Output (not ok):

SV = PV(0x15d5584) at 0x1a45848
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x15d91dc "\374"\0
  CUR = 1
  LEN = 2
1

Test 3 - char is read from a file which contains only one char (0x00fc):

use utf8;
use Devel::Peek;
open(IN, "uni.txt") or die "Cannot open uni.txt: $!";
binmode(IN,":utf8");
$x=<IN>;
chomp($x);
print Dump($x);
use bytes;
print length($x);
__END__

Output (ok):

SV = PV(0x15d5584) at 0x1a4583c
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x1a78eec "\303\274"\0 [UTF8 "\x{fc}"]
  CUR = 2
  LEN = 80
2

Test 4 - reading "directly" from STDIN (command prompt): I could have sworn yesterday that this didn't work, but I was apparently wrong. I'll have to investigate a little more.

What's up with "Test 2"?

Thanks,

mrd

update: This is weird:

use utf8;
use Devel::Peek;

#$x="\x{00fc}";    #<-- not ok!!

#$x = "ü";        #<-- char above. ok

#$x="\x{0103}";    #<-- ok

#$x = "ă";        # char above. ok.

print Dump($x);

use bytes;
print length($x);
__END__

I edit my files (text and code) with vim 6.1, with "encoding=utf-8" set.
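For comparison, here is a minimal sketch of one way to get the UTF-8 byte count without depending on how Perl happens to store the string internally. It assumes the goal is "how many bytes would this take when written out as UTF-8". The helper name utf8_byte_length is made up for illustration and is not from the tests above; Encode::encode_utf8, utf8::upgrade and utf8::is_utf8 are standard. The idea rests on the documented behaviour that a code point below 0x100 written as "\x{..}" may be kept as a single byte with no UTF8 flag, and "use bytes" only reports that internal storage.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper (not from the original post): number of bytes the
# string occupies once encoded as UTF-8, whatever Perl stores internally.
sub utf8_byte_length {
    my ($string) = @_;
    return length(encode_utf8($string));
}

my $from_escape = "\x{00fc}";   # kept internally as one byte, UTF8 flag off
my $upgraded    = "\x{00fc}";
utf8::upgrade($upgraded);       # same character, now stored internally as UTF-8

# Compare character count, encoded byte count and the internal flag.
for my $string ($from_escape, $upgraded, "\x{0103}") {
    printf "chars=%d utf8_bytes=%d flag=%s\n",
        length($string),
        utf8_byte_length($string),
        utf8::is_utf8($string) ? "on" : "off";
}
```

All three strings should report one character and two UTF-8 bytes. Only the flag differs for the un-upgraded "\x{00fc}", which would match the FLAGS line in the "Test 2" dump.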

In reply to Re: Re: length in bytes of utf8 string by mrd
in thread length in bytes of utf8 string by mrd
